Iterating Strings with Regex in JavaScript

Demonstrating matchAll by reformatting Markdown with named capture groups

Regex iteration is a key tool for reformatting content for modern apps. Combining regular expressions with iteration lets application builders transform content from one format into another, which ultimately benefits the end-user experience. So much so that it is now built into JavaScript through the relatively new matchAll function, which this article explores in detail.

This article builds upon a previous introductory piece on regular expressions in JavaScript. If you’re new to Regex and would like a grounding in the foundations this piece builds on, check out that article first: Regular Expressions in JavaScript: An Introduction.

So why are matchAll’s iteration abilities in conjunction with Regex a big deal? Well, let’s take a look at some high-level use cases of today, before delving into the code in more detail.

Why is iteration with Regex important — and useful?

If you need to reformat text content from one format to another — such as taking Markdown or HTML and transforming it into readable text, or even mobile components — then Regex iteration is the solution you need to process that transformation.

Take some scenarios that have become common in modern app and web development:

  • Reformatting legacy content. Think of taking an old archive of HTML-formatted articles and rendering it inside a React Native application with native components. As React Native is a library of components such as <View> and <Text>, and not tags, the HTML tags within each article need to be replaced with these components. This requires a total transformation of that content, from a string to an array of components. Sure, you could simply render the HTML content from a webpage in a WebView component, but this is far inferior (in all aspects) to native integration.
  • Expanding metadata. Being able to tap or click certain text content to bring up more metadata about the text in question — a powerful concept. Stock trading publications do this by wrapping stock indices with API endpoints for fetching the live price. Dictionary apps do this by wrapping certain words with definitions and other meta-characteristics. This can be done with HTML, Markdown, or even in-house formatting rules designed to be transformed via Regex.
  • Formatting character-based scripts such as Chinese characters or Japanese Kanji. For example, it is common for written content in Chinese to display Pinyin above each character in educational settings, or even just the tone of the character.
  • Mathematical equations. It is hard to display complex equations with simple text input. Iterating through a bulk of text and reformatting the equations into the standard scientific format is a great use case of iterative regex.

The above scenarios entail quite dramatic changes from the original content to the transformed content, oftentimes being a totally different format, such as the case with transforming the string into an array of components for mobile apps.

Let’s next look at how this is achieved, with the matchAll JavaScript API.

Introduction to Iteration with matchAll

matchAll is a Regex-centric string method that matches a string against a regular expression and returns the matches as an iterable. It takes a regular expression as its only argument, and that expression must carry the g (global) flag, otherwise a TypeError is thrown.

An “iterable” is a piece of data that can be iterated through. Arrays, maps, sets, and strings are built-in iterables of JavaScript. Unlike matchAll, match called with a global regular expression simply returns the complete matches as a conventionally indexed array, without capture groups or match positions.
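To make the contrast concrete, here is a small sketch comparing the two methods on the same input. The sample string and the variable names are made up for illustration.

```javascript
// Hypothetical sample data, used only for illustration.
const sample = "a1 b2 c3";
const pairRegex = /([a-z])(\d)/g;

// match with the g flag: an array of full matches only; capture groups are dropped.
const viaMatch = sample.match(pairRegex);
console.log(viaMatch); // [ 'a1', 'b2', 'c3' ]

// matchAll: an iterator whose items keep their capture groups and index.
const viaMatchAll = [...sample.matchAll(pairRegex)];
const summary = viaMatchAll.map((m) => ({
  full: m[0],
  letter: m[1],
  digit: m[2],
  index: m.index,
}));
console.log(summary);
```

Each entry of summary carries the pieces that match on its own would have discarded, which is what makes position-aware reformatting possible.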

matchAll was introduced in the 2020 edition of the ECMAScript standard (ES2020), and has seen modest browser support since its inception. At the time of writing, Edge, IE and Safari on iOS still lag behind in supporting matchAll.

If you require universal browser support, however, there are polyfills available. Node.js also supports matchAll natively from version 12 onwards. With the API here to stay, now is a great time to explore its capabilities.
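For environments without native support, the general idea behind a fallback can be sketched with a classic exec loop. This is a rough illustration rather than a spec-compliant polyfill: the helper names execAll and matchAllCompat are hypothetical, and the code assumes the regex carries the g flag and that its lastIndex is 0 on entry.

```javascript
// Hypothetical fallback: collects matches with a classic exec loop.
// Assumes regex has the g flag and regex.lastIndex is 0 on entry.
function execAll(str, regex) {
  const results = [];
  let match;
  while ((match = regex.exec(str)) !== null) {
    results.push(match);
    // Step past zero-length matches to avoid an infinite loop.
    if (match[0] === "") regex.lastIndex += 1;
  }
  return results; // exec resets lastIndex to 0 once it returns null
}

// Hypothetical wrapper: prefers the native method when it exists.
function matchAllCompat(str, regex) {
  if (!regex.global) {
    throw new TypeError("matchAllCompat expects a regular expression with the g flag");
  }
  return typeof String.prototype.matchAll === "function"
    ? Array.from(str.matchAll(regex))
    : execAll(str, regex);
}

const compatMatches = matchAllCompat("x1 y2", /([a-z])(\d)/g);
console.log(compatMatches.map((m) => m[0])); // [ 'x1', 'y2' ]
```

Unlike the native method, this sketch returns an array rather than a lazy iterator, which is usually acceptable for short strings.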

Once matchAll is executed on a string, an iterator is returned that allows us to loop through the resulting matching items:

const matches = myString.matchAll(regex);
for (const match of matches) {
  ...
}

Concretely, matchAll returns an iterable containing the resulting matches against the supplied regular expression. The contents of each match, however, are dependent on the regular expression. Let’s explore this further.
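One practical consequence of receiving an iterator rather than an array is that it can only be walked once. The sketch below, using a made-up sample string, shows this and one common way around it.

```javascript
const text = "one two three"; // hypothetical sample
const wordMatches = text.matchAll(/\w+/g);

// First pass consumes the iterator.
const firstPass = [];
for (const m of wordMatches) firstPass.push(m[0]);

// The iterator is now exhausted, so a second loop sees nothing.
const secondPass = [];
for (const m of wordMatches) secondPass.push(m[0]);

// Spreading into an array up front allows repeated access.
const asArray = [...text.matchAll(/\w+/g)];

console.log(firstPass);      // [ 'one', 'two', 'three' ]
console.log(secondPass);     // []
console.log(asArray.length); // 3
```

If the matches are needed more than once, converting them with the spread operator or Array.from is usually the simplest option.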

How a Match is Structured

Each match is an array that contains an arbitrary number of elements depending on the supplied Regex of matchAll:

  • The first element at index 0 contains the entire matching text.
  • The subsequent elements contain the individual matches of each capture group of the regular expression — if capture groups exist.
  • A match.index property is also given, providing the position in the string where the match began.
  • Where named capture groups are present in a regular expression, match groups those, too, in the match.groups property. We’ll use named capture groups further down, as it requires more syntax within the regular expression itself. The following explanation uses the default numbered capture groups, where each group is simply indexed as array elements.

To illustrate match with a minimal example, let’s take a regular expression with two capture groups, that will match any text within two square bracket enclosures:

// a regular expression testing two capture groups
// (note the literal space between the two bracketed groups)
const regexp = /match \[(.*?)\] \[(.*?)\]/ig;
const myString = "Testing a match [group 1] [group 2]";

The portion match [group 1] [group 2] of the string will be the only match of this example, so we can expect exactly one match in the resulting iterable. Let’s run matchAll on this string now, and extract each piece of data from the resulting match:

const matches = myString.matchAll(regexp);
for (const match of matches) {

  // the complete match
  const fullMatch = match[0];

  // isolating each capture group match
  const group1 = match[1];
  const group2 = match[2];

  // index of where the match starts
  const cursorPos = match.index;
}

I’ve termed the index property of the match cursorPos, so there is no confusion between an array index and the index at which the match was found. You can think of match.index as being the position a cursor would be before the matched result is typed out. cursorPos will become important further down when we reformat an entire bulk of text from Markdown into HTML.

Logging out each match index will give us the corresponding values:

console.log(match[0]);
> match [group 1] [group 2]
console.log(match[1]);
> group 1
console.log(match[2]);
> group 2
console.log(cursorPos);
> 10

More data can now be derived from these values. For example, match[0].length and cursorPos can be used together to calculate the string position where the match ends:

const matchEndPos = cursorPos + match[0].length;

This is vital for reformatting entire bulks of text where you also need the content that isn’t matched: the “in-between” content that exists between matches. This will be demonstrated later in the article.
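As a small preview of that idea, the sketch below splits a string into matched and unmatched segments using match.index and match[0].length. The sample string, the segments array and the lastEnd variable are assumptions made for illustration.

```javascript
// Hypothetical sample with text before and after the match.
const source = "Testing a match [group 1] [group 2] and trailing text";
const pattern = /match \[(.*?)\] \[(.*?)\]/gi;

let lastEnd = 0; // position where the previous match ended
const segments = [];

for (const match of source.matchAll(pattern)) {
  // Content between the previous match (or the string start) and this match.
  segments.push({ type: "text", value: source.slice(lastEnd, match.index) });
  segments.push({ type: "match", value: match[0] });
  // End position of the current match.
  lastEnd = match.index + match[0].length;
}

// Whatever remains after the final match.
segments.push({ type: "text", value: source.slice(lastEnd) });

console.log(segments);
```

Here the single match starts at index 10 and is 25 characters long, so lastEnd becomes 35 and the final segment holds the trailing text.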

Advantages of the iterator

Beyond the additional data supplied by each match, and the simple syntax of calling matchAll, there are other fundamental advantages that using iterators bring:

  • Iterators do not need to be indexed, and iteration works with any data type that conforms to the iterable protocol, including strings, arrays, sets and maps.
  • Iterators work great with the for…of loop, the newest member of the for loop family of JavaScript. This makes for minimal syntax to loop through potentially complex objects. The more capture groups your Regex contains, the more complex each matching element will be.
  • The ability to refer to named capture groups.

Let’s touch on the last point next — where the power of matchAll really becomes apparent, when we work with named capture groups within match.

Named Capture Groups within `match`

The previous example highlighted how match automatically indexes each capture group within its resulting array. This is very useful, but in the event we’re dealing with large regular expressions with many capture groups, working with many array elements will become confusing.

Let’s go one step further by utilising named capture groups.

Fundamentally, match also collects each named capture group in a separate property: match.groups. Let’s modify the previous regular expression slightly to name the two capture groups. Let’s call them mygroup and anothergroup:

const myString = "Testing a match [group 1] [group 2]";
const regexp = /match \[(?<mygroup>.*?)\] \[(?<anothergroup>.*?)\]/ig;

The ?<mygroup> and ?<anothergroup> syntax above “tags” or “names” a capture group. To name a capture group, the question mark ? immediately follows the opening parenthesis of the group, followed by the name of the group in angle brackets. The pattern for the group follows immediately after the closing angle bracket.

The ? in this case does not mean “optional”. Indeed, there are no characters to test before it, as it appears directly after the opening of the capture group.

With names now hardcoded in the Regex itself, we can now access them when iterating through each match:

const matches = myString.matchAll(regexp);
for (const match of matches) {

  // accessing groups via destructuring `match`
  const { groups: { mygroup, anothergroup }, index } = match;
  console.log(mygroup);
  // > group 1
  console.log(anothergroup);
  // > group 2
}

Notice the great destructuring syntax used here to get each group match, plus match.index, in one line of code. mygroup and anothergroup become separate constants that can now be used for further processing within the iteration in question.
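One caveat worth sketching: match.groups is only an object when the regular expression actually contains named groups. Otherwise it is undefined, and the nested destructuring shown above would throw. The sample string and the group name label below are hypothetical.

```javascript
const input = "Testing a match [group 1] [group 2]";

// With a named group, match.groups is an object keyed by group name.
const named = [...input.matchAll(/\[(?<label>.*?)\]/g)];
const labels = named.map(({ groups: { label } }) => label);

// Without named groups, match.groups is undefined, so optional chaining
// with a fallback to the numbered group avoids a TypeError.
const unnamed = [...input.matchAll(/\[(.*?)\]/g)];
const safeLabels = unnamed.map((m) => m.groups?.label ?? m[1]);

console.log(labels);            // [ 'group 1', 'group 2' ]
console.log(unnamed[0].groups); // undefined
console.log(safeLabels);        // [ 'group 1', 'group 2' ]
```

In practice this only matters when the same iteration code is reused with different regular expressions, some of which may not name their groups.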

Let’s now take a look at a real-world example of using matchAll, to reformat a bulk of Markdown text.

Matching Markdown Rules with matchAll

In this section we will take matchAll to the next level and test multiple Markdown rules in one regular expression, with each of those rules having named capture groups. We will then test which rule was matched within the iteration in question, and format the text accordingly.

Combining markdown rules with |

In order to test multiple Markdown rules in a singular Regex, the vertical bar (|) — also known as the alternation operator — can be used to act as an “or” operator. It can be used on individual characters, character classes, and capture groups.
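As a quick illustration of alternation before moving to Markdown, the sketch below uses | both inside a group and between two whole patterns. The regex and the sample sentence are made up for this example.

```javascript
// (a|e) alternates between single characters inside a group;
// the outer | alternates between two whole patterns.
const altRegex = /gr(a|e)y|colou?r/gi;

const words = "Grey or gray? Color or colour?".match(altRegex);
console.log(words); // [ 'Grey', 'gray', 'Color', 'colour' ]
```

The engine tries the alternatives from left to right at each position and takes the first one that succeeds, which becomes relevant once rules can overlap.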

Take the following regular expression, that tests for bold text, italic text, and links, with fully configured named capture groups. For simplicity, I have omitted testing asterisk characters, that also represent bold and italic text in markdown in addition to the underscore:

// either match bold, italic or link formatted markdown
const regex = /(__(?<bold>.*?)__)|(_(?<italic>.*?)_)|(?<link>\[(?<text>[\w\s\d]+)\]\((?<url>https?:\/\/([a-z0-9@#/.-]+))\))/ig;

The Regex testing a Markdown link has been taken from my introductory article on Regex. Note that the i and g flags are very commonly used together when searching bulks of text for multiple case-insensitive matches.

Here are the three rules broken down further:

// bold text - __text__
(__(?<bold>.*?)__)
// or italic text - _text_
|(_(?<italic>.*?)_)
// or link - [link text](url...)
|(?<link>\[(?<text>[\w\s\d]+)\]\((?<url>https?:\/\/([a-z0-9@#/.-]+))\))

Each markdown rule is wrapped within its own capture group, with an additional named capture group surrounding the content we are interested in reformatting.

All these groups will become accessible in each match result, with the un-named groups accessible through match elements, and named groups accessible via match.groups. Where a group does not participate in a match, it is assigned a value of undefined.
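Because unmatched groups are undefined, the rule that fired can be found by looking for the named group that has a value. The sketch below does this for a reduced two-rule version of the regular expression. The sample text and the matchedRules array are assumptions for illustration.

```javascript
// Reduced version of the article's regex: bold and italic only.
const ruleRegex = /(__(?<bold>.*?)__)|(_(?<italic>.*?)_)/g;
const sampleText = "Some __bold__ and _italic_ words";

const matchedRules = [];
for (const match of sampleText.matchAll(ruleRegex)) {
  // With this regex exactly one named group is defined per match.
  const [rule, content] = Object.entries(match.groups)
    .find(([, value]) => value !== undefined);
  matchedRules.push({ rule, content, index: match.index });
}

console.log(matchedRules);
// [ { rule: 'bold', content: 'bold', index: 5 },
//   { rule: 'italic', content: 'italic', index: 18 } ]
```

The order of the alternatives matters here: if the italic rule came first, the opening __ of bold text would be consumed as an empty italic match, so the more specific bold rule is listed first.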

Notice too that the <link> group also has two other groups within it — <text> and <url>. These will also be present in match.groups.

Let’s go ahead and test a string now, that will contain one of each Markdown rule we are testing for:

// matching a string with three Markdown rules
const str = "Testing some __bold text__ and _italic text_ with my Medium link: [Here](https://medium.com/@rossbulat)";
const matches = str.matchAll(regex);
for (const match of matches) {
  console.log(match.groups);
}

Running this will demonstrate that each of the rules is successfully matched. Let’s take a look at match.groups when the link is matched:

// `match.groups` of a matched Markdown link
console.log(match.groups);
> {
  bold: undefined,
  italic: undefined,
  link: "[Here](https://medium.com/@rossbulat)",
  text: "Here",
  url: "https://medium.com/@rossbulat"
}

Notably, all capture groups are listed in a one-dimensional object. Groups embedded in other groups, such as <url> and <text> within the <link> capture group, are not treated differently in match.groups.

We now have the ability to reformat a bulk of Markdown text. Let’s demonstrate this in the final section of this piece.

Reformatting Markdown into HTML

The implementation walked through below takes the three Markdown rules from above and reformats a Markdown string into HTML.
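What follows is a minimal sketch of such a reformatter, assembled from the snippets discussed in this section rather than copied from the author’s gist. The function name markdownToHtml, the cursor update at the end of each iteration, and the use of slice in place of substr are assumptions made for illustration.

```javascript
// Regex from the previous section: bold, italic, or link.
const markdownRegex = /(__(?<bold>.*?)__)|(_(?<italic>.*?)_)|(?<link>\[(?<text>[\w\s\d]+)\]\((?<url>https?:\/\/([a-z0-9@#/.-]+))\))/gi;

// Hypothetical wrapper name; the original gist may be organised differently.
function markdownToHtml(myStr) {
  let reformattedStr = "";
  let cursorPos = 0; // end position of the previous match

  for (const match of myStr.matchAll(markdownRegex)) {
    const { groups: { bold, italic, link, text, url }, index } = match;

    // Append the unmatched content between the previous match and this one.
    reformattedStr += myStr.slice(cursorPos, index);

    // Test which rule matched and append the reformatted content.
    if (bold !== undefined) {
      reformattedStr += "<b>" + bold + "</b>";
    } else if (italic !== undefined) {
      reformattedStr += "<i>" + italic + "</i>";
    } else if (link !== undefined) {
      reformattedStr += '<a href="' + url + '">' + text + "</a>";
    }

    // Move the cursor to the end of the current match.
    cursorPos = index + match[0].length;
  }

  // Append whatever follows the last match.
  if (cursorPos < myStr.length) {
    reformattedStr += myStr.slice(cursorPos);
  }
  return reformattedStr;
}

const html = markdownToHtml(
  "Testing some __bold text__ and _italic text_ with my Medium link: [Here](https://medium.com/@rossbulat)"
);
console.log(html);
// Testing some <b>bold text</b> and <i>italic text</i> with my Medium link: <a href="https://medium.com/@rossbulat">Here</a>
```

Note that this sketch inserts matched content into the HTML without escaping it, so untrusted input would need to be escaped before being rendered.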

reformattedStr is initialised as an empty string, with reformatted matches as well as non-matched content being appended to it as the iteration continues.

Some key points about this implementation:

  • Destructuring syntax has again been used to extract each group from match.groups, a handy shortcut that simplifies syntax. We can now reference each group further down when we determine what rule was matched:
// destructuring `match.groups`
const { groups: { bold, italic, link, text, url }, index } = match;
  • cursorPos keeps a record of where each match ends. Within the next iteration, match.index will supply the match starting position. With both these data points, we can fetch the substring of the content between matches. This is the first thing we do upon each iteration, appending the content between matches:
// append string content from the last match
reformattedStr += myStr.substr(cursorPos, (index - cursorPos));
  • We have leveraged both match.groups to fetch particular Markdown content, as well as indexed capture groups, such as match[0], to fetch the entire match content. match[0] is important to calculate the length of its content in relation to the original string.
  • The HTML reformatting and appending to reformattedStr is simply implemented as a conditional statement, testing whether each group is undefined, and reformatting accordingly when a group value exists:
// test each rule and append to reformatted string
if (bold !== undefined) {
  reformattedStr += '<b>' + bold + '</b>';
}
else if (italic !== undefined) {
  reformattedStr += '<i>' + italic + '</i>';
}
else if (link !== undefined) {
  reformattedStr += '<a href="' + url + '">' + text + '</a>';
}
  • Once complete, an additional check is made to append any further content after the last match. If cursorPos is less than the length of the original string, we know there is more content to be appended:
// appending content after last match
if (cursorPos < myStr.length) {
  reformattedStr += myStr.substr(cursorPos, (myStr.length - cursorPos));
}

Although we are still constructing a string from another string, this example clearly demonstrates the power and comprehensiveness of matchAll, making our lives as developers easier by referencing clearly grouped match content.

In Summary

This article has built upon my introduction to Regex, demonstrating how to use the newly introduced JavaScript API matchAll to iterate through matches of a Regex test on a string. This is very useful for reformatting data for a range of use cases, some of which were mentioned at the top of this article.

Capture groups are well supported by matchAll: every match supplies each group’s captured text, as well as a separate object of named capture groups. In addition to this, the full matching content and the index at which the match began are also given.

The final example of this piece reformats Markdown into HTML, but we are not limited to reconstructing a string — you could for example reformat a match into a component, or even just an object for further processing, and push each onto an array. Further articles will be linked here for more advanced use cases of matchAll and Regex iteration.
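As a rough illustration of the object-based approach just mentioned, the sketch below pushes plain objects onto an array instead of building a string. The tokenize helper, the token shape ({ type, value }) and the reduced two-rule regex are all assumptions for this example.

```javascript
// Reduced regex for illustration: bold and italic only.
const tokenRegex = /(__(?<bold>.*?)__)|(_(?<italic>.*?)_)/g;

// Hypothetical helper: returns an array of plain objects instead of a string.
function tokenize(str) {
  const tokens = [];
  let cursorPos = 0; // end position of the previous match

  for (const match of str.matchAll(tokenRegex)) {
    const { groups: { bold, italic }, index } = match;

    // Unmatched content before this match becomes a text token.
    if (index > cursorPos) {
      tokens.push({ type: "text", value: str.slice(cursorPos, index) });
    }
    tokens.push(bold !== undefined
      ? { type: "bold", value: bold }
      : { type: "italic", value: italic });

    cursorPos = index + match[0].length;
  }

  // Unmatched content after the last match.
  if (cursorPos < str.length) {
    tokens.push({ type: "text", value: str.slice(cursorPos) });
  }
  return tokens;
}

const tokens = tokenize("Some __bold__ and _italic_ words");
console.log(tokens.map((t) => t.type)); // [ 'text', 'bold', 'text', 'italic', 'text' ]
```

An array like this could then be mapped to components, for example one native text element per token, with styling chosen from the token type.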

Programmer and Author. Director @ JKRBInvestments.com. Creator of LearnChineseGrammar.com for iOS.
