Match Markdown links with advanced regex features

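Here is the trick: match a space explicitly by writing it between square brackets, [ ]. Once the pattern is laid out with indentation and comments (extended mode), whitespace outside a character class is ignored.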

Let’s update our regular expression with indentation, line returns and comments:

    \[
      (?<text>.+)       # Text group
    \]
    \(
      (?<url>[^ ]+)     # URL group
      (?:               # Group matching title with space and double quotes around
        [ ]             # Space separating URL and optional title
        "
          (?<title>.+)  # Title group
        "
      )?                # Title group is optional
    \)

Our regular expression is now much easier to read and maintain.
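As a rough illustration, the commented layout can be tried in Python, assuming its built-in re module and the re.VERBOSE flag as the equivalent of extended mode. Note that Python spells named groups (?P<name>...) rather than (?<name>...), and the sample sentence is made up for the example.

```python
import re

# Verbose (extended) mode: whitespace and "# comments" in the pattern are ignored.
# Python writes named groups as (?P<name>...) instead of (?<name>...).
LINK = re.compile(
    r'''
    \[
      (?P<text>.+)        # Text group
    \]
    \(
      (?P<url>[^ ]+)      # URL group
      (?:                 # Group matching title with space and double quotes around
        [ ]               # Space separating URL and optional title
        "
          (?P<title>.+)   # Title group
        "
      )?                  # Title group is optional
    \)
    ''',
    re.VERBOSE,
)

match = LINK.search('See [My text](http://www.example.com "My title") for details.')
print(match.group("text", "url", "title"))
# ('My text', 'http://www.example.com', 'My title')

# A bare space is ignored in verbose mode, so it cannot match a real space.
print(bool(re.fullmatch(r"a b", "a b", re.VERBOSE)))    # False
# Inside square brackets the space is kept and matches.
print(bool(re.fullmatch(r"a[ ]b", "a b", re.VERBOSE)))  # True
```

The last two lines show why the [ ] trick is needed: without the brackets, the space in the pattern silently disappears.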

2. Fix title matching with a look behind expression

Double quotes enclose titles (e.g. "title").

But a title can also contain double quotes if they are escaped with a backslash: "This is \"my title \"".

A title is therefore a string in which every character is either a character other than a double quote, or a double quote preceded by a backslash (an escaped double quote).

We are going to use a positive look behind expression to find escaped double quotes.

➡️ Syntax of positive look behind expression: (?<=foo)bar (finds "bar" only if preceded by "foo").

In our case, finding escaped double quotes will look like this: (?<=\\)" (the backslash has to be escaped itself in the regex, hence the double backslash).

We can now replace the title group by:

    (?<title>(?:[^"]|(?<=\\)")*?)

The whole regular expression becomes:

    \[
      (?<text>.+)            # Text group
    \]
    \(
      (?<url>[^ ]+)          # URL group
      (?:                    # Optional group matching title with space and double quotes around
        [ ]                  # Space separating URL and optional title
        "
          (?<title>          # Title group
            (?:
              [^"]|(?<=\\)"
            )*?
          )
        "
      )?
    \)

A link like this will now be correctly matched: [My text](http://www.example.com "My \"title\"").
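The title rule can be checked with a small sketch, again assuming Python's built-in re module with re.VERBOSE and its (?P<name>...) spelling for named groups. The second link, with unescaped double quotes in the title, is an invented counter-example.

```python
import re

LINK = re.compile(
    r'''
    \[
      (?P<text>.+)            # Text group
    \]
    \(
      (?P<url>[^ ]+)          # URL group
      (?:                     # Optional title with space and double quotes around
        [ ]                   # Space separating URL and optional title
        "
          (?P<title>          # Title group
            (?:
              [^"]            # Any character that is not a double quote
              | (?<=\\)"      # OR a double quote preceded by a backslash
            )*?
          )
        "
      )?
    \)
    ''',
    re.VERBOSE,
)

# Raw strings keep the backslashes exactly as they would appear in the Markdown source.
escaped = LINK.fullmatch(r'[My text](http://www.example.com "My \"title\"")')
print(escaped.group("title"))
# My \"title\"

# Unescaped double quotes inside the title are rejected by the look behind.
unescaped = LINK.fullmatch(r'[My text](http://www.example.com "My "title"")')
print(unescaped)
# None
```

The captured title still contains the backslashes; removing them is left to whatever code consumes the match.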

Figure: titles containing escaped double quotes are now matched.

3. Use recursive regular expressions for complex text matching

Our current regular expression works well so far, but not in all cases.

Let’s consider the following Markdown link, which is valid too: [Link text with [brackets] inside](http://www.example.com).

The text part now contains some brackets in its content, something that our current regex doesn’t handle!

We are looking for nested groups, each consisting of a string of the following form: [ ... ].

This is the exact definition of recursion!

We will use the special syntax (?&name_of_block) to look recursively for the name_of_block block pattern.
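To make the idea of recursion concrete, here is a sketch of the same logic written by hand. Python's built-in re module has no (?&name) recursion, so this swaps the regex in for a plain recursive function; consume_group and match_text_group are hypothetical names chosen for the illustration. A group is an opening bracket, then any mix of ordinary characters and nested groups, then a closing bracket.

```python
from typing import Optional


def consume_group(s: str, i: int) -> int:
    """Consume one balanced [...] group starting at s[i].

    Returns the index just past the closing bracket, or -1 if no balanced
    group starts at i.
    """
    if i >= len(s) or s[i] != "[":
        return -1
    i += 1  # skip the opening bracket
    while i < len(s):
        if s[i] == "]":
            return i + 1  # closing bracket of this group
        if s[i] == "[":
            i = consume_group(s, i)  # nested group: recurse
            if i == -1:
                return -1
        else:
            i += 1  # ordinary character
    return -1  # ran out of input before the group was closed


def match_text_group(s: str) -> Optional[str]:
    """Return the balanced [...] group at the start of s, or None."""
    end = consume_group(s, 0)
    return s[:end] if end != -1 else None


print(match_text_group("[Link text with [brackets] inside](http://www.example.com)"))
# [Link text with [brackets] inside]
print(match_text_group("[unbalanced [text](http://www.example.com)"))
# None
```

The recursive call plays the role that the (?&name_of_block) reference plays inside a regular expression: whenever a new opening bracket appears, the same rule is applied again from that position.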

So here is our final regular expression to match a Markdown link:

    (?<text_group>                  # Text group, including square brackets
      \[
        (?>                         # (?> defines an atomic group, this is a performance improvement when using recursion
          [^\[\]]+                  # Look for any characters except square brackets
          |(?&text_group)           # OR: find recursively another pattern with opening and closing square brackets
        )*
      \]
    )
    (?:
      \(
        (?<url>\S*?)                # URL: non-greedy non-whitespace characters
        (?:
          [ ]
          "
            (?<title>
              (?:[^"]|(?<=\\)")*?   # Title without double quotes around
            )
          "
        )?
      \)
    )

We now have a regular expression that matches Markdown links, extracts their data and is easy to read!

Figure: final schema for the regular expression. See the recursion on the left that matches the text.

Originally published at blog.michaelperrin.fr on February 4, 2019.
