home

Using regex with JavaScript

2021-9-21

regex

I'm looking to extract the image source from a string of html.

const htmlString = '<h1>Regex and Javascript</h1><h2>It feels good to progress</h2><img src="https://talktree.me/favicon512.png" alt="highliners"/>'

In this case - I'm looking for https://talktree.me/favicon512.png

Without regex

I can use JS functions like slice() and indexOf() to get the string.

const firstImageBeginning = htmlString.indexOf('&ltimg')
if (firstImageBeginning > -1) {
  const firstImageEnd = htmlString.indexOf('/>', firstImageBeginning)
  const firstImage = htmlString.slice(
		firstImageBeginning + 10, firstImageEnd - 2
	)
}

This wouldn't even work in most cases, I'd need more code.

  • If/then statement has to be used since indexOf() returns -1 when the string is not found.
  • '&ltimg src="' is 10 characters, hence the '+ 10; this wouldn't work with an 'alt tag'.

With regex

It's much easier to use JS's match() function instead.

const wholeImgTag = htmlString.match(/&ltimg.+?src=".+?"/)

So it's looking for '&ltimg' followed by anything until 'src="' followed by anything until '"'

Here is the regex deconstructed:

  • '/' starts and ends the expression
  • '&ltimg' is interpreted literally, it looks for those characters
  • '.' is a special character, it represents any character
  • '+' is also a special character to match the previous character
  • '?' is also a special character for 'if followed by'
  • 'src=" ' and '"' are interpreted literally

Javascript doesn't just return the string with match()

[
  0: "&ltimg alt="highliners" src="https://talktree.me/favicon512.png"",
  groups: undefined,
  index: 63,
  input: "<h1>Regex and Javascript</h1><h2>It feels good to …ners" src="https://talktree.me/favicon512.png" />",
]

The response we normally want is [0]. But in this case, I want the URL and not the whole &ltimg> tag so I'm going to use regex's grouping property ( ).

const wholeImgTag = htmlString.match(/&ltimg.+?src="(.+?)"/)

I added parentheses around the second .+?. Now match() returns:

[
  0: "&ltimg alt="highliners" src="https://talktree.me/favicon512.png"",
  1"https://talktree.me/favicon512.png",
  groups: undefined,
  index: 63,
  input: "<h1>Regex and Javascript</h1><h2>It feels good to …ners" src="https://talktree.me/favicon512.png" />",
]

I can access the exact string I want under the [1] property: https://talktree.me/favicon512.png

If you use the global tag, /g in the regex, match() will return an array with the matched strings and not contain groups, index, or input.

I don't use regex often enough to remember all the rules, but when there is a situation it can be used, I spend the time. The code is more predictable, more concise, and saves me time in the long run.