Practical guides · 13 min read

Regex: the parts that matter and the parts you can skip

A working guide to regex features you actually need in production. Catastrophic backtracking, ReDoS, when not to use regex (HTML, email validation, balanced parens), and the differences between regex flavors that bite you.

By The Toolsy team

Regex has a worse reputation than it deserves. The reputation comes from people trying to use it for things regex is bad at (parsing HTML, validating email addresses to RFC perfection, matching balanced parentheses) and concluding the whole tool is broken. It isn't — but the cases it's good at are narrower than most tutorials suggest.

This post is a working guide to which regex features you should use, which you should avoid, and how to write regexes that work in production without the cryptic failures and runaway match times that give regex its reputation.

The 80% of regex you actually need

Most production regex problems use a small subset of features. The essentials:

.       any character (except newline by default)
\d      digit (0-9)
\w      word character (letters, digits, underscore)
\s      whitespace (space, tab, newline)
\b      word boundary (zero-width)

*       zero or more
+       one or more
?       zero or one
{3}     exactly 3
{3,}    3 or more
{2,5}   2 to 5

^       start of string (or line, with multiline flag)
$       end of string (or line, with multiline flag)
[abc]   character class — any of a, b, c
[^abc]  negated class — anything except a, b, c
[a-z]   range
|       alternation: a|b matches a or b
()      capture group
(?:..)  non-capture group

Flags that matter:

g       global — find all matches
i       case-insensitive
m       multiline — ^ and $ match line boundaries
s       dotall — . also matches newlines
u       unicode — properly handle surrogate pairs

That's most regex you'll ever write. The rest of this post is when to use each one and when not to.
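As a rough illustration of how the flags change what a pattern matches, here is a minimal sketch. The sample string `log` is made up for the example.

```javascript
// Hypothetical sample input: three lines of log text.
const log = 'ERROR disk full\nwarn: retrying\nerror: timeout';

// i only: ^ still means "start of the whole string".
const first = log.match(/^error/i);          // matches "ERROR" at index 0

// g + i + m: every line that starts with "error", in any case.
const allErrors = log.match(/^error/gim);    // ["ERROR", "error"]

// s (dotall): . can cross newlines, so one match can span lines.
const spansLines = /disk.*timeout/s.test(log);   // true
const staysOnLine = /disk.*timeout/.test(log);   // false
```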

Character classes: prefer them over alternation

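Beginners often write (a|b|c|d|e) when they mean [abcde]. The character-class version is shorter, easier to read, and usually cheaper to match, because the engine tests one character against a set instead of trying each alternative in turn.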

Alternation is the right tool when alternatives are multi-character: (jpg|png|gif|webp) can't be expressed as a character class.
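A small sketch of both cases, assuming you only care whether the input matches. The names `viaAlternation`, `viaClass`, and `isImage` are illustrative.

```javascript
// Single-character alternatives: both patterns accept the same inputs.
const viaAlternation = /^(?:a|b|c|d|e)$/;
const viaClass = /^[a-e]$/;

const samples = ['a', 'e', 'f', 'ab', ''];
const agree = samples.every((s) => viaAlternation.test(s) === viaClass.test(s));
// agree is true

// Multi-character alternatives still need alternation.
const imageExt = /\.(?:jpg|png|gif|webp)$/i;
const isImage = (name) => imageExt.test(name);

isImage('photo.PNG');   // true
isImage('notes.txt');   // false
```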

The greedy vs lazy trap

By default, quantifiers are greedy — they match as much as possible while still allowing the overall regex to succeed. This is wrong almost as often as it's right.

Classic example: extracting tags from a string.

const html = '<b>bold</b> and <i>italic</i>';

// Greedy: wrong
html.match(/<(.+)>/);
// matches the WHOLE string from <b> to </i>

// Lazy: right
html.match(/<(.+?)>/);
// matches just <b>

The ? after a quantifier (+?, *?, {2,5}?) makes it lazy — match as little as possible. When matching content between delimiters, you almost always want lazy.

An alternative pattern that doesn't need laziness: explicitly exclude the closing delimiter from the character class.

// Match anything that's NOT >
html.match(/<([^>]+)>/);
// captures "b"

This is often faster than the lazy version, because the regex engine doesn't have to backtrack.
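To see the three variants side by side, here is a short sketch that also uses the g flag to collect every tag. The input string `markup` is invented for the example.

```javascript
// Hypothetical input; the tags are only for illustration.
const markup = '<b>bold</b> and <i>italic</i>';

// Greedy: group 1 runs from the first "<" to the last ">".
const greedy = markup.match(/<(.+)>/)[1];    // "b>bold</b> and <i>italic</i"

// Lazy: stops at the first ">" it can reach.
const lazy = markup.match(/<(.+?)>/)[1];     // "b"

// Negated class plus g: collect the inside of every tag.
const all = [...markup.matchAll(/<([^>]+)>/g)].map((m) => m[1]);
// ["b", "/b", "i", "/i"]
```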

Word boundaries — useful and underused

\b matches a position (zero-width) between a word character and a non-word character. It's the difference between matching "cat" in "category" (you don't want to) and matching it in "the cat sat" (you do).

'category'.match(/cat/)        // matches — wrong
'category'.match(/\bcat\b/)    // null — right
'the cat sat'.match(/\bcat\b/) // matches — right

Use \b when matching whole words. Common pattern: /\b(?:TODO|FIXME|HACK)\b/g to find code comment markers.
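Here is that marker pattern run against a few made-up lines, as a sketch of what \b does and doesn't let through.

```javascript
// Hypothetical source text to scan.
const source = [
  'const x = 1; // TODO: rename',
  'let todoList = []; // lowercase and part of a longer word',
  '// FIXME handle null',
  'const HACKY = true; // trailing Y is a word character, so no boundary',
].join('\n');

const markers = source.match(/\b(?:TODO|FIXME|HACK)\b/g);
// ["TODO", "FIXME"]
```

Note the pattern is case-sensitive on purpose; add the i flag only if lowercase markers should count too.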

What regex is genuinely bad at

HTML parsing

HTML is not a regular language. It nests, has implicit closing rules, supports entities, allows attributes in any order, and has CDATA sections. Any production regex that tries to handle all of it becomes a CVE waiting to happen. Use a real HTML parser. In browser JavaScript: new DOMParser().parseFromString(source, 'text/html') (safer than assigning untrusted markup to innerHTML, which can fire inline event handlers); in Python: BeautifulSoup; in Go: golang.org/x/net/html.

If you're 100% sure your input is simple and won't contain malformed HTML or scripts, regex can work for one-off scripts. Never for production parsing.
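A quick sketch of how little it takes to break a tag-matching regex on perfectly legal HTML. The sample tag is invented, and the pattern is a typical negated-class tag matcher.

```javascript
// A ">" inside a quoted attribute value is legal HTML.
const tag = '<a title="x > y" href="/docs">link</a>';

// The negated-class pattern stops at the first ">", inside the attribute.
const captured = tag.match(/<([^>]+)>/)[1];
// 'a title="x '
```

A parser knows it is inside a quoted attribute at that point; the regex has no such state.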

Email validation

RFC 5322 email addresses can be extremely weird. "Alice Smith"@example.com is a valid address. So is "a\"b"@example.com. So is postmaster@[IPv6:2001:db8::1]. Writing a regex that matches all valid RFC 5322 addresses is possible but produces a 6,000-character monster.

For practical "is this probably an email" validation, use a simple regex that catches obvious typos:

/^[^\s@]+@[^\s@]+\.[^\s@]+$/

This rejects strings without @, with spaces, or without a dot in the domain. It accepts plenty of invalid-by-RFC addresses (and rejects some valid ones), but it covers nearly everything users actually type. For real validation, send a verification email. That is the only check that actually proves the address works.
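Wrapped in a function, the check might look like the sketch below. The helper name `looksLikeEmail` and the trim() call are assumptions for the example, not part of the pattern itself.

```javascript
const LOOKS_LIKE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Hypothetical helper: a typo filter, not RFC validation.
function looksLikeEmail(input) {
  return typeof input === 'string' && LOOKS_LIKE_EMAIL.test(input.trim());
}

looksLikeEmail('alice@example.com');          // true
looksLikeEmail('alice@example');              // false: no dot in the domain
looksLikeEmail('alice example.com');          // false: no @
looksLikeEmail('"Alice Smith"@example.com');  // false, although RFC 5322 allows it
```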

Balanced parentheses

You can't match arbitrary balanced parens with regex (in the regular-language sense). PCRE and some other engines have recursive features that simulate it, but the resulting expressions are unreadable and catastrophically slow on adversarial input. Use a tokenizer.
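For the simplest version of the problem, a single pass with a depth counter is enough. This is only a sketch: `isBalanced` is a made-up helper, and it ignores things a real tokenizer would handle, such as parentheses inside string literals.

```javascript
// Returns true when every "(" has a matching ")" in the right order.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === '(') {
      depth += 1;
    } else if (ch === ')') {
      depth -= 1;
      if (depth < 0) return false; // a ")" appeared before its "("
    }
  }
  return depth === 0;
}

isBalanced('(a(b)c)');  // true
isBalanced(')(');       // false
isBalanced('(()');      // false
```

It runs in linear time on any input, which is exactly what the recursive regex versions can't promise.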

Catastrophic backtracking — the production-killer

Some regex patterns can take exponential time on certain inputs. Classic example:

// This pattern looks harmless...
const pattern = /^(a+)+$/;

// On 30 a's followed by a b, it takes seconds:
pattern.test('a'.repeat(30) + 'b');
// on the order of a billion backtracking steps (about 2^30)

The pattern (a+)+ creates ambiguity: any a can be part of the inner a+ or the outer +. For each character, there are multiple ways to partition the match, and the engine tries all of them when the match ultimately fails.

The general rule: nested quantifiers on the same character (or overlapping character classes) cause catastrophic backtracking. Watch for (.+)+, (\w+)*, (\d+)+, etc. These are red flags.

Fix by:

  • Removing the unnecessary outer quantifier — a+ alone matches the same things as (a+)+
  • Using atomic groups (?>...) in engines that support them (Perl/Java/Ruby/.NET — NOT JavaScript)
  • Using possessive quantifiers ++, *+ (same engines)
  • Restructuring the pattern to remove ambiguity

This is the mechanism behind regular-expression denial-of-service (ReDoS) attacks. It also causes self-inflicted outages: Cloudflare's global outage in July 2019 was traced to a WAF rule whose regex backtracked excessively and exhausted CPU.
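To watch the growth without locking up a process, a sketch like the one below keeps the input short and compares the risky pattern with the fixed one. `timeFailingMatch` is a hypothetical helper, and the timings depend on your machine and engine, so no numbers are shown.

```javascript
const risky = /^(a+)+$/;
const safe = /^a+$/;   // same language, no nested quantifier

// Time one failing match. Date.now() is coarse but enough to show the trend.
function timeFailingMatch(pattern, n) {
  const input = 'a'.repeat(n) + 'b';
  const start = Date.now();
  const matched = pattern.test(input);
  return { n, matched, ms: Date.now() - start };
}

// Keep n small: in a backtracking engine each extra "a" roughly doubles
// the work for the risky pattern, while the safe one stays flat.
for (const n of [14, 16, 18, 20]) {
  console.log('risky', timeFailingMatch(risky, n), 'safe', timeFailingMatch(safe, n));
}
```

Running the same loop in an adversarial test is a cheap way to catch these patterns before they ship.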

Capture groups: when to use them, when to skip

Every time you write (), the regex engine captures the matched substring and stores it. This costs performance. If you don't need the captured value, use a non-capturing group (?:...):

// Both match the same things, but...
'abcabc'.match(/(abc)+/)     // matches "abcabc", group 1 = "abc"
'abcabc'.match(/(?:abc)+/)   // matches "abcabc", nothing stored, slightly less work

Use (?:...) as your default for grouping; use (...) only when you specifically want to capture or backreference.
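The difference shows up in the match array. A short sketch, with an invented host:port example for mixing the two kinds of group:

```javascript
const withCapture = 'abcabc'.match(/(abc)+/);
// withCapture[0] is "abcabc" (the whole match)
// withCapture[1] is "abc" (the last repetition of group 1)

const withoutCapture = 'abcabc'.match(/(?:abc)+/);
// withoutCapture[0] is "abcabc"; the array has no index 1

// Group what needs grouping, capture only what you will read back.
const hostPort = 'localhost:8080'.match(/^(?:localhost|127\.0\.0\.1):(\d+)$/);
// hostPort[1] is "8080"
```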

Backreferences

A backreference matches the same content a previous capture group matched. Inside a regex, use \1, \2, etc. In a replacement string, use $1, $2.

// Find repeated words ("the the", "to to")
text.match(/\b(\w+)\s+\1\b/g);

// Swap last name and first name in a CSV
'Smith, Alice'.replace(/(\w+), (\w+)/, '$2 $1');
// "Alice Smith"

Useful but slow — backreferences prevent many optimizations the engine could otherwise apply. Use sparingly on big inputs.
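Here is the repeated-word pattern on a concrete string, plus the same pattern used to clean the text up. The sentence in `draft` is made up.

```javascript
// Hypothetical text with two accidental repeats.
const draft = 'This is the the draft we sent to to the client.';

const repeats = draft.match(/\b(\w+)\s+\1\b/g);
// ["the the", "to to"]

// Collapse each repeat to a single word using $1 in the replacement.
const cleaned = draft.replace(/\b(\w+)\s+\1\b/g, '$1');
// "This is the draft we sent to the client."
```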

Named groups (modern engines)

JavaScript (ES2018+), Python, Ruby, and .NET support named capture groups (Python spells them (?P<name>...)). Replace cryptic $1/$2 with meaningful names:

const date = '2026-05-20';
const m = date.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
m.groups.year   // "2026"
m.groups.month  // "05"
m.groups.day    // "20"

// In replacements:
date.replace(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
             '$<day>/$<month>/$<year>');
// "20/05/2026"

Use named groups in production code. The readability win is large and the performance cost is zero.

Lookahead and lookbehind: zero-width assertions

Lookahead/lookbehind match a position based on what's around it, without consuming characters. Four forms:

(?=foo)   positive lookahead — followed by foo
(?!foo)   negative lookahead — not followed by foo
(?<=foo)  positive lookbehind — preceded by foo
(?<!foo)  negative lookbehind — not preceded by foo

Practical uses:

// Numbers not preceded by a dollar sign
'$50 and 30 chickens'.match(/(?<!\$)\b\d+\b/g);
// ["30"]

// Letters followed by a digit (without including the digit)
'a1 b2 c'.match(/[a-z](?=\d)/g);
// ["a", "b"]

// Words not ending in 's' (a crude way to skip plurals)
'cats dogs bird'.match(/\b\w+(?<!s)\b/g);
// ["bird"]

Caveat: lookbehind support varies by engine. JavaScript got it in ES2018, but browsers shipped it at different times: Chrome in 2017, Firefox in 2020, and Safari only in 2023 (version 16.4). Older environments may not support it. For maximum portability, use lookahead only.
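If you have to support engines on both sides of that line, one option is to feature-detect at runtime and fall back to a capture group. This is a sketch; `supportsLookbehind` and `extractPrices` are hypothetical helpers.

```javascript
// Building the pattern from a string means an unsupported engine throws
// a SyntaxError here at runtime instead of failing to parse the whole file.
function supportsLookbehind() {
  try {
    new RegExp('(?<=a)b');
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: digits that follow a "$", without the "$" itself.
function extractPrices(text) {
  if (supportsLookbehind()) {
    return text.match(new RegExp('(?<=\\$)\\d+', 'g')) || [];
  }
  // Fallback: match the "$" too, then keep only the captured digits.
  const out = [];
  const re = /\$(\d+)/g;
  let m;
  while ((m = re.exec(text)) !== null) out.push(m[1]);
  return out;
}

extractPrices('$50 and $7 for 30 chickens');  // ["50", "7"]
```

A regex literal containing lookbehind would be rejected when the file is parsed, which is why the sketch goes through the RegExp constructor.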

Regex flavors: not all engines are equal

The features above work in most modern engines but with subtle differences. Major flavors:

  • POSIX BRE (basic regex, original grep). Requires escaping for +, ?, {. Doesn't support most modern features. Avoid unless forced.
  • POSIX ERE (extended regex, grep -E, sed -E). Standard +, ?, but no \d, \w, \b.
  • PCRE (Perl-Compatible Regex). The most feature-complete; supports atomic groups, recursion, possessive quantifiers. Used by PHP, nginx, many tools.
  • JavaScript (ECMAScript regex). Almost PCRE-equivalent for modern engines, but no atomic groups or possessive quantifiers (the things that prevent ReDoS).
  • Python re. Similar to JavaScript. Python 3.11+ adds atomic groups and possessive quantifiers.
  • Go RE2. Different design — no backreferences, no lookarounds, but guaranteed linear time. Cannot have catastrophic backtracking. Powers Google Code Search.

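If you're writing regex that takes user input, RE2-style engines are dramatically safer because they cannot ReDoS. Go's standard regexp package follows the RE2 design; Ruby and Node.js have RE2 bindings (the re2 gem and the re2 npm package); V8 also ships an experimental non-backtracking engine behind a flag.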

Testing patterns

Don't ship a regex you haven't tested. Use a regex tester (like ours) to verify:

  • Positive cases: inputs that should match, with the expected capture groups.
  • Negative cases: inputs that should NOT match.
  • Adversarial cases: inputs that could cause backtracking — long strings of a character that appears in nested quantifiers.
  • Edge cases: empty string, single character, the boundaries of your character ranges.

Production regex always wants a comment explaining what it matches. Six months later, you won't remember:

// Match ISO 8601 dates: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS[Z|+HH:MM]
const ISO_DATE = /^\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:\d{2})?)?$/;
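The four kinds of cases listed above fit naturally in a small table next to the pattern. A minimal sketch, with made-up inputs and no test framework assumed:

```javascript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:\d{2})?)?$/;

// Each case: [input, shouldMatch].
const cases = [
  ['2026-05-20', true],                  // positive: date only
  ['2026-05-20T14:30:00Z', true],        // positive: UTC time
  ['2026-05-20T14:30:00+02:00', true],   // positive: offset
  ['20-05-2026', false],                 // negative: wrong field order
  ['2026-05-20T14:30', false],           // negative: seconds missing
  ['', false],                           // edge: empty string
  ['2026-05-20\n', false],               // edge: trailing newline
  ['9'.repeat(10000), false],            // adversarial: long run of digits
];

const failures = cases.filter(([input, expected]) => ISO_DATE.test(input) !== expected);
console.log(failures.length === 0 ? 'all cases pass' : failures);
```

The pattern only checks shape, so something like 2026-99-99 still matches; range checks belong in code after the regex.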

When regex isn't the right tool

If your regex is over 200 characters, or has more than ~10 quantifiers, or you're using recursion-emulating features, you've probably outgrown regex. Switch to:

  • A real parser for structured input (HTML, JSON, programming languages). Hand-written for simple cases; PEG / ANTLR / parser-combinator library for complex ones.
  • A purpose-built library for common formats (email validation libraries, URL parsers, semantic version parsers).
  • Multiple smaller regex + string operations. Sometimes splitting into pieces ("first find the lines that start with X, then for each, extract Y") is clearer than one huge pattern.
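As a sketch of that last option, here is the "find the lines, then extract" approach on an invented config-style input. The variable names are illustrative.

```javascript
// Hypothetical config-style input.
const input = [
  '# comment',
  'export PORT=8080',
  'export HOST=localhost',
  'DEBUG=1',
].join('\n');

// Step 1: plain string operations pick the lines that start with "export ".
const exported = input.split('\n').filter((line) => line.startsWith('export '));

// Step 2: one small regex per line pulls out the name and value.
const pairs = exported
  .map((line) => line.match(/^export (\w+)=(.*)$/))
  .filter((m) => m !== null)
  .map((m) => [m[1], m[2]]);
// [["PORT", "8080"], ["HOST", "localhost"]]
```

Each step is easy to test on its own, which is harder to say about a single multi-line pattern doing both jobs.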

Regex is a fantastic tool for searching, simple validation, and tokenizing. It's a terrible tool for parsing or anything where structure matters. The difference: regex tells you "does this string look like X?", parsing tells you "what does this string mean?". When you need the second, use the right tool.

