Skip to content

Home

regexr

Regular expressions for humans

Instead of writing a regular expression to match an URL:

# need to be compiled with re.X
regex = r'''
    ^(?P<protocol>http|https|ftp|mailto|file|data|irc)://
    (?P<domain>[A-Za-z0-9-]{0,63}(?:\.[A-Za-z0-9-]{0,63})+)
    (?::(?P<port>\d{1,4}))?
    (?P<path>/*(?:/*[A-Za-z0-9\-._]+/*)*)
    (?:\?(?P<query>.*?))?
    (?:\#(?P<fragment>.*))?$
'''

You can write:

regexr = Regexr(
    START,
    ## match the protocol
    Or('http', 'https', 'ftp', 'mailto', 'file', 'data', 'irc', capture="protocol"),
    '://',
    ## match the domain
    Capture(
        Repeat(OneOfChars('A-Z', 'a-z', '0-9', '-'), m=0, n=63),
        OneOrMore(DOT, Repeat(OneOfChars('A-Z', 'a-z', '0-9', '-'), m=0, n=63)),
        name="domain",
    ),
    ## match the port
    Maybe(':', Capture(Repeat(DIGIT, m=1, n=4), name="port")),
    ## match the path
    Capture(
        ZeroOrMore('/'),
        ZeroOrMore(
            ZeroOrMore('/'),
            OneOrMore(OneOfChars('A-Z', 'a-z', '0-9', r'\-._')),
            ZeroOrMore('/'),
        ),
        name="path",
    ),
    ## match the query
    Maybe("?", Capture(Lazy(MAYBE_ANYCHARS), name="query")),
    ## and finally the fragment
    Maybe("#", Capture(MAYBE_ANYCHARS, name="fragment")),
    END,
)

Inspired by rex for R and Regularity for Ruby.

Why?

We have re.X to compile a regular expression in verbose mode, but sometimes it is still difficult to read/write and error-prone.

  • Easy to read/write regular expressions

  • For example, []] might need a second to understand it. But we can write it as OneOfChars("]") and it will be easier to read.

  • Easy to write regular expressions with autocompletions from IDEs

  • When we write raw regex, we can't get any hints from IDEs

  • Non-capturing for groups whether possible

  • For example, with Maybe(Maybe("a", "b)) we get (?:(?:ab)?)?

  • Easy to avoid unintentional errors

  • For example, sometimes it's difficult to debug with r"(?P<a>>\d+)\D+\a when we accidentally put one more > after the capturing name.

  • Easy to avoid ambiguity

  • For example, ? could be a quantifier meaning 0 or 1 match. It could also be a non-greedy (lazy) modifier for quantifiers. It's easy to be distinguished by Maybe(...) and Lazy(...) (or quantifiers with lazy=True).

  • Easily avoid unbalanced parentheses/brackets/braces

  • Especially when we want to match them. For example, Capture("(") instead of (\().

Usage

More examples

  • Matching a phone number like XXX-XXX-XXXX or (XXX) XXX XXXX

    Regexr(
        START,
        # match the first part
        Maybe(Capture('(', name="open_paren")),
        RepeatExact(DIGIT, m=3),
        Conditional("open_paren", yes=")"),
    
        Maybe(OneOfChars('- ')),
    
        # match the second part
        RepeatExact(DIGIT, m=3),
    
        Maybe(OneOfChars('- ')),
    
        # match the third part
        RepeatExact(DIGIT, m=4),
        END,
    )
    
    # compiles to
    # ^(?P<open_paren>\()?\d{3}(?(open_paren)\))[- ]?\d{3}[- ]?\d{4}$
    
  • Matching an IP address

    # Define the pattern for one part of xxx.xxx.xxx.xxx
    ip_part = Or(
        # Use Concat instead of NonCapture to avoid brackets
        # 250-255
        Concat("25", OneOfChars('0-5')),
        # 200-249
        Concat("2", OneOfChars('0-4'), DIGIT),
        # 000-199
        Concat(Or("0", "1"), RepeatExact(DIGIT, m=2)),
        # 00-99
        Repeat(DIGIT, m=1, n=2),
    )
    
    Regexr(
        START,
        ip_part,
        RepeatExact(DOT, ip_part, m=3),
        END,
    )
    # compiles to
    # ^(?:25[0-5]|2[0-4]\d|(?:0|1)\d{2}|\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|(?:0|1)\d{2}|\d{1,2})){3}$
    
  • Matching an HTML tag roughly (without attributes)

    Regexr(
        START,
        "<", Capture(WORDS, name="tag"), ">",
        Lazy(ANYCHARS),
        "</", Captured("tag"), ">",
        END,
    )
    # compiles to
    # ^<(?P<tag>\w+)>.+?</(?P=tag)>$
    

Pretty print a Regexr object

With the example at the very beginning (matching an URL), we can pretty print it:

# print(regexr.pretty())
# prints:

^
(?P<protocol>http|https|ftp|mailto|file|data|irc)
://
(?P<domain>
  [A-Za-z0-9-]{0,63}
  (?:\.[A-Za-z0-9-]{0,63})+
)
(?::(?P<port>\d{1,4}))?
(?P<path>
  /*
  (?:/*[A-Za-z0-9\-._]+/*)*
)
(?:\?(?P<query>.*?))?
(?:\#(?P<fragment>.*))?
$

Compile a Regexr directly

Regexr("a").compile(re.I).match("A")
# <re.Match object; span=(0, 1), match='A'>

API documentation

https://pwwang.github.io/regexr/

TODO

  • Support bytes