regexr
Regular expressions for humans
Instead of writing a regular expression to match a URL:
# needs to be compiled with re.X
regex = r'''
^(?P<protocol>http|https|ftp|mailto|file|data|irc)://
(?P<domain>[A-Za-z0-9-]{0,63}(?:\.[A-Za-z0-9-]{0,63})+)
(?::(?P<port>\d{1,4}))?
(?P<path>/*(?:/*[A-Za-z0-9\-._]+/*)*)
(?:\?(?P<query>.*?))?
(?:\#(?P<fragment>.*))?$
'''
You can write:
regexr = Regexr(
    START,
    ## match the protocol
    Or('http', 'https', 'ftp', 'mailto', 'file', 'data', 'irc', capture="protocol"),
    '://',
    ## match the domain
    Capture(
        Repeat(OneOfChars('A-Z', 'a-z', '0-9', '-'), m=0, n=63),
        OneOrMore(DOT, Repeat(OneOfChars('A-Z', 'a-z', '0-9', '-'), m=0, n=63)),
        name="domain",
    ),
    ## match the port
    Maybe(':', Capture(Repeat(DIGIT, m=1, n=4), name="port")),
    ## match the path
    Capture(
        ZeroOrMore('/'),
        ZeroOrMore(
            ZeroOrMore('/'),
            OneOrMore(OneOfChars('A-Z', 'a-z', '0-9', r'\-._')),
            ZeroOrMore('/'),
        ),
        name="path",
    ),
    ## match the query
    Maybe("?", Capture(Lazy(MAYBE_ANYCHARS), name="query")),
    ## and finally the fragment
    Maybe("#", Capture(MAYBE_ANYCHARS, name="fragment")),
    END,
)
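For comparison, the hand-written pattern from the top of this page can be exercised with the standard library alone. This is only a minimal sketch: the helper name `parse_url` and the sample URL are made up for illustration and are not part of regexr.

```python
import re

# The raw pattern from the beginning of this page.
# It relies on re.X (verbose mode), so whitespace and line breaks are ignored.
URL_PATTERN = re.compile(
    r'''
    ^(?P<protocol>http|https|ftp|mailto|file|data|irc)://
    (?P<domain>[A-Za-z0-9-]{0,63}(?:\.[A-Za-z0-9-]{0,63})+)
    (?::(?P<port>\d{1,4}))?
    (?P<path>/*(?:/*[A-Za-z0-9\-._]+/*)*)
    (?:\?(?P<query>.*?))?
    (?:\#(?P<fragment>.*))?$
    ''',
    re.X,
)


def parse_url(url):
    """Hypothetical helper: return the named groups, or None if nothing matches."""
    match = URL_PATTERN.match(url)
    return match.groupdict() if match else None


# Sample input invented for this sketch.
parts = parse_url("https://example.com:8080/a/b.html?x=1#top")
print(parts)
```

Each named group in the pattern corresponds to one `Capture(..., name=...)` (or `capture=...`) in the Regexr version above.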
Inspired by rex for R and Regularity for Ruby.
Why?
We have re.X to compile a regular expression in verbose mode, but sometimes it is still difficult to read/write and error-prone.
- Easy to read/write regular expressions
  - For example, []] might take a second to understand. But we can write it as OneOfChars("]"), which is easier to read.
- Easy to write regular expressions with autocompletion from IDEs
  - When we write raw regex, we can't get any hints from IDEs.
- Non-capturing groups wherever possible
  - For example, with Maybe(Maybe("a", "b")) we get (?:(?:ab)?)?
- Easy to avoid unintentional errors
  - For example, it is sometimes difficult to debug r"(?P<a>>\d+)\D+\a" when we accidentally put one more > after the capturing name.
- Easy to avoid ambiguity
  - For example, ? could be a quantifier meaning 0 or 1 match. It could also be a non-greedy (lazy) modifier for quantifiers. The two are easy to tell apart with Maybe(...) and Lazy(...) (or quantifiers with lazy=True).
- Easy to avoid unbalanced parentheses/brackets/braces
  - Especially when we want to match them. For example, Capture("(") instead of (\().
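The raw-regex pitfalls listed above can be reproduced with the standard `re` module alone. The snippet below is a small sketch; the sample strings and variable names are invented for illustration and nothing here uses regexr itself.

```python
import re

# "[]]" is a character class holding a single literal "]".
only_bracket = re.findall(r"[]]", "a]b]")

# "?" as a quantifier: the preceding "u" may appear 0 or 1 times.
optional_u = re.compile(r"colou?r")
both_spellings = [bool(optional_u.fullmatch(w)) for w in ("color", "colour")]

# "?" as a lazy modifier: ".+?" stops at the first ">" instead of the last.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

# Nested optional groups written with (?:...) do not create capture groups.
nested = re.compile(r"(?:(?:ab)?)?")

# A stray ">" after the group name is not a syntax error: it silently
# becomes a literal ">" that the input must then contain.
typo = re.compile(r"(?P<a>>\d+)")
without_bracket = typo.match("123")
with_bracket = typo.match(">123")

print(only_bracket, both_spellings, greedy, lazy, nested.groups)
print(without_bracket, with_bracket.group("a"))
```

The last case is the kind of mistake that is hard to spot: the pattern compiles without complaint and simply stops matching the intended input.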
Usage
More examples
- Matching a phone number like XXX-XXX-XXXX or (XXX) XXX XXXX

    Regexr(
        START,
        # match the first part
        Maybe(Capture('(', name="open_paren")),
        RepeatExact(DIGIT, m=3),
        Conditional("open_paren", yes=")"),
        Maybe(OneOfChars('- ')),
        # match the second part
        RepeatExact(DIGIT, m=3),
        Maybe(OneOfChars('- ')),
        # match the third part
        RepeatExact(DIGIT, m=4),
        END,
    )
    # compiles to
    # ^(?P<open_paren>\()?\d{3}(?(open_paren)\))[- ]?\d{3}[- ]?\d{4}$

- Matching an IP address

    # Define the pattern for one part of xxx.xxx.xxx.xxx
    ip_part = Or(
        # Use Concat instead of NonCapture to avoid brackets
        # 250-255
        Concat("25", OneOfChars('0-5')),
        # 200-249
        Concat("2", OneOfChars('0-4'), DIGIT),
        # 000-199
        Concat(Or("0", "1"), RepeatExact(DIGIT, m=2)),
        # 0-99 (one or two digits)
        Repeat(DIGIT, m=1, n=2),
    )

    Regexr(
        START,
        ip_part,
        RepeatExact(DOT, ip_part, m=3),
        END,
    )
    # compiles to
    # ^(?:25[0-5]|2[0-4]\d|(?:0|1)\d{2}|\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|(?:0|1)\d{2}|\d{1,2})){3}$

- Matching an HTML tag roughly (without attributes)

    Regexr(
        START,
        "<", Capture(WORDS, name="tag"), ">",
        Lazy(ANYCHARS),
        "</", Captured("tag"), ">",
        END,
    )
    # compiles to
    # ^<(?P<tag>\w+)>.+?</(?P=tag)>$
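To see how the three compiled patterns above behave, they can be pasted into the standard `re` module and tried on a few inputs. This sketch does not use regexr: the pattern strings are copied from the "compiles to" comments, while the constant names, the `matches` helper, and the sample inputs are assumptions made for illustration.

```python
import re

# Pattern strings copied from the "compiles to" comments above.
PHONE = re.compile(r"^(?P<open_paren>\()?\d{3}(?(open_paren)\))[- ]?\d{3}[- ]?\d{4}$")
IP = re.compile(
    r"^(?:25[0-5]|2[0-4]\d|(?:0|1)\d{2}|\d{1,2})"
    r"(?:\.(?:25[0-5]|2[0-4]\d|(?:0|1)\d{2}|\d{1,2})){3}$"
)
HTML_TAG = re.compile(r"^<(?P<tag>\w+)>.+?</(?P=tag)>$")


def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches the text."""
    return pattern.match(text) is not None


# The conditional (?(open_paren)\)) demands ")" only when "(" was captured.
phone_results = {
    text: matches(PHONE, text)
    for text in ("555-123-4567", "(555) 123 4567", "(555-123-4567", "555)123-4567")
}

# Each part must stay within 0-255, and there must be exactly four parts.
ip_results = {
    text: matches(IP, text)
    for text in ("192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3")
}

# The backreference (?P=tag) forces the closing tag to equal the opening tag.
tag_match = HTML_TAG.match("<b>bold</b>")
mismatched = matches(HTML_TAG, "<b>bold</i>")

print(phone_results)
print(ip_results)
print(tag_match.group("tag"), mismatched)
```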
Pretty print a Regexr object
With the example at the very beginning (matching a URL), we can pretty print it:
# print(regexr.pretty())
# prints:
^
(?P<protocol>http|https|ftp|mailto|file|data|irc)
://
(?P<domain>
[A-Za-z0-9-]{0,63}
(?:\.[A-Za-z0-9-]{0,63})+
)
(?::(?P<port>\d{1,4}))?
(?P<path>
/*
(?:/*[A-Za-z0-9\-._]+/*)*
)
(?:\?(?P<query>.*?))?
(?:\#(?P<fragment>.*))?
$
Compile a Regexr directly
Regexr("a").compile(re.I).match("A")
# <re.Match object; span=(0, 1), match='A'>
API documentation
https://pwwang.github.io/regexr/
TODO
- Support bytes