Applied string mangling
Regexes, parsing, tokenising etc
December 8, 2019 — July 5, 2021
A.k.a. Un-natural language processing.
1 Regexp
A.k.a. regexes. A.k.a. “regular expressions”, a name that presumably reflects a principled origin in the theory of syntax. However, regexes as commonly encountered are one particular notation for specifying a language, rather than a neutral way of describing arbitrary regular languages.
The default approach to string matching, available in a variety of flavours, all equally boring.
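One way to see the gap between regexes as encountered and regular languages in the formal sense: most practical flavours support backreferences, which can match languages that are not regular. The following is a minimal sketch in Python's `re` flavour (my choice of flavour for illustration; the helper name `is_doubled` is hypothetical).

```python
import re

# A backreference (\1) demands a repeat of whatever group 1 captured.
# The set of "doubled" strings ww is not a regular language, yet this
# pattern matches it (for non-empty w that contains no newlines, since
# "." does not match a newline by default).
DOUBLED = re.compile(r"(.+)\1")


def is_doubled(s: str) -> bool:
    """Return True if s is some non-empty string written twice in a row."""
    return DOUBLED.fullmatch(s) is not None


print(is_doubled("abab"))    # True
print(is_doubled("abcabc"))  # True
print(is_doubled("aba"))     # False
```

Other flavours spell backreferences slightly differently, and some engines deliberately omit them, so check the documentation for the one at hand.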
Because these are so ubiquitous, useful, and boring, there are a million bikeshedded tools for interactive regex design.
- AutoRegex: Convert from English to RegEx with Natural Language Processing.
- ihateregex visualises regexes and helps design them interactively.
- regexper visualises regexes beautifully.
- extendaclass tests and visualises regexes in PHP, Python and JavaScript flavours.
- Rubular is a Ruby-based regular expression editor.
- regex101 is similar.
- regexr is much the same.
Comby is a parsing-based search-and-replace tool designed for code.
1.1 Handy regexes
2 Parsers
The ad hoc world of regexes not cutting it? Why not generate a parser? Since every computer language out there needs one, there are a lot of options. Since regexes can already handle regular languages, you are probably looking for a parser for deterministic context-free languages. I do not have much to say, except maybe check the Wikipedia list? Or why not use David Beazley’s SLY? That looks like a nice parser.
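To make the regex/parser division of labour concrete, here is a minimal sketch of a hand-written recursive-descent parser for arithmetic with nested parentheses. It is not SLY and not a generated parser; it is a stand-in that needs only the standard library. The grammar and the names `tokenise`, `Parser` and `evaluate` are my own assumptions for illustration. A regex handles the tokens, which form a regular language, and recursion handles the nesting, which does not.

```python
import re

# Tokens are a regular language, so a regex suffices for this stage.
TOKEN = re.compile(r"\s*(?:(?P<num>\d+(?:\.\d+)?)|(?P<op>[-+*/()]))")


def tokenise(text):
    """Turn text into a list of ('NUM', float) and ('OP', char) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"cannot tokenise input near position {pos}")
        if m.group("num") is not None:
            tokens.append(("NUM", float(m.group("num"))))
        else:
            tokens.append(("OP", m.group("op")))
        pos = m.end()
    return tokens


class Parser:
    """Hypothetical grammar, one method per rule:
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := NUM | '-' factor | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", None)

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            _, op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            _, op = self.take()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        kind, val = self.take()
        if kind == "NUM":
            return val
        if (kind, val) == ("OP", "-"):
            return -self.factor()
        if (kind, val) == ("OP", "("):
            value = self.expr()  # recursion is what handles nesting
            if self.take() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token: {val!r}")


def evaluate(text):
    """Parse and evaluate an arithmetic expression in one pass."""
    parser = Parser(tokenise(text))
    value = parser.expr()
    if parser.peek() != ("EOF", None):
        raise SyntaxError("trailing input after expression")
    return value


print(evaluate("2 * (3 + 4) - 10 / 5"))  # 12.0
```

A parser generator automates roughly this: you write down the grammar and it produces the equivalent of the methods above, usually with better error handling.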