Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --JamieZawinski
- Also known as regex or regexp (RegExp)
are a fundamental building block of ComputerScience
. In their basic form, the language of RegularExpressions
is formally equivalent to FiniteAutomata
s), both of the deterministic and nondeterministic flavors. Regexes are also extremely commonly used in Unix, and are basically the heart of the Perl programming language.
They are also a wonderful way to write programs that are WrongButGoodEnough?
are most often incorrect - to do a correct search, you usually need to consider the structure of the file. -- LexSpoon
Note that the programming language feature
known as regular expressions, and the thing from computation theory (the set of grammars which can be recognized by a DFA) are two different things. RegularExpressionsArent
I find RegularExpressions
a wonderful way of coming up with programs that are Right, let alone GoodEnough
. In practice, RegularExpressions
are usually capable of handling the structure of the file (in other words, for simple enough structures). Granted, there is often the temptation to use overly simplified regex - but this way lies madness. If it is a relatively sane parsing problem, RegularExpressions
are just the ticket. -- AnonymousDonor
are surprisingly good at analysing imprecise, free-form or natural-language material. They really suck for structured-text analysis, which I think is what LexSpoon
is talking about.
I guess a good exercise is to try and make a RegularExpression
that matches all SGML/XML tags within text. You have to make sure they are not within a quoted environment, and in searching the end of the tag, you have to take into account comments, quotes and quoted quotes. And that doesn't yet even check the tag is correctly written. The complete RegularExpression
might get many lines long. -- PanuKalliokoski
Actually, I believe complete XML parsing with RegularExpressions
isn't possible, due to limitations inherent to RegularExpressions
. It's been a couple of years since I read a book about 'compiler building' (the new DragonBook?
?, I believe it was), and I'd have to crack it open again to fill in the details. (And I believe there is some information on wiki about the three levels of grammars, but I can't find it. Anyhow, only the most restricted of those levels can be parsed with only regexps. And XML isn't in that level).
Matching all tags
is not the same as parsing
. XML parsing is indeed impossible with Regular Expressions, as is anything that allows arbitrarily deep nesting (the touchstone of a non-regular language), but that doesn't imply that it is impossible to at least pluck out all tags with a nice, long regular expression. I don't think it's impossible but a standard Finite Automata to do so would be as ugly as sin, because standard Finite Automata do not fare well with rules like "Anything but an A followed by a B"... nest two or three of those within each other and it passes the human ability to read or write with the multiplicity of states, but it is still, technically, an FA. A regular expression can be written because while a correct one must worry about CDATA and commenting, there are still a finite number of permutations of "modes" you can be in, because you can't be in a "CDATA nested in a comment nested in CDATA nested in CDATA".
A few pages on this wiki:
, move to bottom?]
for the regular expressions this Wiki uses.
Other sources of information on RegularExpressions:
Jeffrey E.F. Friedl's book MasteringRegularExpressions
has more information than you probably want to know, including specifics of POSIX RegularExpressions
For an online tutorial see: http://www.regular-expressions.info/about.html
Regular expressions are a broken form of PrologLanguage
. Without cuts!
You can't write a serious program with regular expressions. Even toy programs end up having to reimplement cuts outside of regexp. That's because each additional subexpression explodes the space which has to be backtracked, and the run time ends up being very polynomial.
Concrete example: I have a regular expression made up of 27 distinct subexpressions, and I want to match the whole sucker to a string that contains a single
matching entry. Matching 17 of those 27 subexpressions takes 0.8 seconds. 19 takes 8 seconds. I ran out of patience so I don't even know how long it takes to match 21 out of 27.
Oh, and here's the bad thing. Regular expressions are also
polynomial in the number of matching entries
they're working over. But if you limit it to 3 subexpressions, it's linear. Go figure.
The problem is Perl (or some other PCRE language). Use awk (lol) and/or see http://swtch.com/~rsc/regexp/regexp1.html
Restructuring my program so it CAN make use of cuts, which of course I'll have to implement, will be a pain.
s were introduced by S.C. Kleene to describe the McCulloch?
and Pitts 1943 finite automata model of neurons. ("Representation of Events in Nerve Nets", p3-40 in ClaudeShannon
"Automata Studies", 1956)
The first application of RegularExpression
s to editor search/replace (in the QED editor) was by KenThompson
, who published a RegularExpression
-to-NFA algorithm in 1968, "Regular Expression Search Algorithm", CACM 11:6, 419-422
went on to reimplement this in the Unix ed
editor, which BillJoy
turned into the vi
adapted the ed
code for grep
. (Some years after its creation, Emacs eventually borrowed the idea of RegularExpression
s, but not the code, directly from these Unix editors -- RMS, private communication)
(prior to, and building towards, his Unix yacc
tool) and MikeLesk?
(in the Unix lex
) did some of the earliest applications of RegularExpression
s to compiler lexical analyzers via automated DFA-building tools.
Awk is a scripting language/command line tool derived directly from this Unix Culture of RegularExpression
s; it is no coincidence that the language most famous for RegularExpression
s today, perl
, was developed in a Unix environment, inspired by awk
and other Unix RegularExpression
s were thus widespread in Unix tools of all sorts from the beginning, years to decades before this technology was widespread elsewhere (although obviously there were exceptions), and RegularExpression
s have always been an extremely important (albeit under-acknowledged) part of UnixCulture
, contributing to the historical attitudes of SmugUnixWeenies?
that other systems JustDontGetIt
Pipes have historically been considered to have been the KillerApp
of Unix, but there's a strong argument that it was actually RegularExpression
s. Never heard that, but if its a "strong" argument, I'm convinced
RegExp Coach http://farm.tucows.com/blog/_archives/2004/10/27/167823.html
RegexCoach (same as above?) http://www.weitz.de/regex-coach/
- Very good! And in Lisp, for all the SmugLispWeenies out there.
Steve Ramsay's Guide to Regular Expressions http://etext.lib.virginia.edu/helpsheets/regex.html
- a regular expression development and execution tool, supporting more than 40 search programs and programming languages.
regexp visualizer - can highlught what each part of the regexp matches - very useful when learning regexps
I've been known to refer to RegularExpressions
as "a precise notation for fuzzy matches
" or "a precise notation for expressing fuzzy search logic
" when explaining them to people familiar only with "string matching" as their search method.
The concept of inclusion in/exclusion from sets, and a grammar for saying things like "any digit at the beginning of a line, followed by at least 1 but not more than 12 of any non-punctuation character, followed by any combination of alpha-numeric characters joined to a set of non-empty parentheses, said line ending in a semicolon, possibly followed by white space" in a compact and exact notation is not necessarily easily taught, but the look of revelation when one "gets it" is priceless.
As a pedagogical tool for helping people to "get it", that's interesting. I suppose it doesn't matter that it's technically inaccurate to call it fuzzy matching, since it depends on the pattern whether or not one specifies a literally fuzzy match in the technical ZadehLotfi FuzzyLogic sense, since these people won't be familiar with FuzzyLogic now nor in the future?
Just so. I originally hesitated before using "fuzzy" to describe it, but you're right, FuzzyLogic
is well beyond the scope of where they'll go (though I usually throw in a disclaimer that I'm playing fast and loose with the term). The main hurdle for them is usually the use of "any" and "not" in conjunction with "at least" and so on. It's getting them to the point where they realize that they're not looking for precise spellings and not confined by having to examine every occurrence of some word just because they can't remember exactly
what words are around it. It's a little liberty I take -- somewhat in the name of WittgensteinsLadder
-- to get over the concept hump.
I think I just broke a refactoring record. I was rewriting code from a BadProgrammer
and I chopped 127 lines into 7 using a RegularExpression
:) The Regex parsed an XML file stream and split out 2 elements by name, with values optionally in quotes. The other programmer manually parsed the file, split it into arrays, split those into key/value pairs, foreach-ed through those till he found the ones he wanted, stored them into separates strings, then later recombined them into the final XML. OMG what a nightmare it was, but I will never go back to a non-RegularExpression
language! (I'm a former VB6 junkie)
- RegularExpression was not supported well in MS environments (buggy and slow) before. Why did you not use the Xml parsers to do what you need to do?
18.104.22.168 -- allows Find in this page, but without RegularExpression
s -- Maybe the OpenSource
contributors can add that.
See also: AlternativesToRegularExpressions