Invariably, people on Stack Overflow ask questions about how to parse XML, HTML and their ugly daughter XHTML. Ignoring the most obvious solution to their problem (which would be to use a pre-existing XML parser), they think they should use regular expressions (regex for short). Now they have two problems, to quote the famous anti-regex saying.

Now, don't get me wrong. I like regular expressions. Parsing, formal languages, finite automata: these are among my favourite things. Regular expressions course through my veins. That's why it hurts me so when people try to misapply them and then emit quotes like the one in the previous paragraph.

Regular expressions were not designed to apply in every situation. They are not even remotely close to a universal parser (that would be a Turing machine). For some reason, inexperienced programmers equate the concept of parsing things with using regular expressions. Not only is that just plain misinformed, it's often more work to bend an algorithm to use regular expressions in the wrong situation than it is to do something more directly.

Determining if regex is right for your situation is simple. Figure out the structure of the things you want to match. The set of all words you want to match is called a language. When you define a regular expression, you're defining a language. Since the language is defined by a regular expression, the language is called "regular". Regular languages are rather limited. They can't express things like reversal (in other words, no palindromes) or simple counting (no strings with the same number of 'A's as 'B's). With enough experience, you can easily get a feel for what languages can and cannot be expressed using regular expressions.
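To make the distinction concrete, here is a minimal sketch in Python (my choice of language for illustration, nothing above depends on it). The pattern A*B* defines a regular language, while "same number of 'A's as 'B's" needs counting, so the sketch checks it with ordinary code instead. The function names are made up for this example.

```python
import re

# A regular language: any number of 'A's followed by any number of 'B's.
A_THEN_B = re.compile(r"A*B*")


def in_regular_language(word: str) -> bool:
    """True if the whole word matches the regular expression A*B*."""
    return A_THEN_B.fullmatch(word) is not None


def same_number_of_a_and_b(word: str) -> bool:
    """Counting check done in plain code, since it is not a regular property."""
    return word.count("A") == word.count("B")


for word in ["AABB", "AAB", "BA"]:
    print(word, in_regular_language(word), same_number_of_a_and_b(word))
```

Note that "AAB" is happily accepted by A*B* even though the counts differ: the pattern can describe the order of the letters but has no way to compare how many there are.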

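Until that level of experience is attained, there are formal ways of proving whether a language is regular. The easiest way to prove a language is regular is by coming up with a regular expression that defines it exactly. Often, the easiest way to prove a language is not regular is using something called the pumping lemma, which says that every sufficiently long word in a regular language has a section near its start that can be repeated any number of times without leaving the language. Find a long word where that repetition always breaks something, and the language cannot be regular.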

Now, if you'll humour me, I'm going to prove that XML is nonregular:

XML is very well defined, and I won't go into its (very tedious) detail here, but suffice it to say that a simple XML document is of the form <word>donuts</word>, as long as both "word" parts are the same.

Informally, the pumping lemma tells us we need to come up with a valid XML document with certain specific properties. In this case, I'll be using the document <a^n>donuts</a^n>, where a^n means the letter a repeated n times (so n = 3 gives <aaa>donuts</aaa>). The n in this example is the one required by the pumping lemma (its "pumping length"). This is a valid XML document for any value of n greater than or equal to 1.

Now, the next step is to note that the first n characters in the XML document contain at least one character a whenever n is two or greater.

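So, it's easy to see that if you were to take any non-empty substring of a characters from the first n characters of the document and repeat it, the opening tag would have a number of a characters other than n. (If the substring includes the opening <, the document breaks even more obviously.) But this repetition did not affect the closing tag, which still has exactly n of them. The two tags no longer match, so the document is no longer valid XML.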

If XML were a regular language, the document would still be valid. Therefore, XML must not be a regular language.
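If you'd rather see the argument run than read it, here is a rough sketch in Python. It builds a document whose tag name is n copies of the letter a, repeats a couple of characters near the start, and asks the standard library's xml.etree.ElementTree parser whether each version is still well-formed. The helper names and the choice of n = 5 are mine, purely for illustration; the real pumping length is whatever the lemma hands you.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def make_document(n: int) -> str:
    """Build <a...a>donuts</a...a> with n copies of 'a' in each tag."""
    tag = "a" * n
    return f"<{tag}>donuts</{tag}>"


def pump(document: str, start: int, end: int, times: int) -> str:
    """Replace document[start:end] with `times` copies of itself."""
    return document[:start] + document[start:end] * times + document[end:]


n = 5  # stand-in for the pumping length (an assumption for this demo)
doc = make_document(n)
pumped = pump(doc, 1, 3, 2)  # repeat "aa", which lies within the first n characters

print(doc, is_well_formed(doc))
print(pumped, is_well_formed(pumped))
```

The pumped document has seven a characters in its opening tag and still five in its closing tag, so the parser rejects it. This only demonstrates one particular choice of substring, of course; the proof needs the same thing to happen for every choice.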

Now, I recognize this proof might be over the heads of some people and obviously under the heads of those who are as familiar with the pumping lemma as I am (I leave filling in the edge cases of the proof as an exercise to the reader), but the bottom line is XML is not regular. Therefore, regular expressions are unsuited to parsing XML, since they clearly cannot be used to define an XML document. That's why, whenever anyone tries to parse XML with regular expressions, the expressions become more unwieldy and complicated the more they are refined. Misapplying regular expressions is what causes many to believe that regular expressions themselves are useless or overcomplicated.

So, I ask you, please stop trying to parse XML (or any other non-regular language) with regular expressions. It's a bad idea, and I can prove it formally.

I recognize that XML still needs to be parsed! Luckily, most languages have built-in libraries that parse XML for you without you having to worry about the details: .NET has System.Xml, Java has javax.xml, etc. There are even languages specifically designed to process XML quickly and reliably, like XPath, XSL and XQuery. These tools are all very easy to use and let you avoid writing overcomplicated regular expressions that won't even work in every situation anyway.
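As a small illustration of how little work this takes, here is a sketch using Python's built-in xml.etree.ElementTree (Python rather than .NET or Java only to keep the example short). The sample document and its element and attribute names are invented for the example. The findall call takes a path in the limited XPath subset that this module supports.

```python
import xml.etree.ElementTree as ET

# A made-up document, just for illustration.
document = """<menu>
  <item price="1.50">donuts</item>
  <item price="2.00">coffee</item>
</menu>"""

root = ET.fromstring(document)

# Collect the text of every <item> directly under the root element.
names = [item.text for item in root.findall("./item")]

# Read an attribute from each <item> and convert it to a number.
prices = {item.text: float(item.get("price")) for item in root.iter("item")}

print(names)
print(prices)
```

The parser deals with the matching of tags, attributes and text content, which is exactly the part a regular expression cannot be trusted with.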

Use regular expressions only on regular languages. When in doubt, if you're finding your regex is getting convoluted, chances are you should be using something else.