Question

Can regexp pattern matching be used to check for palindromes within a given text in Java?

My intuition is that in order to check for palindromes, we may need to remember the character that was parsed, which cannot be done using regexp. Is that correct?

Was it helpful?

Solution

First off, when you start poking around with questions like this, you will mix the aspects of Computer Science and what is actually out there. Computer Science considers a regular expression different than what perl m// or java.util.regex can do.

This is the Apocalypse on Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

Larry wall from Apocalypse 5, June 4th, 2002

What exists in regex as we programmers know them is a significant superset of what traditional finite automata can do (that's a fun paper to read if you like the science part of CS).

The thing that this question boils down to is the pumping lemma on regular languages. This can be said that if you have a regular language, you can put things into the middle without changing if the language is accepted.

Consider a language of binary palindromes. It matches 101. It also matches 1001 and 10001 and so forth. You can put as many things in the middle there without changing if it should be accepted. If its regular, it will handle 10101010101010101 and so on.

However, a regular expression is just short hand for writing a finite automata. The word finite is key. It has a finite number of states. This means that if I put enough things in that word to be tested, it will repeat the same state twice if there are more elements to it than there are states (this is the pigeonhole principle). Once that happens, the state of the automata will get confused and either generate a false positive or a false negative... and thus, using the pumping lemma, it can be shown that a language containing palindromes of arbitrary length is not regular.


Lets look more into the pumping lemma for regular languages. The informal wording from wikipedia is:

Informally, it says that all sufficiently long words in a regular language may be pumped — that is, have a middle section of the word repeated an arbitrary number of times — to produce a new word that also lies within the same language.

So, lets take the language "a followed by some number of b followed by an a" which can be written as ab*a. Within this regular language you have aa and aba and abba and abbba. Applying the pumping lemma to this I can take the bbb part and stick in it in there again and if it still matches, it is a regular language. So, putting bbb in the middle of abba I get abbbbba which is still part of the language (refiddle). You can do this for any thing that is a regular language.

On the other hand, if you can put this in there and have something that isn't part of the language be matched, or should be part of the language but isn't matched you have a proof by contradiction that the original language isn't regular.

So, lets say "I have a language of binary palindromes of arbitrary length" The regular expression to match it is ([01]?)([01]?)([01]?)([01]?)([01]?)[01]?\5\4\3\2\1 (note, I'm being a bit free with the language here being an extended one with back reference shorthand, but noting that would take a bit more text to properly represent it otherwise).

The above regular expression will match 010 and reject 011. It will match 10101 but not 10111. Great. But the pumping lemma says that I should be able to put an arbitrarily large amount of the middle into the middle. Now, the reader should go back and look at that regular expression... it has 5 back references. It can't "store" more than 5 in there. Thus, if I was to put

12345.....54321
101001111100101

in there, it wouldn't match this string (refiddle), but that is a palindrome. You can stick any palindrome of 12 or more characters in there and it will not match.

Thus, using the pumping lemma, I can say "palindromes of arbitrary length is not regular". If it was, I could stick a thousand or million 1s in the middle there and pump it up... but I can't, so it isn't.

(Those who have dealt with regular languages before will likely recognize the palindrome of [01] is the parentheses matching language with a slight twist... which also isn't regular - see also Can regular expressions be used to match nested patterns? or To make sure: Pumping lemma for infinite regular languages only?)

A key part of the regular language is that it represented by a finite number of states. What the contradiction with 101001111100101 did was to try to make it longer than the longest possible word that the language could recognize - which broke it.

Why not add more states? Well, you can only put a finite number of states in there. ([01]?)...([01]?)[01]?\100000...\1 ... and I'll make a palindrome that is 200,002 characters long, and it won't recognize it. If you need an infinite number of states to present the language, it isn't a regular language (it might be context free (and palindromes are), but that's something else).

OTHER TIPS

No.

As others have mentioned, regular expressions cannot match palindromes of arbitrary length.

However, many programming languages have extended regular expressions in a way that puts them theoretically between regular and context-free languages, allowing them to match certain patterns that a strict regular expression by the textbook would not be able to match. Perl is one of these languages.

Java accepts several Perl extensions, but not the recursion tag required to implement this. What Perl can do is see that the first and last characters of the string match, then recursively use the same regex to attempt to match whatever is in the middle. Java does not support this nonstandard regex extension.

Here is another question that affirms it is not possible, and one more that offers a mathy CS proof of why it is not possible.

An answer from SO.

https://stackoverflow.com/questions/233243/how-to-check-that-a-string-is-a-palindrome-using-regular-expressions

I played around with these, and I had some success.

(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1

\W(.?)(.?)(.?)(.?)(.?).?\5\4\3\2\1\W

I think the real answer is that RE is a poor method for finding palindromes.

Licensed under: CC-BY-SA with attribution
scroll top