Question

I know I'm doing something boneheaded here, but a regex cheat sheet given to us in my Unix/Linux class implied (by my reading) that this should find lines in a text file that contain years (numbers from 0000 to 9999):

grep \d{4} file.txt

Why does it find nothing instead? This is using bash, with the same result over PuTTY on Windows and in Terminal on Mac. I've tried variations with parentheses and quotation marks with no effect. Searching [0-9][0-9][0-9][0-9] works, but nothing with \d or {4} seems to function as I expect.

On a related note, why does .\+ have the effect I would expect .+ to have, while .+ just plain "doesn't work"? (I know that it tells grep to look for something, but I don't know what.) Namely, .\+ seems to be the way to ask for "one or more characters here" and not "one character followed by the plus sign". (It was the correct way to do an assignment, and the teacher couldn't explain to me why it was that way.) And how would one search for "one character followed by the plus sign"?

Solution

Your cheat-sheet may state that \d{4} is a valid regex meaning "four digits"; and it may state that grep searches a file for a regex. Taken separately, both of these statements are true. But taken together, they're highly misleading, since grep PATTERN FILE expects one kind of regular expression (POSIX "Basic Regular Expressions", BREs), whereas \d and the unescaped {4} are notations from a different kind of regular expression (sometimes called "Perl Compatible Regular Expressions", PCREs, after the Perl programming language). In a BRE there is no \d, and a repetition count is written \{4\} rather than {4}.

Many versions of grep support a -P flag to indicate that the pattern is a PCRE rather than a BRE; you can try:

grep -P '\d{4}' file.txt

(Note the single-quotes around \d{4}. These are necessary, because otherwise Bash will take \d as a sort of shorthand for 'd', so the actual pattern passed to grep will be d{4}, meaning "four d's" instead of "four digits". Alternatively, you could write grep -P \\d{4} file.txt, which solves the same problem in a different way.)
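To see what the shell actually hands to grep, one can substitute printf for grep and print the argument back. The following is a minimal sketch, assuming Bash (other POSIX shells should behave the same way for these cases); show_args is a hypothetical helper name introduced only for this illustration.

```shell
# Hypothetical helper: print each argument exactly as the shell passes it on.
show_args() { printf '%s\n' "$@"; }

unquoted=$(show_args \d{4})     # the shell removes the backslash: d{4}
single=$(show_args '\d{4}')     # single quotes keep the backslash: \d{4}
escaped=$(show_args \\d{4})     # \\ becomes one literal backslash: \d{4}

printf 'unquoted: %s\n' "$unquoted"
printf 'single:   %s\n' "$single"
printf 'escaped:  %s\n' "$escaped"
```

Only the last two forms deliver the backslash to grep, which is why the unquoted pattern ends up searching for the letter d.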


Edited to add: Sorry, I failed to cover the second part of your question, about +. So, according to the relevant specs,[1] this:

grep .+ file.txt

uses . to mean "any character besides NUL" and + to mean "an actual plus-sign". So it really should print the lines of file.txt that contain a non-initial plus-sign; if you're seeing different behavior, then your shell and/or grep must be nonconformant.

Furthermore, this:

grep .\+ file.txt

is the same as the above, because a conforming POSIX shell (such as Bash) will treat \+ as a fancy way of writing +, so grep will see the same arguments as before. (grep will have no way of knowing that you typed .\+ rather than .+.)

Lastly, this:

grep '.\+' file.txt

(where the \ is actually passed through to grep) has undefined behavior: a given grep implementation can take it to mean the same thing as .+, or it can take \+ to be a special notation meaning "one or more" (or something else), or it can give an error message. The GNU implementation, as it happens, takes the "one or more" interpretation, but others may differ.
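Since the meaning of \+ in a BRE is implementation-defined, it can help to use spellings whose behavior the specs do define. The sketch below assumes a POSIX grep that supports -E (Extended Regular Expressions) and uses a small made-up sample file created with mktemp; the variable names are illustrative only.

```shell
# Hypothetical sample data: three non-empty lines and one empty line.
sample=$(mktemp)
printf '%s\n' 'a+b' '+' 'abc' '' > "$sample"

# BRE (the default): + is an ordinary character, so this means
# "any character followed by a literal plus sign".
bre_literal_plus=$(grep '.+' "$sample")

# ERE (-E): + means "one or more", so this matches every non-empty line.
ere_one_or_more=$(grep -E '.+' "$sample")

# ERE with an escaped plus: "any character followed by a literal plus sign".
ere_literal_plus=$(grep -E '.\+' "$sample")

# BRE spelling of "one or more characters" that avoids \+ entirely.
bre_one_or_more=$(grep '..*' "$sample")

rm -f "$sample"
printf 'BRE .+ matched:\n%s\n' "$bre_literal_plus"
printf 'ERE .+ matched:\n%s\n' "$ere_one_or_more"
```

Note that the line consisting of a lone + does not match "any character followed by a plus sign", because nothing precedes the plus.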

Footnotes:

  1. Namely the grep spec, and the BRE and ERE spec (which the grep spec links and refers to). Also relevant is the shell spec, since it's the shell that decides the actual arguments that get passed to grep.

OTHER TIPS

By default, grep uses the POSIX basic regex flavor (BRE), which doesn't include \d. To use your expression, you need to switch to PCRE with the -P option:

grep -P \\d{4} file.txt 

This will print every line of file.txt that contains four consecutive digits.

If your version of grep happens to not support -P, spell out the digit class and escape the braces, which works in a plain BRE:

grep "[0-9]\{4\}" file.txt
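For greps without -P, the four-digit search can be written without \d at all. This is a small sketch, assuming a POSIX grep and a made-up sample file; the three patterns are meant to be equivalent ways of asking for four consecutive digits.

```shell
# Hypothetical sample data for the illustration.
sample=$(mktemp)
printf '%s\n' 'born in 1984' 'room 12' 'ddddd' > "$sample"

# BRE (the default): the interval braces must be backslash-escaped.
bre=$(grep '[0-9]\{4\}' "$sample")

# ERE (-E): the braces are written unescaped.
ere=$(grep -E '[0-9]{4}' "$sample")

# The POSIX character class [:digit:] works in both flavors.
cls=$(grep -E '[[:digit:]]{4}' "$sample")

rm -f "$sample"
printf '%s\n' "$bre" "$ere" "$cls"
```

None of these match the line of d's, which is the false positive an unquoted or unsupported \d can produce.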

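As to your other questions: in grep's default flavor, .+ matches any character followed by a literal + sign. In GNU grep, .\+ matches one or more of any character, provided the backslash is quoted so that it actually reaches grep; other implementations may treat \+ differently.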

Licensed under: CC-BY-SA with attribution