Behavior of "grep -w -f" when there are multiple/incomplete matches

https://stackoverflow.com/questions/23505891

16-07-2023

Question

I am using grep -w -f to extract lines from a file that match a pattern. However, if there is an incomplete match with a pattern in the input pattern file, it appears to be masking the complete match that appears later in the input pattern file. Is there another grep option that I am missing? For example:

$ head list
tt140
tt1351
tt1354
tt998
tt1122

$ head match1
tt135
tt1122
tt1351

$ grep -w -f match1 list
tt1122

It appears that the first pattern tt135 in match1 interferes with the later tt1351. If the first line is removed, the tt1351 match is reported.

$ head match2
tt1122
tt1351

$ grep -w -f match2 list
tt1351
tt1122

Is this the expected behavior? Is there another option to pass to grep to avoid this?

Solution

Indeed, as @japyal states, there appears to be a bug in the BSD version of grep (which also affects OS X).

Workaround:

 grep -f <(sed 's/.*/\\<&\\>/' match1) list

This dynamically encloses the strings in match1 in explicit word-boundary regex assertions, as if match1 had been defined as:

\<tt135\>
\<tt1122\>
\<tt1351\>

The net effect is the same as if -w had been specified.
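
To see the workaround end to end, here is a minimal sketch that recreates the sample files from the question in a scratch directory and writes the rewritten patterns to a file instead of using process substitution. The directory and the file name match1.anchored are made up for illustration, and the sketch assumes a grep that understands the \< and \> word-boundary assertions.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' tt140 tt1351 tt1354 tt998 tt1122 > "$workdir/list"
printf '%s\n' tt135 tt1122 tt1351 > "$workdir/match1"

# Wrap every pattern in \< ... \> and save the result
# (match1.anchored is a hypothetical file name).
sed 's/.*/\\<&\\>/' "$workdir/match1" > "$workdir/match1.anchored"

# Plain grep -f, without -w, using the anchored patterns.
result=$(grep -f "$workdir/match1.anchored" "$workdir/list")
printf '%s\n' "$result"
```

Keeping the anchored patterns in a file also makes it easy to inspect exactly what grep receives.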

Other tips

If you can't modify match1 as suggested by mklement0, you could recreate the -w -f functionality by using shell commands to construct an equivalent grep command:

$ egrep `cat match1 | xargs -d '\n' | sed 's/^/(\\</; s/$/\\>)/; s/ /\\>|\\</g;'` list
tt1351
tt1122

I don't have a Mac or BSD system to verify this on, but it works on Linux for me.

Explanation: The part in backticks constructs the desired regexp, which is then used in a vanilla egrep command.

$ cat match1 | xargs -d '\n' | sed 's/^/(\\</; s/$/\\>)/; s/ /\\>|\\</g;'
(\<tt135\>|\<tt1122\>|\<tt1351\>)
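
The same alternation can also be built without xargs -d, which is a GNU extension. The following is only a sketch under a few assumptions: the patterns contain no regex metacharacters, grep -E understands \< and \>, and the scratch directory and variable names are made up for illustration. It anchors each line with sed and then joins the lines with paste.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' tt140 tt1351 tt1354 tt998 tt1122 > "$workdir/list"
printf '%s\n' tt135 tt1122 tt1351 > "$workdir/match1"

# Anchor each pattern, then join all lines with "|" into one alternation.
pattern=$(sed 's/.*/\\<&\\>/' "$workdir/match1" | paste -sd '|' -)
printf '%s\n' "($pattern)"

# Use the combined expression as a single extended regexp.
result=$(grep -E "($pattern)" "$workdir/list")
printf '%s\n' "$result"
```

Because the lines are joined on newlines rather than on spaces, a pattern that itself contains a space is kept intact, whereas the sed 's/ /.../g' step in the original pipeline would split it.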

Licensed under: CC-BY-SA with attribution
