Question

I'm trying to learn regular expressions.

After reading this section on http://regular-expressions.info about laziness, greediness, and negated character classes as an alternative to laziness, I tried to use it on my own, but I can't figure out why the following wouldn't work.

echo "hello world is this the way?" | grep -oE '\<w[^\>]+\>'

Expected output:

world
way

Actual output:

world is this the way

Do word boundary characters (\< \>) need special escaping inside character classes?

I'm just doing this on the CLI (bash 4.2.45, OS X Mavericks) for testing purposes. Would that be a factor?

I know that \b is also a word boundary, but if I use it, as in \bw[^\b]+\b, I get the same output except that it includes the question mark.

Thanks!

Update:

I'm looking for an answer that uses a negated character class, in order to avoid backtracking in the regex engine, as explained here under An Alternative to Laziness. If it's not possible to use a negated character class, I'm looking for an explanation as to why.

Solution

You can use this pattern:

\bw\w+\b

This matches every word that starts with w and continues with one or more word characters.

With a negated character class, you would have to list every character you want to exclude, and there are surely more of those than the question mark. A word boundary is a zero-width position rather than a character, so it cannot be listed in a character class at all.
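If a negated class is still what you want, here is a rough sketch, assuming GNU grep (BSD grep on OS X may behave differently). In a POSIX bracket expression the backslash is an ordinary character, so [^\>] means "anything except a backslash or >", not "anything except a word end". The variable names and the a\b>c test string are made up for illustration.

```shell
# Sample line from the question.
input='hello world is this the way?'

# Inside [...] the backslash is literal, so [^\>] only excludes
# the two characters '\' and '>' (assumption: POSIX bracket rules, GNU grep).
literal=$(printf '%s\n' 'a\b>c' | grep -oE '[^\>]+')
printf '%s\n' "$literal"

# A negated class that lists real characters instead: stop at
# whitespace or punctuation. \< anchors the match at the start of a word.
words=$(printf '%s\n' "$input" | grep -oE '\<w[^[:space:][:punct:]]*')
printf '%s\n' "$words"
```

The first command splits the test string at the backslash and at >, which shows what the class really excludes. The second prints world and way on separate lines, because the class stops at the space and at the question mark without needing a trailing \>.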

Other tips

Since word boundaries are usually defined by white space, why not use:

\<w[^[:space:]]+\>

If you also want to match a lone w, you can use:

\<w[^[:space:]]*\>
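A quick way to see the difference between the two quantifiers, again only a sketch that assumes GNU grep; the sample sentence and variable names are invented for the test.

```shell
# Sample text with a one-letter word "w" and a longer word "wide".
sample='a w is wide'

# '+' needs at least one non-space character after the w,
# so the lone "w" is skipped.
plus=$(printf '%s\n' "$sample" | grep -oE '\<w[^[:space:]]+\>')
printf '%s\n' "$plus"

# '*' allows zero further characters, so the lone "w" matches too.
star=$(printf '%s\n' "$sample" | grep -oE '\<w[^[:space:]]*\>')
printf '%s\n' "$star"
```

Note that [^[:space:]] also accepts punctuation, so on the original sentence the class first consumes way? and the engine has to back off one character before \> can match. The output is still way, but that is exactly the backtracking the question wanted to avoid.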
Licensed under: CC-BY-SA with attribution