Question

Task:
- given: a list of image filenames
- todo: create a new list with filenames not containing the word "thumb" - i.e. only target the non-thumbnail images (with PIL - Python Imaging Library).

I've tried r".*(?!thumb).*" but it failed.

I've found the solution (here on stackoverflow) to prepend a ^ to the regex and to put the .* into the negative lookahead: r"^(?!.*thumb).*" and this now works.

The thing is, I would like to understand why my first solution did not work but I don't. Since regexes are complicated enough, I would really like to understand them.

What I do understand is that the ^ tells the parser that the following condition is to match at the beginning of the string. But doesn't the .* in the (not working) first example also start at the beginning of the string? I thought it would start at the beginning of the string and search through as many characters as it can before reaching "thumb". If so it would return a non-match.

Could someone please explain why r".*(?!thumb).*" does not work but r"^(?!.*thumb).*" does?

Thanks!


Solution 2

(Darn, Jon beat me. Oh well, you can look at the examples anyway)

Like the other guys have said, regex is not the best tool for this job. If you are working with filepaths, take a look at os.path.

As for filtering files you don't want, you can do if 'thumb' not in filename: ... once you have dissected the path (where filename is a str).
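
A minimal sketch of that idea, assuming the paths are plain strings and that only the file name (not the directory part) should be checked. The helper name is_thumbnail and the sample paths are made up for illustration:

```python
import os.path

def is_thumbnail(path):
    """Return True if the file name part of the path contains 'thumb'."""
    # basename drops the directory part, so a folder called "thumbs"
    # does not cause a false positive.
    return 'thumb' in os.path.basename(path)

# Hypothetical sample paths
paths = ['/tmp/thumbs/file1.jpg', '/tmp/thumbs/file1-thumb.jpg', '/tmp/file2.jpg']
non_thumbs = [p for p in paths if not is_thumbnail(p)]
print(non_thumbs)
```

Checking the whole path instead of the basename would also reject every file that merely lives in a directory whose name contains "thumb".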

And for posterity, here are my thoughts on those regexes. r".*(?!thumb).*" does not work because the first .* is greedy: it consumes the whole string, so the lookahead is only tested at the very end, where nothing follows and it trivially succeeds. Take a look at this:

>>> import re
>>> re.search('(.*)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
('/tmp/somewhere/thumb', '', '')
>>> re.search('(.*?)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
('', '', '/tmp/somewhere/thumb')
>>> re.search('(.*?)((?!thumb))(.*?)', '/tmp/somewhere/thumb').groups()
('', '', '')

The last one looks strange at first, but both groups are lazy, so each matches as little as possible, which is nothing at all.

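The other regex (r"^(?!.*thumb).*") works because the .* is inside the lookahead, so the lookahead scans the whole rest of the string for "thumb" before any characters are consumed. Whether you need the ^ depends on whether you use re.match or re.search: re.match only tries position 0, while re.search without the ^ keeps retrying at later positions until it finds one that is no longer followed by "thumb":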

>>> re.search('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
('', 'humb')
>>> re.search('^((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> re.match('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
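
To see what this means for the actual filtering task, here is a small sketch comparing the three combinations on a list of names. The sample filenames and variable names are assumptions made for this example:

```python
import re

# Hypothetical sample data
filenames = ['file1.jpg', 'file1-thumb.jpg', 'file2.jpg', 'thumb-file3.jpg']

anchored = re.compile(r'^(?!.*thumb).*')
unanchored = re.compile(r'(?!.*thumb).*')

# With ^, search can only succeed at position 0.
kept_search_anchored = [f for f in filenames if anchored.search(f)]

# match() only ever tries position 0, so the ^ is redundant here.
kept_match_unanchored = [f for f in filenames if unanchored.match(f)]

# search() without ^ retries at later positions and eventually finds one
# past the "t" of "thumb", so nothing is filtered out.
kept_search_unanchored = [f for f in filenames if unanchored.search(f)]

print(kept_search_anchored)
print(kept_match_unanchored)
print(kept_search_unanchored)
```

The first two lists contain only the non-thumbnail names, while the third contains every name.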

OTHER TIPS

Could someone please explain why r".*(?!thumb).*" does not work but r"^(?!.*thumb).*" does?

The first will always match: the .* first consumes the entire string, so the negative lookahead is tested at the very end, where nothing follows and it cannot fail. (Even if the lookahead did fail at some position, the .* would simply backtrack to a position where it succeeds.) The second is a bit convoluted: anchored at the start of the line, the lookahead scans ahead through any number of characters looking for 'thumb', and if it finds it, the entire match fails, because the line does begin with something followed by 'thumb'.
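
A rough illustration of the first point, using an example filename of my own. The capture groups are only there to show which part of the string each .* ends up with:

```python
import re

name = 'file1-thumb.jpg'  # hypothetical example

# The question's first pattern, with groups added around each .*
m = re.match(r'(.*)(?!thumb)(.*)', name)
groups = m.groups()      # the greedy first group swallows everything
lookahead_pos = m.end(1)  # position at which the lookahead was tested

# A bare lookahead only tests the single position it is applied at.
lookahead = re.compile(r'(?!thumb)')
at_thumb = lookahead.match(name, 6)    # name[6:] starts with "thumb"
after_thumb = lookahead.match(name, 7)  # name[7:] starts with "humb"

print(groups, lookahead_pos, len(name))
print(at_thumb, after_thumb)
```

The lookahead is tested at the end of the string, where it passes. It fails only at the single position directly in front of "thumb", and nothing forces the pattern to stop there.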

Number two is more easily written as:

  • 'thumb' not in string
  • not re.search('thumb', string) (instead of match)

Also as I mentioned in the comments, your question says:

filenames not containing the word "thumb"

So you may wish to consider whether a name like "thumbs up" is supposed to be excluded as well.
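
If only the standalone word should count, one possible approach (an assumption on my part, not something the question asks for) is a word-boundary pattern. The sample names are invented:

```python
import re

# \b marks a boundary between a word character and a non-word character.
whole_word = re.compile(r'\bthumb\b')

# Hypothetical sample data
names = ['file1-thumb.jpg', 'thumbs-up.jpg', 'file2.jpg']

# Substring test: anything containing "thumb" is dropped.
by_substring = [n for n in names if 'thumb' not in n]

# Whole-word test: "thumbs" is not the word "thumb", so it is kept.
by_word = [n for n in names if not whole_word.search(n)]

print(by_substring)
print(by_word)
```

Note that the underscore counts as a word character, so a name like file1_thumb.jpg would not be recognised as containing the word and would be kept by the whole-word test.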

Ignoring all the bits about regular expressions, your task seems relatively simple:

  • given: a list of image filenames
  • todo: create a new list with filenames not containing the word "thumb" - i.e. only target the non-thumbnail images (with PIL - Python Imaging Library).

Assuming you have a list of filenames that looks something like this:

filenames = [ 'file1.jpg', 'file1-thumb.jpg', 'file2.jpg', 'file2-thumb.jpg' ]

Then you can get a list of files not containing the word thumb like this:

not_thumb_filenames = [filename for filename in filenames if 'thumb' not in filename]

This is a list comprehension, which is essentially shorthand for:

not_thumb_filenames = []
for filename in filenames:
    if 'thumb' not in filename:
        not_thumb_filenames.append(filename)

Regular expressions aren't really necessary for this simple task.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow