Using Python to search for keywords in a PDF [duplicate]

https://stackoverflow.com/questions/23583333

19-07-2023

Question

I'm searching for keywords in a PDF file, so I'm trying to match /AA or /Acroform like this:

import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search(r"\b" + l.rstrip() + r"\b", s):
    print("yes")

Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists. Can anyone help me with this?

Solution

\b only matches in between a \w (word) and a \W (non-word) character, or vice versa, or when a \w character is at the edge of a string (start or end).
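As a minimal sketch of that rule, the snippet below tries the question's keyword with and without a leading \b. The third test string ("x/Acroform here") is made up for illustration and is not from the question.

```python
import re

s = "/Acroform is what I'm looking for"

# A leading \b needs a word character on one side of the position.
# At the start of the string the only neighbour is "/", a non-word
# character, so there is no boundary and the search fails.
leading = re.search(r"\b/Acroform\b", s)

# Dropping the leading \b lets the keyword match; the trailing \b
# still works because "m" is followed by a space.
no_leading = re.search(r"/Acroform\b", s)

# With a word character directly before the slash, \b does match there.
mid = re.search(r"\b/Acroform\b", "x/Acroform here")

print(leading is None, no_leading is not None, mid is not None)
```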

Your string starts with a forward slash (/), which is a non-word character (\W), so \b can never match between the start of the string and the /. Don't use \b here; use an explicit negative look-behind for a word character instead:

re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)

The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.
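To make the look-behind's behaviour concrete, here is a small sketch that applies the same pattern shape to the /AA keyword from the question. The sample strings are invented for illustration and are not taken from a real PDF.

```python
import re

pattern = re.compile(r"(?<!\w)/AA\b")

# Start of the string: there is no preceding character, so the
# negative look-behind succeeds.
at_start = pattern.search("/AA 12 0 R")

# Preceded by a space (a non-word character): also succeeds.
after_space = pattern.search("<< /AA 12 0 R >>")

# Preceded by a word character: the look-behind rejects the position.
after_word = pattern.search("foo/AA 12 0 R")

# The trailing \b rejects a longer name that merely starts with /AA.
longer_name = pattern.search("/AAB 12 0 R")

print(at_start is not None, after_space is not None,
      after_word is None, longer_name is None)
```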

I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression metacharacters in the string you are searching for are properly escaped.
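The following sketch shows why the escaping matters. The keyword "/Foo.Bar" and the sample strings are hypothetical, chosen only because "." is a regular expression metacharacter.

```python
import re

keyword = "/Foo.Bar"  # hypothetical keyword containing the metacharacter "."

# Without re.escape(), "." matches any character, so the pattern
# also accepts "/FooXBar", which is not the keyword.
unescaped = re.search(r"(?<!\w){}\b".format(keyword), "/FooXBar 1 0 R")

# With re.escape(), "." only matches a literal dot.
escaped = re.search(r"(?<!\w){}\b".format(re.escape(keyword)), "/FooXBar 1 0 R")
literal = re.search(r"(?<!\w){}\b".format(re.escape(keyword)), "/Foo.Bar 1 0 R")

print(unescaped is not None, escaped is None, literal is not None)
```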

Demo:

>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print('Found')
... 
Found

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow