Substring search for multiword strings - Python

https://stackoverflow.com/questions/15106584

15-03-2022
Question

I want to check a set of sentences to see whether any seed words occur in them. However, I want to avoid a plain substring test such as seed in line, because that would report the seed word ring as appearing in a document that contains the word bring.

I also want to check whether multiword expressions (MWEs), such as words with spaces, appear in the document.

I've tried the following, but it is very slow. Is there a faster way of doing this?

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

docs_seed = []
for d in docs:
  toAdd = False
  for s in seed:
    # Multiword seeds: plain substring test
    if " " in s:
      if s in d:
        toAdd = True
    # Single-word seeds: compare against the whitespace-split tokens
    if s in d.split(" "):
      toAdd = True
    if toAdd:
      docs_seed.append((s,d))
      break
print(docs_seed)

The desired output should be this:

[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]

Solution

Consider using a regular expression:

import re

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)

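\b matches the empty string at the start or end of a "word" (a sequence of word characters: letters, digits, and underscores), so the seed ring does not match inside bring.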

Example:

>>> for line in docs:
...     print(pattern.findall(line))
... 
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
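To get from the findall output above to the (seed, doc) pairs asked for in the question, one option is to take only the first match per document with pattern.search. The following is a minimal sketch under that assumption; the function name find_seeded_docs is hypothetical and not part of the original answer. Note that a regex reports the leftmost match in the document, which is not necessarily the first seed in list order.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]


def find_seeded_docs(seeds, documents):
    """Pair each matching document with the leftmost seed found in it.

    Hypothetical helper: documents without any whole-word seed are skipped.
    """
    # Compile once; re.escape protects seeds containing regex metacharacters.
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(s) for s in seeds) + r')\b')
    result = []
    for d in documents:
        match = pattern.search(d)  # stops at the first (leftmost) match
        if match:
            result.append((match.group(0), d))
    return result


docs_seed = find_seeded_docs(seed, docs)
for pair in docs_seed:
    print(pair)
```

For the sample data this yields the three pairs listed as the desired output in the question. Because alternatives are tried in list order at each position, bar wins over bar bar in the third document; putting longer seeds first in the list would prefer the longer match instead.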

OTHER TIPS

This should work and be somewhat faster than your current approach:

docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        if pos == -1:
            continue
        end = pos + len(s)
        # The match must be delimited by spaces or by the ends of the string.
        starts_ok = pos == 0 or d[pos - 1] == " "
        ends_ok = end == len(d) or d[end] == " "
        if starts_ok and ends_ok:
            docs_seed.append((s, d))
            break

find gives us the position of the seed value in the doc (or -1 if it is not found). We then check that the character before the match is a space (or that the match starts the string) and that the character after it is a space (or that the match ends the string). Note that find only returns the first occurrence, so a doc such as "bring the ring" would not be matched for the seed ring. This also fixes a bug in your original code, where multiword expressions did not need to start or end on a word boundary: your original code would match "words with spaces" for an input like "swords with spaces".
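The boundary check described above can also be wrapped in a small helper that keeps searching after a rejected occurrence, so that a later whole-word occurrence is still found. This is only a sketch of one way to do it; the name contains_whole is hypothetical, and only the space character is treated as a delimiter, as in the answer.

```python
def contains_whole(seed_text, doc):
    """Return True if seed_text occurs in doc delimited by spaces or string ends."""
    start = doc.find(seed_text)
    while start != -1:
        end = start + len(seed_text)
        # Left boundary: start of string or a preceding space.
        left_ok = start == 0 or doc[start - 1] == " "
        # Right boundary: end of string or a following space.
        right_ok = end == len(doc) or doc[end] == " "
        if left_ok and right_ok:
            return True
        # This occurrence was embedded in a longer word; try the next one.
        start = doc.find(seed_text, start + 1)
    return False


print(contains_whole('ring', 'bring the ring'))
print(contains_whole('words with spaces', 'swords with spaces'))
```

Since punctuation is not treated as a delimiter here, a seed followed directly by a comma or full stop is not matched; the regular expression approach with \b handles that case.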

Licensed under: CC-BY-SA with attribution