Compound word Pattern detection using pandas for large datasets

https://stackoverflow.com/questions/22398266

14-06-2023

Question

Let's say I have two lists of words, where a word from the first list is followed by a word from the second. The two words are joined by a space or a dash. To keep it simple, both lists contain the same words:

First=['Derp','Foo','Bar','Python','Monte','Snake']
Second=['Derp','Foo','Bar','Python','Monte','Snake']

The following combinations of these words exist (indicated by Yes):

            Derp    Foo  Bar    Python  Monte   Snake
Derp        No      No   Yes    Yes     Yes     Yes
Foo         Yes     No   No     Yes     Yes     Yes
Bar         Yes     Yes  No     Yes     Yes     Yes
Python      No      Yes  Yes    No      Yes     Yes
Monte       No      Yes  Yes    No      No      No
Snake       Yes     No   Yes    Yes     Yes     No

I have a data set like this, in which I want to detect particular word combinations:

df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
                 'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})

If I use regexes and mark all the rows that match one of the patterns, it looks something like this:

import pandas as pd


df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
                 'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})
df['status']=''

# Raw strings keep \s from being treated as a string escape sequence.
patterns=[r'^Derp(-|\s)(Foo|Bar|Snake)$', r'^Foo(-|\s)(Bar|Python|Monte)$', r'^Python(-|\s)(Derp|Foo|Bar|Snake)', r'^Monte(-|\s)(Derp|Foo|Bar|Python|Snake)$']


# Mark every row whose Name matches any of the patterns.
for pattern in patterns:
    df.loc[df.Name.str.contains(pattern),'status'] = 'Found'

print(df)

Here is the output:

>>> 

        Class             Name status
0  Politician          Al Gore       
1           L          Foo-Bar  Found
2           H     Monte-Python  Found
3           L     Python Snake  Found
4           L  Python Anaconda       
5           H    Python-Pandas       
6           H         Derp Bar  Found
7           L      Derp Python       
8           L       JavaScript       
9      Circus     Python Monte       

[10 rows x 3 columns]

For larger datasets, writing out all the regex patterns by hand is not feasible. Is there a way to loop over a matrix of combinations and generate patterns only for the combinations that exist (marked Yes in the table above), skipping the ones that do not (marked No)? I know that the itertools library has a function called combinations that can generate all possible combinations in a loop.

Solution

I don't think it's too hard to generate those regexes from the combination matrix you've got:

import pandas as pd

# Read in your combination matrix (copy the table to the clipboard first):
pattern_mat = pd.read_clipboard()
# Map from first words to following words:
w2_dict = {}
for w1, row in pattern_mat.iterrows():
    w2_dict[w1] = list(row.loc[row == 'Yes'].index)
# Print all the resulting regexes.
# (A raw string is used so that the backslash in \s reaches the regex engine unchanged.)
for w1, w2_list in w2_dict.items():
    pattern = r"^{w1}(-|\s)({w2s})$".format(w1=w1, w2s='|'.join(w2_list))
    print(pattern)

Output:

^Monte(-|\s)(Foo|Bar)$
^Snake(-|\s)(Derp|Bar|Python|Monte)$
^Bar(-|\s)(Derp|Foo|Python|Monte|Snake)$
^Foo(-|\s)(Derp|Python|Monte|Snake)$
^Python(-|\s)(Foo|Bar|Monte|Snake)$
^Derp(-|\s)(Bar|Python|Monte|Snake)$
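To actually mark the rows, the generated patterns can be joined into a single regex and applied in one pass. The following is only a sketch under a few assumptions of mine: the matrix is embedded as text (with a made-up `word` header for the index column) instead of being read from the clipboard, rows are taken as the first word and columns as the second, and the names `MATRIX_TEXT`, `alternatives` and `combined` are purely illustrative.

```python
import io
import re

import pandas as pd

# The Yes/No matrix from the question, embedded as text so the example is
# self-contained. The "word" header for the index column is an assumption.
MATRIX_TEXT = """\
word Derp Foo Bar Python Monte Snake
Derp No No Yes Yes Yes Yes
Foo Yes No No Yes Yes Yes
Bar Yes Yes No Yes Yes Yes
Python No Yes Yes No Yes Yes
Monte No Yes Yes No No No
Snake Yes No Yes Yes Yes No
"""

pattern_mat = pd.read_csv(io.StringIO(MATRIX_TEXT), sep=r"\s+", index_col=0)

# Build one alternative per first word, e.g. Monte(?:-|\s)(?:Foo|Bar).
# Non-capturing groups avoid the match-group warning from str.contains,
# and re.escape protects words that contain regex metacharacters.
alternatives = []
for w1, row in pattern_mat.iterrows():
    w2_list = [re.escape(w2) for w2 in row[row == "Yes"].index]
    if w2_list:  # skip first words that have no valid second word
        alternatives.append(
            r"{w1}(?:-|\s)(?:{w2s})".format(w1=re.escape(w1), w2s="|".join(w2_list))
        )

# Anchor the whole alternation so only complete names match.
combined = "^(?:{})$".format("|".join(alternatives))

df = pd.DataFrame({
    "Name": ["Al Gore", "Foo-Bar", "Monte-Python", "Python Snake",
             "Python Anaconda", "Python-Pandas", "Derp Bar", "Derp Python",
             "JavaScript", "Python Monte"],
    "Class": ["Politician", "L", "H", "L", "L", "H", "H", "L", "L", "Circus"],
})
df["status"] = ""
df.loc[df["Name"].str.contains(combined, regex=True), "status"] = "Found"
print(df)
```

With the table read this way, the rows marked Found are Python Snake, Derp Bar, Derp Python and Python Monte. This differs from the output in the question because the hand-written patterns there do not correspond exactly to the table (for example, Foo-Bar is marked No in the table).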

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow