How to return/search for documents using NLTK bigrams?

https://stackoverflow.com/questions/17195084

01-06-2022

Question

I want to loop through my database and search each document for certain listed terms, some of which should be bigrams or trigrams where necessary. If any of the terms are present, I will record the document's index and carry on from there.

I know NLTK offers an nltk.bigrams() call, but I have never used it and can't get it to work. Even if I could, I'm not sure how to use it properly. I'm hoping someone on SO can help.

Here is a simplified version of what my code looks like presently:

import string
import MySQLdb
import nltk

#multi-word entries such as 'live music', 'rap battle', 'bluegrass band' and
#'upcoming release' (highlighted in the original post) are the bigrams
word_list        = ['live music', 'classical', 'local band', 'new album', 'punk rock',
                    'pop music', 'rap', 'blues', 'electronic', 'original compositions',
                    'musical', 'russian music', 'music festival', 'start', 'rap battle',
                    'country music', 'rapper live', 'rap duo', 'r&b', 'live', 'music',
                    'bands', 'call', 'ska', 'electro', 'bluegrass band', 'reggae', 'play',
                    'latin', 'quintet', 'jazz', 'the piano', 'band', 'techno', 'facebook',
                    'reggae music', 'tribute band', 'must', 'backup band', 'country rock',
                    'last', 'rap live', 'country', 'concert series', 'metal', 'the depot',
                    'big band', 'hip hop', 'rock', 'usually', 'gospel', 'upcoming release']

idx_list         = []

##initialize db cursor:
db_conn = crawler_library.connect_to_db("events")
cursor  = db_conn.cursor()

##make query:
query = ("SELECT event_title,description,extra_info,venue_name FROM events "
         "WHERE events.idx IN " + str(tuple(category_list))) #this will return *all* docs from this database.

#execute the query and catch any errors that show up and print them so I am not flying blind
try:
    cursor.execute(query)
except MySQLdb.Error as e:
    print("MySQL Error [%d]: %s" % (e.args[0], e.args[1]))
crawler_library.close_db_connection(db_conn)

#loop through all results in the query set, one row at-a-time
documents = []


if cursor.rowcount > 0: #don't bother doing anything if we don't get anything from the database
    data = cursor.fetchall()
    for row in data:
        #join the four text fields, strip HTML, lowercase and split into words
        temp_string = nltk.clean_html(
            str(row[0]).strip(string.punctuation).lower() + " "
            + str(row[1]).strip(string.punctuation).lower() + " "
            + str(row[2]).strip(string.punctuation).lower() + " "
            + str(row[3]).strip(string.punctuation)).lower().split()
        fin_doc = ""
        for word in temp_string:
            if word not in stopwords and len(word) >= 3:
                fin_doc += " " + word.strip(string.punctuation)
        documents.append(fin_doc) #append once per row, after all of its words are processed

As I hope is clear from the code, I have a list of terms I'm searching for (word_list), some of which are bigrams (see highlighted). I query our database and, for each document it returns (for row in data), I clean the text and add it to a new list (documents = []). I would like to check whether each document in my documents list contains any term from my word_list, including the bigrams.

My only question is how to use NLTK's bigram to determine whether any of the bigrams in my word_list are located within my documents list. Can someone please explain that? Thank you in advance.
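
For reference, a minimal sketch of what nltk.bigrams() yields and how its output can be compared with multi-word search terms. The sample sentence and the `terms` list are made up for illustration, and the `except ImportError` fallback is a hypothetical stand-in so the snippet still runs when NLTK is not installed.

```python
try:
    from nltk import bigrams          # yields tuples of adjacent tokens
except ImportError:
    # Hypothetical stand-in with the same behavior, used only if NLTK is missing.
    def bigrams(tokens):
        tokens = list(tokens)
        return zip(tokens, tokens[1:])

# Made-up document text, already lowercased and free of punctuation.
tokens = "free live music and a rap battle tonight".split()

# Each bigram is a tuple such as ('live', 'music'); join it back into a string
# so it can be compared directly with entries like 'live music'.
doc_bigrams = [" ".join(pair) for pair in bigrams(tokens)]

terms = ["live music", "rap battle", "bluegrass band"]   # sample search terms
found = [term for term in terms if term in doc_bigrams]
print(found)   # ['live music', 'rap battle']
```

The point is that a bigram is just a pair of neighbouring tokens, so joining each pair with a space gives strings in the same form as the multi-word entries in word_list.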

Solution

Here's the answer I came up with (see the description above, especially the for loop, for context):

for row in data:
    temp_string = nltk.clean_html(
        str(row[0]).strip(string.digits + string.punctuation).lower() + " "
        + str(row[1]).strip(string.digits + string.punctuation).lower() + " "
        + str(row[2]).strip(string.digits + string.punctuation).lower() + " "
        + str(row[3]).strip(string.digits + string.punctuation)).lower().split()
    temp_string     = [word for word in temp_string if word not in stopwords and len(word) >= 3]
    bigrams         = nltk.bigrams(word_tokenize(str(' '.join(temp_string))))
    #turn each bigram tuple, e.g. ('live', 'music'), into the string 'live music'
    all_terms_list  = temp_string + [str(bigram).replace(",","").replace("'", "").strip("()") for bigram in bigrams]
    for word in live_music_word_list:
        if word in all_terms_list:
            live_music_idx_list.append(row[4]) #appended once per matching term

If anyone knows how I can better optimize this code, or if I am screwing something up (the string.replace().replace() is pretty risible), I welcome the feedback. Thanks.
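
One possible cleanup, offered as a sketch rather than a drop-in replacement: join each n-gram tuple with ' '.join() instead of chaining replace() calls, and keep the candidate terms in a set so each lookup is cheap. The helper names `ngram_strings` and `matching_terms`, the sample tokens, and the shortened term list are all hypothetical. The zip-based helper stands in for nltk.bigrams() / nltk.ngrams() so the snippet runs without NLTK, and it also covers the trigram case mentioned in the question.

```python
def ngram_strings(tokens, n):
    """Return every run of n adjacent tokens as a space-joined string."""
    # zip over n staggered copies of the list: tokens[0:], tokens[1:], ...
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


def matching_terms(tokens, search_terms, max_n=3):
    """Return the search terms (1 to max_n words long) found in the tokens."""
    candidates = set()
    for n in range(1, max_n + 1):
        candidates.update(ngram_strings(tokens, n))
    return [term for term in search_terms if term in candidates]


# Hypothetical cleaned document and a shortened version of the term list.
tokens = "local band plays live music tonight".split()
search_terms = ["live music", "local band", "jazz", "band plays live"]

hits = matching_terms(tokens, search_terms)
print(hits)   # ['live music', 'local band', 'band plays live']

# Record the document once if anything matched (42 is a made-up index value).
idx_list = []
if hits:
    idx_list.append(42)
```

Two things to watch for when adapting this to the loop in the answer: that loop appends row[4] once per matching term, so a document with several hits is recorded several times, and filtering stopwords before building bigrams means a term such as 'the piano' can never match if `stopwords` contains 'the'.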

Licensed under: CC-BY-SA with attribution
