What I want to do is loop through my database, searching each document for the presence of certain listed terms -- some of which are bigrams (and trigrams, if necessary). If the terms are present I will submit the document's index and blah blah blah.
I know NLTK offers an nltk.bigrams() call, but having never used it I can't get it working and, even if I could, I don't know how to confirm I'm using it properly. I'm hoping someone on SO can help.
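For reference, here is roughly the shape of output I expect from nltk.bigrams() (a minimal sketch; I'm using plain zip() as a stand-in, since as I understand it nltk.bigrams(tokens) yields the same adjacent-word pairs):

```python
# sketch: generate bigrams from a tokenized document
# (as I understand it, nltk.bigrams(tokens) produces the same pairs as zip(tokens, tokens[1:]))
tokens = "live music at the depot".split()

bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# -> [('live', 'music'), ('music', 'at'), ('at', 'the'), ('the', 'depot')]
```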
Here is a simplified version of what my code looks like presently:
word_list = ['live music', 'classical', 'local band', 'new album', 'punk rock',
             'pop music', 'rap', 'blues', 'electronic', 'original compositions',
             'musical', 'russian music', 'music festival', 'start', 'rap battle',
             'country music', 'rapper live', 'rap duo', 'r&b', 'live', 'music',
             'bands', 'call', 'ska', 'electro', 'bluegrass band', 'reggae', 'play',
             'latin', 'quintet', 'jazz', 'the piano', 'band', 'techno', 'facebook',
             'reggae music', 'tribute band', 'must', 'backup band', 'country rock',
             'last', 'rap live', 'country', 'concert series', 'metal', 'the depot',
             'big band', 'hip hop', 'rock', 'usually', 'gospel', 'upcoming release']
idx_list = []
##initialize db cursor:
db_conn = crawler_library.connect_to_db("events")
cursor = db_conn.cursor()
## make the query (this returns *all* docs from this database):
query = "SELECT event_title, description, extra_info, venue_name FROM events " \
        "WHERE events.idx IN " + str(tuple(category_list))
# execute the query; catch and print any errors that show up so I am not flying blind
try:
    cursor.execute(query)
except MySQLdb.Error, e:
    print("MySQL Error [%d]: %s" % (e.args[0], e.args[1]))
    crawler_library.close_db_connection(db_conn)
# loop through all results in the query set, one row at a time
documents = []
if cursor.rowcount > 0:  # don't bother doing anything if we get nothing from the database
    data = cursor.fetchall()
    for row in data:
        # join the four text fields, strip stray punctuation, remove HTML, lowercase, tokenize
        raw_text = " ".join(str(field).strip(string.punctuation) for field in row)
        temp_string = nltk.clean_html(raw_text).lower().split()
        fin_doc = ""
        for word in temp_string:
            if word not in stopwords and len(word) >= 3:
                fin_doc += " " + word.strip(string.punctuation)
        documents.append(fin_doc)
Thus, as I hope is clear from the code: I have a list of terms I'm searching for (word_list), some of which are bigrams (e.g. 'live music', 'rap battle', 'bluegrass band'). I query our database, and for each document it returns (for row in data) I clean the text and append it to a new list (documents). I would now like to search each document in my documents list for the presence of any term from my word_list, including the bigrams. I hope this is clear and can be easily solved.
My only question is how to use NLTK's bigrams to determine whether any of the bigrams in my word_list appear in my documents list. Can someone please explain that? Thank you in advance.
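In case it clarifies the goal, here is a sketch of the membership test I have in mind -- the word_list and documents values here are toy stand-ins for my real data, and I use zip() in place of nltk.bigrams() (which, as I understand it, yields the same pairs):

```python
# sketch of the membership test I have in mind (toy stand-in data)
word_list = ['live music', 'rap battle', 'jazz', 'techno']
documents = ['free live music and techno all night', 'poetry reading']

# split the term list into single words and two-word phrases by word count
unigrams = set(w for w in word_list if len(w.split()) == 1)
bigrams = set(tuple(w.split()) for w in word_list if len(w.split()) == 2)

matches = []
for idx, doc in enumerate(documents):
    tokens = doc.split()
    # nltk.bigrams(tokens) should give the same adjacent-word pairs
    doc_bigrams = set(zip(tokens, tokens[1:]))
    if unigrams.intersection(tokens) or bigrams.intersection(doc_bigrams):
        matches.append(idx)  # record indices of documents containing at least one term

print(matches)
# -> [0]
```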