Words to numbers faster lookup
16-10-2019
Problem
I'm training an LSTM for sentiment analysis on a review dataset downloaded from here. The music review dataset contains about 150K data points (reviews of varying length, labelled pos or neg). After creating a dictionary, I'm running a Python script to replace strings (words) with the numbers that keras/theano will embed later.
The problem is that such a large dataset requires a lot of time for lookup. I would appreciate any suggestions for a faster lookup tool or a similar approach. Currently I just loop through every word in the corpus and replace it with the corresponding number from the dictionary (essentially one-hot encoding).
EDIT:
I'm doing roughly the following: each Python list is a sentence (before tokenization here):
['noble', 'interesting_superlatives',...,'the_idea']
which I want to convert to a list of integers, like:
[143599, 12387,...,7582]
I referred to it (probably incorrectly) as one-hot encoding because for each word there is exactly one number in the dictionary.
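The replacement step described above can be sketched as follows; the corpus and vocabulary here are made-up placeholders, not the actual review data:

```python
# Two toy "sentences" standing in for the tokenized review corpus.
corpus = [['noble', 'interesting'], ['the', 'idea']]

# Build the word -> integer dictionary once, over the whole vocabulary.
vocab = {word: i for i, word in enumerate(
    sorted({w for sent in corpus for w in sent}))}

# Replace every word with its integer id (0 for out-of-vocabulary words).
encoded = [[vocab.get(w, 0) for w in sent] for sent in corpus]
```

Because `vocab` is a plain dict, each `get` is an average O(1) hash lookup, so the whole pass is linear in the number of tokens.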
Solution
I'd like to extend @Emre's great answer with another example: we are going to replace all tokenized words from "1984" by George Orwell (120K words):
In [163]: %paste
import requests
import nltk
import pandas as pd
# source: https://github.com/dwyl/english-words
fn = r'D:\temp\.data\words.txt'
url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
r = requests.get(url)
# read words into Pandas DataFrame
df = pd.read_csv(fn, header=None, names=['word'])
# shuffle DF, so we will have random indexes
df = df.sample(frac=1)
# convert Pandas DF into dictionary: {'word1': unique_number1, 'word2': unique_number2, ...}
lkp = df.reset_index().set_index('word')['index'].to_dict()
# tokenize "1984" (c) George Orwell
words = nltk.tokenize.word_tokenize(r.text)
print('Word Dictionary size: {}'.format(len(lkp)))
print('We have tokenized {} words...'.format(len(words)))
## -- End pasted text --
Word Dictionary size: 354983
We have tokenized 120251 words...
In [164]: %timeit [lkp.get(w, 0) for w in words]
10 loops, best of 3: 66.3 ms per loop
Conclusion: it took 66 ms to build a list of numbers for a list of 120K words, using a dictionary containing 354,983 entries.
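The same lookup table can also be built without pandas; here is a minimal sketch, where the in-memory word list stands in for reading `words.txt` from disk:

```python
import random

# Stand-in for the contents of words.txt (one word per line).
words = ['cat', 'dog', 'mouse', 'caterpillar']

# Shuffle so the ids are random, as the DataFrame.sample(frac=1) step did.
random.shuffle(words)

# {word: unique_number} lookup dictionary.
lkp = {word: i for i, word in enumerate(words)}

# Map a token stream to ids, defaulting to 0 for unknown words.
tokens = ['dog', 'cat', 'unknown']
ids = [lkp.get(w, 0) for w in tokens]
```

The list comprehension at the end is the same pattern timed above; the only difference is how the dictionary is constructed.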
Other tips
You're doing something wrong. I can query a 100K word dict in nanoseconds:
word_list = open('/usr/share/dict/words').read().split()
len(word_list)
> 99171
word_dict = {word: hash(word) for word in word_list}
%timeit word_dict['blazing']
> 10000000 loops, best of 3: 33.8 ns per loop
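One caveat worth noting about the `{word: hash(word)}` dict above: Python's `hash()` is fast, but its values are not guaranteed unique across words, and string hashes change between runs when hash randomization (`PYTHONHASHSEED`) is enabled. Sequential ids from `enumerate` are just as fast to look up and are stable; a short sketch (the word list here is a stand-in for `/usr/share/dict/words`):

```python
# Stand-in for the word list loaded from /usr/share/dict/words.
word_list = ['blazing', 'cat', 'dog']

# Sequential ids: unique by construction and identical run after run.
word_dict = {word: i for i, word in enumerate(word_list)}
# word_dict['blazing'] is always 0, on every run
```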
You could use a trie. From the Wikipedia definition, a trie:
is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
pygtrie offers an implementation of tries with a dict interface. Here is an example:
import pygtrie as trie
words = ['cat', 'caterpillar', 'dog', 'mouse']
structure = trie.Trie()
for i, word in enumerate(words):
    structure[word] = i
print(structure['caterpillar'])
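To illustrate what such a structure looks like internally, here is a toy character trie with the same dict-style interface. This is only a sketch of the idea, not the pygtrie implementation, which is far more complete:

```python
class Trie:
    """Toy character trie with dict-style item access."""

    def __init__(self):
        self.children = {}    # char -> child Trie node
        self.value = None
        self.has_value = False

    def __setitem__(self, key, value):
        # Walk/create one node per character, store the value at the end.
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value, node.has_value = value, True

    def __getitem__(self, key):
        # Walk the path; missing characters or prefixes raise KeyError.
        node = self
        for ch in key:
            node = node.children[ch]
        if not node.has_value:
            raise KeyError(key)
        return node.value

t = Trie()
for i, word in enumerate(['cat', 'caterpillar', 'dog', 'mouse']):
    t[word] = i
```

Note that `'cat'` and `'caterpillar'` share their first three nodes, which is what gives tries their compact storage of large vocabularies with common prefixes.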