Words to numbers faster lookup
16-10-2019
Problem
I'm training an LSTM for sentiment analysis on a review dataset downloaded from here. The music review dataset contains about 150K data points (reviews of varying length, labelled pos or neg). After creating a dictionary, I'm running a Python script to replace strings (words) with the numbers that keras/theano will embed later.
The problem is that such a large dataset requires a lot of time for lookup. I would appreciate any suggestions for a faster lookup tool or a similar approach. Currently I just loop through every word in the corpus and replace it with the corresponding number from the dictionary (essentially one-hot encoding).
EDIT:
I'm doing roughly the following: each Python list is a sentence (before tokenization here):
['noble', 'interesting_superlatives',...,'the_idea']
which I want to convert to a list of integers, like:
[143599, 12387,...,7582]
I referred to it (probably incorrectly) as one-hot encoding because for each word there is exactly one number in the dictionary.
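The replacement step described above can be sketched as follows; the corpus and vocabulary here are made-up placeholders, not the actual review data:

```python
# Two toy "sentences" standing in for the tokenized review corpus.
corpus = [['noble', 'interesting'], ['the', 'idea']]

# Build the word -> integer dictionary once, over the whole vocabulary.
vocab = {word: i for i, word in enumerate(
    sorted({w for sent in corpus for w in sent}))}

# Replace every word with its integer id (0 for out-of-vocabulary words).
encoded = [[vocab.get(w, 0) for w in sent] for sent in corpus]
```

Because `vocab` is a plain dict, each `get` is an average O(1) hash lookup, so the whole pass is linear in the number of tokens.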
Solution
I'd like to extend @Emre's great answer with another example: we are going to replace all tokenized words from "1984" by George Orwell (120K words):
In [163]: %paste
import requests
import nltk
import pandas as pd
# source: https://github.com/dwyl/english-words
fn = r'D:\temp\.data\words.txt'
url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
r = requests.get(url)
# read words into Pandas DataFrame
df = pd.read_csv(fn, header=None, names=['word'])
# shuffle DF, so we will have random indexes
df = df.sample(frac=1)
# convert Pandas DF into dictionary: {'word1': unique_number1, 'word2': unique_number2, ...}
lkp = df.reset_index().set_index('word')['index'].to_dict()
# tokenize "1984" (c) George Orwell
words = nltk.tokenize.word_tokenize(r.text)
print('Word Dictionary size: {}'.format(len(lkp)))
print('We have tokenized {} words...'.format(len(words)))
## -- End pasted text --
Word Dictionary size: 354983
We have tokenized 120251 words...
In [164]: %timeit [lkp.get(w, 0) for w in words]
10 loops, best of 3: 66.3 ms per loop
Conclusion: it took 66 ms to build a list of numbers for a list of 120K words, using a dictionary containing 354,983 entries.
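The same lookup table can also be built without pandas; here is a minimal sketch, where the in-memory word list stands in for reading `words.txt` from disk:

```python
import random

# Stand-in for the contents of words.txt (one word per line).
words = ['cat', 'dog', 'mouse', 'caterpillar']

# Shuffle so the ids are random, as the DataFrame.sample(frac=1) step did.
random.shuffle(words)

# {word: unique_number} lookup dictionary.
lkp = {word: i for i, word in enumerate(words)}

# Map a token stream to ids, defaulting to 0 for unknown words.
tokens = ['dog', 'cat', 'unknown']
ids = [lkp.get(w, 0) for w in tokens]
```

The list comprehension at the end is the same pattern timed above; the only difference is how the dictionary is constructed.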
Other tips
You're doing something wrong. I can query a 100K word dict in nanoseconds:
word_list = open('/usr/share/dict/words').read().split()
len(word_list)
> 99171
word_dict = {word: hash(word) for word in word_list}
%timeit word_dict['blazing']
> 10000000 loops, best of 3: 33.8 ns per loop
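One caveat worth noting about the `{word: hash(word)}` dict above: Python's `hash()` is fast, but its values are not guaranteed unique across words, and string hashes change between runs when hash randomization (`PYTHONHASHSEED`) is enabled. Sequential ids from `enumerate` are just as fast to look up and are stable; a short sketch (the word list here is a stand-in for `/usr/share/dict/words`):

```python
# Stand-in for the word list loaded from /usr/share/dict/words.
word_list = ['blazing', 'cat', 'dog']

# Sequential ids: unique by construction and identical run after run.
word_dict = {word: i for i, word in enumerate(word_list)}
# word_dict['blazing'] is always 0, on every run
```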
You could use a trie. From the Wikipedia definition, a trie:
is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
pygtrie offers an implementation of tries with a dict interface. Here is an example:
import pygtrie as trie
words = ['cat', 'caterpillar', 'dog', 'mouse']
structure = trie.Trie()
for i, word in enumerate(words):
    structure[word] = i
print(structure['caterpillar'])
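To illustrate what such a structure looks like internally, here is a toy character trie with the same dict-style interface. This is only a sketch of the idea, not the pygtrie implementation, which is far more complete:

```python
class Trie:
    """Toy character trie with dict-style item access."""

    def __init__(self):
        self.children = {}    # char -> child Trie node
        self.value = None
        self.has_value = False

    def __setitem__(self, key, value):
        # Walk/create one node per character, store the value at the end.
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value, node.has_value = value, True

    def __getitem__(self, key):
        # Walk the path; missing characters or prefixes raise KeyError.
        node = self
        for ch in key:
            node = node.children[ch]
        if not node.has_value:
            raise KeyError(key)
        return node.value

t = Trie()
for i, word in enumerate(['cat', 'caterpillar', 'dog', 'mouse']):
    t[word] = i
```

Note that `'cat'` and `'caterpillar'` share their first three nodes, which is what gives tries their compact storage of large vocabularies with common prefixes.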