Question

I want to train a word-prediction task to generate word embeddings. My document collection contains 243k full-text articles, and the implementation is in Torch. I am struggling with the size of the dataset and need ideas on how to train word embeddings over the whole collection. My access to the research computing resource is timed, so I only get short sessions on a GPU node, which is why I opted for incremental model training:

  1. Incremental model training: One way to train on the entire dataset is to train the model on one chunk of the data, save it, then later load the pre-trained model and continue training on the next chunk. The problem I am facing with this approach is how to maintain the vocabulary/dictionary of words. In word-embedding methods the dictionary/vocab plays an important role: we sweep over all documents and build a vocab of the words whose count exceeds a minimum frequency. This vocab is a hash map that associates an index with each word, and in the training samples we replace words with their indices for simplicity in the model (I've included a minimal sketch of this structure after the list). In the case of incremental training, how do I create the dictionary incrementally? Do I have to build the vocab/dictionary over the entire collection first and then train incrementally, or is there a way to extend the vocab during incremental training as well?
  2. Another problem is the memory limit on the size of the vocab data structure. I am implementing my model in Torch, which is Lua based, and Lua puts a limit on table size, so I cannot load the vocab for the entire collection into a single table. How do I overcome such memory issues?
  3. Taking inspiration from GloVe vectors. In their paper they say: “We trained our model on five corpora of varying sizes: a 2010 Wikipedia dump with 1 billion tokens; a 2014 Wikipedia dump with 1.6 billion tokens; Gigaword 5 which has 4.3 billion tokens; the combination Gigaword5 + Wikipedia2014, which has 6 billion tokens; and on 42 billion tokens of web data, from Common Crawl. We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words, and then construct a matrix of co-occurrence counts X.” Any idea how the GloVe vectors were trained on such a big corpus and vocabulary, and how the memory restrictions might have been handled in their case? Paper reference: http://nlp.stanford.edu/pubs/glove.pdf
  4. Any ideas on how to limit the size of the dataset for generating word embeddings? How would increasing or decreasing the number of documents affect the performance or coverage of the embeddings? Is it a good idea to sample documents from the dataset? If yes, please suggest some sampling techniques.
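For concreteness, here is a minimal sketch of the word-to-index vocab described in item 1 and how it could grow chunk by chunk. It is in Python rather than Torch/Lua purely to show the structure, and the chunk contents and min_count value are made up:

from collections import Counter

def update_vocab(documents, counts, vocab, min_count=5):
    # Accumulate running word counts from one chunk of documents, then
    # assign the next free index to every word whose total count has
    # reached min_count and which is not yet in the vocab.
    for doc in documents:
        counts.update(doc.split())
    for word, count in counts.items():
        if count >= min_count and word not in vocab:
            vocab[word] = len(vocab)
    return counts, vocab

counts, vocab = Counter(), {}
chunk_1 = ["the cat sat on the mat", "the cat ran"]
chunk_2 = ["the dog sat on the log", "the dog barked"]
for chunk in (chunk_1, chunk_2):          # in practice: one chunk per GPU session
    counts, vocab = update_vocab(chunk, counts, vocab, min_count=2)
print(vocab)   # e.g. {'the': 0, 'cat': 1, ...} -- indices used in training samples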

Solution

You may be able to overcome the built-in memory limit with the tds library, which provides data structures roughly equivalent to Lua tables but allocated outside the Lua heap, so they aren't subject to Lua's memory cap. This won't help with the limits of your hardware, but it does let you have things like tables that are bigger than 2 GB.

https://github.com/torch/tds

Also, if all else fails, you could consider partitioning your vocabulary into smaller tables. When you need to access a word, you'd consult some sort of master table to find the correct vocab table for it. This would require sorting your vocabulary, so you'd still have to have it all in memory at once (or implement your own clever external sort, I guess), but you'd only need to do that once for as long as your vocabulary stays constant. You would then serialize the vocab tables and load them from disk as needed, which will be slow, but still faster than filling up your physical memory and eating into swap space. Probably.
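If it helps to picture that sharding idea, here is a rough sketch in Python; the class name, file layout, and pickle serialization are mine, purely for illustration, and in Torch you would serialize Lua/tds tables instead:

import os
import pickle
from bisect import bisect_right

class ShardedVocab(object):
    # Vocabulary split into fixed-size shards that are serialized to disk;
    # only the list of shard boundaries (the "master table") and the most
    # recently used shard stay in memory.
    def __init__(self, shard_dir, shard_size=100000):
        self.shard_dir = shard_dir
        self.shard_size = shard_size
        self.boundaries = []          # first word of each shard, in sorted order
        self._cached = (None, None)   # (shard id, shard dict) loaded last

    def build(self, vocab):
        # Sort the full vocabulary once and write it out shard by shard.
        words = sorted(vocab)
        os.makedirs(self.shard_dir, exist_ok=True)
        for start in range(0, len(words), self.shard_size):
            shard = {w: vocab[w] for w in words[start:start + self.shard_size]}
            self.boundaries.append(words[start])
            path = os.path.join(self.shard_dir, "shard_%d.pkl" % (len(self.boundaries) - 1))
            with open(path, "wb") as f:
                pickle.dump(shard, f)

    def __getitem__(self, word):
        # Master-table lookup: the shard whose first word is <= the query word.
        sid = max(bisect_right(self.boundaries, word) - 1, 0)
        if self._cached[0] != sid:    # cache miss: load that shard from disk
            path = os.path.join(self.shard_dir, "shard_%d.pkl" % sid)
            with open(path, "rb") as f:
                self._cached = (sid, pickle.load(f))
        return self._cached[1][word]  # raises KeyError for out-of-vocab words

vocab = ShardedVocab("vocab_shards", shard_size=2)
vocab.build({"ant": 0, "bee": 1, "cat": 2, "dog": 3, "eel": 4})
print(vocab["dog"])   # loads only the shard containing "dog"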

OTHER TIPS

I'm not familiar with Torch, but since word2vec and doc2vec are basically what's being considered here: these models learn from one sentence at a time, so there is no need to hold all the sentences in memory. You can iterate over each sentence in the corpus and let the model learn from it, and that is probably how people train on huge corpora, with or without high-end compute.

A short example in Python:

class SentenceIterator(object):
    def __init__(self, file_name):
        self.file_name = file_name

    def __iter__(self):
        # Stream the file line by line; only the current line is in memory.
        with open(self.file_name) as f:
            for line in f:
                yield line

sentences = SentenceIterator("path/to/file")
for line in sentences:
    model.train(line)   # model is whatever embedding model you are training

In this way, only one sentence is in memory at a time, and once it has been processed the next one is loaded. For building the vocabulary, you can first do one full pass over all the documents to build the vocab and then train, depending on how the word-embedding implementation you use expects it.
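For instance, if you happened to use gensim's word2vec implementation (just one possible library; the same pattern applies elsewhere), a restartable iterator like the one above covers both the vocabulary-building pass and the training passes:

from gensim.models import Word2Vec

class TokenizedSentences(object):
    # Yields one whitespace-tokenized sentence at a time from a text file.
    def __init__(self, file_name):
        self.file_name = file_name

    def __iter__(self):
        with open(self.file_name) as f:
            for line in f:
                yield line.split()

sentences = TokenizedSentences("path/to/file")
# gensim iterates over the corpus itself: once to count words and build the
# vocabulary (dropping words below min_count), then again for each training epoch.
model = Word2Vec(sentences, min_count=5, workers=4)
model.save("word_embeddings.model")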

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange