Question

In the original skip-gram/CBOW models, both the context word and the target word are represented as one-hot encodings.

Does fastText also use a one-hot encoding for each subword when training the skip-gram/CBOW model (so that the length of the one-hot vector is |Vocab| + |all subwords|)? If so, is it used for both the context and target words?


Solution

The quick answer is no.

Let's walk through how FastText works internally:

For representation purposes, FastText internally initializes a dictionary. The dictionary holds the collection of all words and, besides the words themselves, it also maintains a count for each word (and other information). Every time a new word is added to the dictionary its size grows: word2int_ for that word is set to size_, which is incremented after the assignment.

The code below adds a word to the dictionary.

// adding of new word
void Dictionary::add(const std::string& w) {
  int32_t h = find(w);
  ntokens_++;
  if (word2int_[h] == -1) {
    entry e;
    e.word = w;
    e.count = 1;
    e.type = getType(w);
    words_.push_back(e);
    word2int_[h] = size_++; // word2int_[h] is assigned a unique value
  } else {
    words_[word2int_[h]].count++; // word's count is being updated here
  }
}

// function used to access a word's ID (which is the representation used)
int32_t Dictionary::getId(const std::string& w) const {
  int32_t h = find(w);
  return word2int_[h];
}

As mentioned in a Medium article on fastText internals, word2int_ is indexed by the hash of the word string and stores a sequential integer index into the words_ array. The word2int_ vector can hold at most 30,000,000 entries (MAX_VOCAB_SIZE).
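
For illustration, here is a simplified sketch of the kind of open-addressing lookup such a table implies: hash the word, reduce it modulo the table size, and probe forward until either an empty slot or the matching word is found. The function name and details here are assumptions; see dictionary.cc in the fastText repo for the authoritative find().

#include <cstdint>
#include <string>
#include <vector>

struct Entry { std::string word; };

// word2int holds -1 for empty slots; in fastText its size is bounded by
// MAX_VOCAB_SIZE (30,000,000).
int32_t probeFind(const std::vector<int32_t>& word2int,
                  const std::vector<Entry>& words,
                  const std::string& w, uint32_t h) {
  int32_t size = static_cast<int32_t>(word2int.size());
  int32_t id = h % size;
  // Linear probing: skip slots taken by other (colliding) words.
  while (word2int[id] != -1 && words[word2int[id]].word != w) {
    id = (id + 1) % size;
  }
  return id; // slot holding the word's ID, or an empty slot if w is unseen
}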

For the embeddings, a matrix of size M x N is created, where M = nwords + bucket: the number of vocabulary words plus bucket, the number of rows allocated for all the n-gram (subword) tokens; N is the dimension of the embedding vectors. This means the representation of a single word or subword is just one integer row index, not a one-hot vector of length |Vocab| + |all subwords|.
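
To see why no one-hot vector is ever materialized, here is a small sketch (the types and function name are assumptions, simplified relative to the real Vector/Matrix classes) of how a token's representation is obtained: the integer IDs select rows of the input matrix, and those rows are averaged.

#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // M rows x N columns

// ids = the word's own ID followed by its subword IDs (already offset by nwords).
std::vector<float> inputVector(const Matrix& input,
                               const std::vector<int32_t>& ids) {
  std::vector<float> vec(input[0].size(), 0.0f);
  for (int32_t id : ids) {                  // each ID selects one row of the matrix
    for (std::size_t j = 0; j < vec.size(); ++j) {
      vec[j] += input[id][j];
    }
  }
  for (float& v : vec) {
    v /= ids.size();                        // average of the selected rows
  }
  return vec;
}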

The code below shows how the hash is computed and how the subword ID is derived from it; similar logic is used to access the subword vector. Note that h is an integer value computed with dict_->hash(), the same hash function used when a word is added to the dictionary, so looking up an ID depends only on the hash of the string.


// A subword never gets its own dictionary entry: its ID is derived from a hash.
int32_t FastText::getSubwordId(const std::string& subword) const {
  int32_t h = dict_->hash(subword) % args_->bucket; // bucket index of the n-gram
  return dict_->nwords() + h; // subword rows start right after the word rows
}

void FastText::getSubwordVector(Vector& vec, const std::string& subword) const {
  vec.zero();
  int32_t h = dict_->hash(subword) % args_->bucket;
  h = h + dict_->nwords();   // same offset as in getSubwordId
  addInputVector(vec, h);    // add the corresponding row of the input matrix to vec
}

Long story short, FastText assigns integer IDs up front and uses those IDs to index into the embedding matrix; no one-hot vectors are built for either words or subwords.

I hope this helps. All the code samples are taken from the FastText repo; feel free to dive in to understand more.

Licensed under: CC-BY-SA with attribution