Question

When constructing training data for CBOW, Mikolov et al. suggest using the word from the center of a context window. What is the "best" approach to capturing words at the beginning/end of a sentence? (I put "best" in quotes because I'm sure this depends on the task.) Implementations I see online do something like this:

# Slide a fixed window over the tokens; the first and last 2 positions
# are never used as the target word.
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]

I see two issues arising from this approach.

  • Issue 1: The approach gives imbalanced focus to the middle of the sentence. For example, the first word of the sentence can appear in only 1 context window and will never appear as the target word. Compare this to the 4th word, which will appear in 4 context windows and will also be a target word. This matters because some words appear disproportionately at the beginning of sentences (e.g., "however", "thus"). Wouldn't this approach underrepresent them?
  • Issue 2: Sentences with 4 or fewer words are ignored entirely, and short sentences in general are underweighted. For example, a sentence with 5 words contributes only one training sample, while a sentence of length 8 contributes 4 training samples. (The sketch below makes both imbalances concrete.)
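
As a quick illustration of both issues, here is a minimal sketch (the helper name and defaults are mine, not from any particular implementation) that counts how many context windows each token position falls into under the loop above:

from collections import Counter

def context_counts(n_tokens, half_window=2):
    # How often each position appears as a *context* word under the naive
    # loop above (targets exist only at half_window .. n_tokens - half_window - 1).
    counts = Counter()
    for i in range(half_window, n_tokens - half_window):
        for j in range(i - half_window, i + half_window + 1):
            if j != i:
                counts[j] += 1
    return counts

print(sorted(context_counts(12).items()))
# Position 0 shows up in just 1 window; interior positions show up in 4.
print(context_counts(4))
# Empty: a 4-token sentence produces no training samples at all.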

Can anyone offer insight into how much these issues affect the results, or suggest alternative approaches for constructing the training data? (I considered letting the first word be the target word and using the next N words as the context, but this creates issues of its own.)

Note: I also asked this question on Stack Overflow: https://stackoverflow.com/questions/63747999/construct-word2vec-cbow-training-data-from-beginning-of-sentence


Solution

There is a great answer to this question (on the Stack Overflow thread linked in the question). I'll summarize:

  1. The code example was taken from a "buggy" repository on GitHub and is not typical of robust solutions.
  2. Robust implementations do use the first word as a target word. If the full window is of length 10 (5 words on each side), then for the first word the method uses only the next 5 words as the context; the window is simply clamped, since the left half of the context doesn't exist (see the sketch after this list).
  3. Even though the first few words of a sentence are used as target words, they still will not appear in as many contexts. This is mitigated by the fact that the contexts they do appear in are smaller: since CBOW averages the context word vectors, each word in a small context contributes a larger share of that average and therefore receives a larger update in backpropagation.
  4. Many of the more robust implementations use an entire paragraph or document rather than a single sentence (some even include punctuation as tokens). This makes sense because the end of one sentence may provide context for the beginning of the next. With this approach there are far fewer start/end boundaries, which reduces the issue.
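
A minimal sketch of the clamped-window idea from points 2 and 3 (the function name and defaults are illustrative, not the linked answer's actual code):

def cbow_pairs(tokens, half_window=2):
    # Every token becomes a target; the window is clamped at the sentence
    # boundaries, so edge words get a smaller (but nonempty) context.
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # only empty for a 1-token input
            pairs.append((context, target))
    return pairs

print(cbow_pairs(["however", "the", "model", "works"]))
# Even this 4-token sentence, which the naive loop skips entirely,
# now yields 4 training pairs, one per word.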

The answer linked above has some other helpful details and is worth reading.

Licensed under: CC-BY-SA with attribution