Learning character embeddings with GenSim

https://datascience.stackexchange.com/questions/19481

22-10-2019

Question

I am learning deep learning, and as a first exercise I am trying to build a system that learns a very simple task: capitalizing the first letter of each word. As a first step, I tried to create "character embeddings", that is, a vector for each character. I am using the following code:

import gensim
model = gensim.models.Word2Vec(sentences)

where sentences is a list of lists of chars which I took from this long Wikipedia page. For example, sentences[101] is:

[' ', ' ', ' ', ' ', 'S', 'p', 'e', 'a', 'k', 'i', 'n', 'g', ' ', 'a', 't', ' ', 't', 'h', 'e', ' ', 'c', 'o', 'n', 'c', 'l', 'u', 's', 'i', 'o', 'n', ' ', 'o', 'f', ' ', 'a', ' ', 'm', 'i', 's', 's', 'i', 'l', 'e', ' ', 'e', 'x', 'e', 'r', 'c', 'i', 's', 'e', ... ]
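
The preprocessing that produced this list is not shown. As a rough sketch only, a list of this shape could be built by splitting the page text into lines and turning each line into a list of characters. The variable text and the line-based splitting are assumptions made for illustration, not the code actually used here.

```python
# `text` is a hypothetical stand-in for the downloaded page contents.
text = "    Speaking at the conclusion\nof a missile exercise\n\nnext paragraph"

# One "sentence" per non-empty line; each sentence is a list of single characters.
# Leading spaces are kept, as in the sentences[101] example.
sentences = [list(line) for line in text.splitlines() if line]

print(sentences[0][:6])
```

A list built this way can be passed to gensim.models.Word2Vec in the same way as above.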

To test the model, I did:

model.wv.most_similar(positive=['A', 'b'], negative=['a'], topn=3)

I hoped to get 'B' at the top, since 'A'-'a'+'b'='B', but I got:

[('D', 0.5388374328613281),
 ('N', 0.5219535827636719),
 ('V', 0.5081528425216675)]

(Also, my capitalization application did not work very well, but this is probably because of the embeddings.)
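
The arithmetic behind such a query can be illustrated with hand-made vectors. The sketch below is a simplification under stated assumptions: the vectors dictionary and the analogy helper are hypothetical, the four dimensions are chosen by hand so that one of them encodes case, and gensim's own most_similar differs in details such as normalizing the input vectors before combining them.

```python
import math

# Hypothetical hand-made vectors: dimensions are [is_a, is_b, is_d, is_upper].
vectors = {
    "a": [1, 0, 0, 0], "A": [1, 0, 0, 1],
    "b": [0, 1, 0, 0], "B": [0, 1, 0, 1],
    "d": [0, 0, 1, 0], "D": [0, 0, 1, 1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def analogy(positive, negative, topn=3):
    """Add the positive vectors, subtract the negative ones, rank the rest by cosine."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for key in positive:
        query = [q + x for q, x in zip(query, vectors[key])]
    for key in negative:
        query = [q - x for q, x in zip(query, vectors[key])]
    used = set(positive) | set(negative)  # the query characters are excluded
    scored = [(k, cosine(query, v)) for k, v in vectors.items() if k not in used]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]

result = analogy(positive=["A", "b"], negative=["a"])
print(result[0][0])
```

With vectors like these, where case is an explicit dimension, 'B' ranks first. The query on the trained model only behaves this way if training happens to produce a comparable "case" direction.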

What should I do to get embeddings that identify capitalization?

Solution

I believe you have misunderstood the word2vec concept. For words, the feature vector of a word is learned from the words that surround it.

"You shall know a word by the company it keeps." (J. R. Firth)

In your case characters are used instead of words, so the feature vector for each character depends on the characters adjacent to it. Your example might work if you had training sentences like the following:

ABCDEFGHI
abcdefghi
AbcDEFghi
aBcdEfgHI
abcDEFgHi
ABcdEFGHi

With these training sentences, the vectors for 'A', 'a', 'B' and 'b' would tend to capture both case and English alphabetical order. When trained on Wikipedia sentences, however, the vectors instead capture how likely characters are to appear together in meaningful words. For instance, the letters closest to 'C' would be 'a', 'o' or 'e', but hardly 'x' or 'd', because there are words like 'covenant', 'country' and 'cat', while no common word begins with 'cx'.
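
To make this concrete, one can count which characters occur next to each other in a text. The sketch below is not word2vec itself, only a rough look at the kind of context signal it learns from. The neighbor_counts helper and the two sample sentences are made up for illustration and are not the Wikipedia data from the question.

```python
from collections import Counter, defaultdict

def neighbor_counts(sentences, window=1):
    """For each character, count the characters found within `window` positions of it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, ch in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[ch][sent[j]] += 1
    return counts

# Hypothetical stand-in for natural English text.
natural = [list(s) for s in ["the cat came to the country",
                             "a covenant can concern a coast"]]
counts = neighbor_counts(natural)

print(counts["c"].most_common(3))  # most frequent neighbors of 'c'
print(counts["c"]["x"])            # how often 'x' is adjacent to 'c'
```

In this sample the neighbors of 'c' are mostly spaces and vowels, and 'x' never appears next to it. Nothing in these contexts links 'c' to 'C', which is why case is not something the embeddings have a reason to encode.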

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange