Question

I was trying to use this project:

https://github.com/UKPLab/sentence-transformers

to embed non-English sentences. The language is not a human language; it is machine language (x86).

The problem is that I cannot find a simple example showing how to embed sentences from a custom dataset that has no labels or similarity values for the sentences.

Basically, I have an array of sentence lists with no labels or similarity values, and I want to embed the sentences into vectors in a way that preserves their semantics as well as possible. So far I have used word2vec and doc2vec from the gensim library, so I wanted to try this method to see whether it is any better.


Solution

The project you linked, Siamese BERT, is a BERT or RoBERTa model fine-tuned on STS or NLI data. STS data, for example, has the form "sentence 1 is similar to sentence 2 with a score of 3 out of 5". Because that training is supervised, it does not fit your purpose.

Nonetheless, do not despair: there are approaches that do not require labeled training data, although they may not perform as well as the supervised ones. The following use word embeddings, which you can train on your own corpus, to generate sentence embeddings:

Or approaches trained by feeding in just the sentences, line by line:

P.S. I have not tried all of these solutions. I suggest them because, to my knowledge, they are either well known or quite recent.
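
To make the word-embedding route concrete, here is a minimal sketch of the simplest baseline of that kind: averaging the word vectors of a sentence's tokens. This is an illustration only, not necessarily one of the methods suggested above. The `word_vectors` dictionary and its values are hypothetical toy data; in practice the vectors would come from a model trained on your own corpus, such as the word2vec model mentioned in the question. The helper names `embed_sentence` and `cosine` are likewise made up for this example.

```python
from math import sqrt

# Hypothetical toy vectors for a few x86 tokens. In a real setup these would be
# looked up from a word-embedding model trained on your own unlabeled corpus.
word_vectors = {
    "mov": [1.0, 0.0, 0.0],
    "eax": [0.0, 1.0, 0.0],
    "ebx": [0.0, 0.8, 0.2],
    "push": [0.9, 0.1, 0.0],
}


def embed_sentence(tokens, vectors):
    """Return the mean of the vectors of all known tokens, or None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Each "sentence" is a list of tokens, with no labels or similarity scores.
sentences = [["mov", "eax", "ebx"], ["push", "eax"], ["mov", "ebx", "eax"]]
embeddings = [embed_sentence(s, word_vectors) for s in sentences]

# Compare the first sentence against the others.
for i in range(1, len(sentences)):
    print(sentences[0], sentences[i], cosine(embeddings[0], embeddings[i]))
```

Note that plain averaging ignores token order, so `mov eax ebx` and `mov ebx eax` get the same embedding. For machine code, where operand order matters, that may be a real limitation and a reason to look at order-aware methods instead.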

Licensed under: CC-BY-SA with attribution