Question

I have a text column in my dataframe where each row contains a paragraph with a variable number of sentences. I split each paragraph into sentence tokens with NLTK's sent_tokenize and stored the result in another column.

So my data frame looks like this:

index       text                                                class

0           ["Hello i live in berlin", "I'm xxx"]               1
1           ["My name is xx", "I have a cat", "Love is life"]   0

Now, when I run:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
sentences = df['text']
sentences = sentences.tolist()
embeddings = model.encode(sentences)

I'm getting:

TypeError: expected string or bytes-like object

The encode method does not accept a list of lists of sentences as an argument.


Solution 2

I finally solved this problem.

My dataframe looks like this:

index       text                                                class

0           ["Hello i live in berlin", "I'm xxx"]               1
1           ["My name is xx", "I have a cat", "Love is life"]   0

The text column contains a list of sentences in each row. I applied the following function:

df['Embeddings'] = df['text'].apply(lambda x: model.encode(x))

This creates a new column called Embeddings, which holds a list of 768-dimensional vectors for each row. Next, I apply an averaging function (via a lambda) to each element of the new Embeddings column, which produces a single vector of length 768 per row. I store that in a new column, say 'X', and then feed X to the SVM along with the class labels.
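As a minimal sketch of this averaging step, with made-up vectors standing in for the real model.encode output (the column names mirror the ones above, but the data here is fabricated for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in embeddings: each row holds a list of 768-dim sentence vectors,
# as produced by df['text'].apply(lambda x: model.encode(x)).
df = pd.DataFrame({
    "Embeddings": [
        [np.zeros(768), np.ones(768)],                              # two sentences in row 0
        [np.full(768, 1.0), np.full(768, 2.0), np.full(768, 3.0)],  # three in row 1
    ],
    "class": [1, 0],
})

# Average each row's sentence vectors into one 768-dim vector.
df["X"] = df["Embeddings"].apply(lambda vecs: np.average(vecs, axis=0))

print(df["X"].iloc[0].shape)  # (768,) -- one vector per row
```

Because `np.average(..., axis=0)` averages element-wise across the sentence vectors, every row ends up with a fixed-length representation regardless of how many sentences it originally had.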

Basically, we are averaging the embedding vectors generated for the sentences in the text column.

For example, at index 0 of df['text'] we have two sentences:

["Hello i live in berlin", 'I'm xxx'] 

Now, after encoding, it will look something like this:

[v1, v2]  # where v1 and v2 are vectors of length 768

Next, we will take average of these two vectors using np.average. It will result in a single vector:

[v]

This single vector can now be easily fed to the SVM. Of course, we do this for all the rows and then feed the result to the SVM.
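To illustrate the final step, here is a hedged sketch using random stand-in vectors in place of the real averaged embeddings (the data here is synthetic, not from the original post), showing how the per-row vectors can be fed to scikit-learn's SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data: one averaged 768-dim vector per row, plus class labels.
# Vectors for each class are drawn around different means so they separate.
rng = np.random.default_rng(42)
labels = (0, 0, 1, 1)
X = np.vstack([rng.normal(loc=label, scale=0.1, size=768) for label in labels])
y = np.array(labels)

clf = SVC(kernel="linear")   # a linear SVM on the averaged embeddings
clf.fit(X, y)
print(clf.predict(X))        # predictions for the training rows
```

Any scikit-learn classifier would work here; the point is only that the averaging step turns a variable number of sentence embeddings per row into a fixed-length feature vector that standard estimators can consume.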

Other tips

The encode method accepts a single string or a flat list of strings, but not a nested list of lists, so you need to call it on each row's sentence list independently:

embeddings = [model.encode(s) for s in sentences]
Licensed under: CC-BY-SA with attribution