Getting sentence embeddings with sentence_transformers
13-12-2020
Problem
I have a text column in my data frame in which each row contains a paragraph with a variable number of sentences. I then created sentence tokens of each paragraph using nltk's sent_tokenize and put them into another column.
So my data frame looks like this:
index  text                                               class
0      ["Hello i live in berlin", "I'm xxx"]              1
1      ["My name is xx", "I have a cat", "Love is life"]  0
Now when I run:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
sentences = df['text']
sentences = sentences.tolist()
embeddings = model.encode(sentences)
I'm getting:
TypeError: expected string or bytes-like object
The encode method does not accept a list of lists of sentences as an argument.
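To see what is going wrong, here is a minimal sketch with the example rows from the question hard-coded (no model needed): `df['text'].tolist()` yields a list of lists, while each item reaching the tokenizer must be a string. Flattening the rows into one list of strings is one workaround, shown for illustration.

```python
# The rows from the question, hard-coded. df['text'].tolist() produces
# a list of lists, so each element passed to encode is a list, not a
# string -- hence "expected string or bytes-like object".
sentences = [["Hello i live in berlin", "I'm xxx"],
             ["My name is xx", "I have a cat", "Love is life"]]
print(type(sentences[0]))  # each element is a list, not a str

# One workaround: flatten into a single list of strings, which a
# list-of-strings-accepting encode call can consume.
flat = [s for row in sentences for s in row]
print(len(flat))  # 5 individual sentence strings
```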
Solution 2
I finally solved this problem.
My dataframe looks like this:
index  text                                               class
0      ["Hello i live in berlin", "I'm xxx"]              1
1      ["My name is xx", "I have a cat", "Love is life"]  0
The text column contains a list of sentences in each row. I applied the following function:
df['Embeddings'] = df['text'].apply(lambda x: model.encode(x))
It created a new column called Embeddings, which now contains a list of 768-dimensional vectors for each row. Next, I apply an averaging function (via a lambda) to each element of the new Embeddings column, which produces a single vector of length 768 per row, and store the result in a new column, say 'X'. I then feed X to the SVM along with the class labels.
Basically, what we are doing is averaging the embedding vectors generated for the sentences in the text column.
For example, say for index 0 in df['text'] we have two sentences:
["Hello i live in berlin", "I'm xxx"]
Now, after encoding, it will look something like this:
[v1,v2] # where length of v1 and v2 vectors is 768
Next, we take the average of these two vectors using np.average, which results in a single vector:
[v]
This single vector can now be easily fed to the SVM. Of course, we do this for all the rows and then feed the result to the SVM.
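The whole pipeline above can be sketched end to end. Since we cannot run the real model here, a hypothetical stub `fake_encode` (returning one random 768-dimensional vector per sentence) stands in for `model.encode`; everything else mirrors the steps described:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model.encode: returns one 768-dim vector
# per sentence, so the sketch runs without sentence_transformers.
def fake_encode(sentences):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 768))

df = pd.DataFrame({
    "text": [["Hello i live in berlin", "I'm xxx"],
             ["My name is xx", "I have a cat", "Love is life"]],
    "class": [1, 0],
})

# Step 1: one embedding matrix per row (one vector per sentence).
df["Embeddings"] = df["text"].apply(fake_encode)

# Step 2: average the per-sentence vectors into a single 768-dim vector.
df["X"] = df["Embeddings"].apply(lambda vecs: np.average(vecs, axis=0))

# Step 3: stack into a feature matrix ready for an SVM.
X = np.stack(df["X"].to_numpy())
y = df["class"].to_numpy()
print(X.shape)  # (2, 768)
```

With real embeddings, `X` and `y` would go straight into something like `sklearn.svm.SVC().fit(X, y)`.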
Another tip
The encode method only works with single sentences as strings, i.e., you need to call it for each sentence independently:
embeddings = [model.encode(s) for s in sentences]