Getting sentence embeddings with sentence_transformers
13-12-2020
Problem
I have a text column in my data frame in which each row contains a paragraph with a variable number of sentences. I then created sentence tokens of each paragraph using nltk's sent_tokenize and put them into another column.
So my data frame looks like this:
index  text                                               class
0      ["Hello i live in berlin", "I'm xxx"]              1
1      ["My name is xx", "I have a cat", "Love is life"]  0
Now when I run:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
sentences = df['text']
sentences = sentences.tolist()
embeddings = model.encode(sentences)
I'm getting:
TypeError: expected string or bytes-like object
The encode method does not accept a list of lists of sentences as an argument.
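To see what is going wrong, here is a minimal sketch with the example rows from the question hard-coded (no model needed): `df['text'].tolist()` yields a list of lists, while each item reaching the tokenizer must be a string. Flattening the rows into one list of strings is one workaround, shown for illustration.

```python
# The rows from the question, hard-coded. df['text'].tolist() produces
# a list of lists, so each element passed to encode is a list, not a
# string -- hence "expected string or bytes-like object".
sentences = [["Hello i live in berlin", "I'm xxx"],
             ["My name is xx", "I have a cat", "Love is life"]]
print(type(sentences[0]))  # each element is a list, not a str

# One workaround: flatten into a single list of strings, which a
# list-of-strings-accepting encode call can consume.
flat = [s for row in sentences for s in row]
print(len(flat))  # 5 individual sentence strings
```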
Solution 2
I finally solved this problem.
My dataframe looks like this:
index  text                                               class
0      ["Hello i live in berlin", "I'm xxx"]              1
1      ["My name is xx", "I have a cat", "Love is life"]  0
The text column contains a list of sentences in each row. I applied the following function:
df['Embeddings'] = df['text'].apply(lambda x: model.encode(x))
It created a new column called Embeddings, which now contains a list of 768-dimensional vectors for each row. Next, I apply an averaging function (via a lambda) to each element of the new Embeddings column, which produces a single vector of length 768 per row, and store the result in a new column, say 'X'. I then feed X to the SVM along with the class labels.
Basically, what we are doing is averaging the embedding vectors generated for the sentences in the text column.
For example, say for index 0 in df['text'] we have two sentences:
["Hello i live in berlin", "I'm xxx"]
Now, after encoding, it will look something like this:
[v1,v2] # where length of v1 and v2 vectors is 768
Next, we take the average of these two vectors using np.average, which results in a single vector:
[v]
This single vector can now be easily fed to the SVM. Of course, we do this for all the rows and then feed the result to the SVM.
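The whole pipeline above can be sketched end to end. Since we cannot run the real model here, a hypothetical stub `fake_encode` (returning one random 768-dimensional vector per sentence) stands in for `model.encode`; everything else mirrors the steps described:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model.encode: returns one 768-dim vector
# per sentence, so the sketch runs without sentence_transformers.
def fake_encode(sentences):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 768))

df = pd.DataFrame({
    "text": [["Hello i live in berlin", "I'm xxx"],
             ["My name is xx", "I have a cat", "Love is life"]],
    "class": [1, 0],
})

# Step 1: one embedding matrix per row (one vector per sentence).
df["Embeddings"] = df["text"].apply(fake_encode)

# Step 2: average the per-sentence vectors into a single 768-dim vector.
df["X"] = df["Embeddings"].apply(lambda vecs: np.average(vecs, axis=0))

# Step 3: stack into a feature matrix ready for an SVM.
X = np.stack(df["X"].to_numpy())
y = df["class"].to_numpy()
print(X.shape)  # (2, 768)
```

With real embeddings, `X` and `y` would go straight into something like `sklearn.svm.SVC().fit(X, y)`.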
Another tip
The encode method only works with single sentences as strings, i.e., you need to call it for each sentence independently:
embeddings = [model.encode(s) for s in sentences]