Question

First of all, I want to say that I am asking this question because I am interested in using BERT embeddings as document features for clustering. I am using the Transformers library from Hugging Face. I was thinking of averaging all of the WordPiece embeddings for each document so that each document has a unique vector, and then using those vectors for clustering. Please feel free to comment if you think this is not a good idea, or if I am missing or misunderstanding something.

The issue that I see with this is that you are only using the first N tokens, as specified by max_length in the Hugging Face library. What if the first N tokens are not the best representation of that document? Wouldn't it be better to randomly choose N tokens, or better yet, to randomly choose N tokens 10 times?

Furthermore, I understand that the WordPiece tokenizer is meant to replace lemmatization, so the standard NLP pre-processing is supposed to be simpler. However, since we are only using the first N tokens, if we are not getting rid of stop words then useless stop words will end up among those first N tokens. As far as I have seen, in the Hugging Face examples, no one really does additional preprocessing before tokenization.

[See the end of this question for the first 64 tokens of a document tokenized with the Hugging Face tokenizer; a sketch of how such a tokenization can be produced follows.]
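For reference, a tokenization like the one shown at the end of this question can be produced roughly as follows (a minimal sketch, assuming the bert-base-uncased checkpoint; the document text is a placeholder):

    from transformers import AutoTokenizer

    # Assumed checkpoint; any BERT WordPiece tokenizer behaves similarly.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    document = "Trump tries to smooth things over with GOP insiders. ..."  # placeholder text

    # Truncate to the first 64 tokens (including [CLS] and [SEP]).
    encoding = tokenizer(document, max_length=64, truncation=True)
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    print(tokens)  # ['[CLS]', 'trump', 'tries', ...]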

Therefore, I am asking a few questions here (feel free to answer only one or provide references to papers or resources that I can read):

  1. Why are the first N tokens chosen, instead of at random? 1a) Is there anything out there that randomly chooses N tokens, perhaps multiple times?
  2. Similar to question 1, is there any better way to choose tokens? Perhaps using TF-IDF on the tokens to at least rule out certain useless tokens?
  3. Do people generally use more preprocessing before using the WordPiece tokenizer?
  4. To what extent does the choice of max_length affect performance?
  5. Why is there a limit of 512 max length in the Hugging Face library? Why not just use the length of the longest document?
  6. Is it a good idea to average the WordPiece embeddings to get one vector per document (and hence a matrix of document vectors for clustering)?
  7. Is it a good idea to use BERT embeddings to get features for documents that can be clustered in order to find similar groups of documents? Or is there some other way that is better?

original: 'Trump tries to smooth things over with GOP insiders. Hollywood, Florida (CNN) Donald Trump\'s new delegate guru told Republican Party insiders at a posh resort here on Thursday that the billionaire front-runner is recalibrating the part "that he\'s been playing" and is ready

tokenized:

['[CLS]',
 'trump',
 'tries',
 'to',
 'smooth',
 'things',
 'over',
 'with',
 'go',
 '##p',
 'insider',
 '##s',
 '.',
 'hollywood',
 ',',
 'florida',
 '(',
 'cnn',
 ')',
 'donald',
 'trump',
 "'",
 's',
 'new',
 'delegate',
 'guru',
 'told',
 'republican',
 'party',
 'insider',
 '##s',
 'at',
 'a',
 'po',
 '##sh',
 'resort',
 'here',
 'on',
 'thursday',
 'that',
 'the',
 'billionaire',
 'front',
 '-',
 'runner',
 'is',
 'rec',
 '##ali',
 '##bra',
 '##ting',
 'the',
 'part',
 '"',
 'that',
 'he',
 "'",
 's',
 'been',
 'playing',
 '"',
 'and',
 'is',
 'ready',
 '[SEP]']

Solution

Here are the answers:

  1. In sequence modeling, we expect a sentence to be an ordered sequence, thus we cannot take random words (unlike bag of words, where we only care about which words occur and not their order). For example, in bag of words, "I ate ice-cream" and "ice-cream ate I" are the same, while this is not true for models that treat the entire sentence as an ordered sequence. Thus, you cannot pick N random words in a random order.
  2. Choosing tokens is model dependent. You can always preprocess to remove stop words and other content such as symbols, numbers, etc., if it acts as noise rather than information.
  3. I would like to clarify that lemmatization and word-piece tokenization are not the same. For example, in lemmatization "playing" and "played" are both lemmatized to "play". With word-piece tokenization, they are likely split into "play" + "##ing" or "play" + "##ed", depending on the vocabulary. Thus, more information is preserved.
  4. max_length should be chosen such that most of your sentences are fully considered (i.e., most sentences should be shorter than max_length after tokenization). There are some models that consider the complete sequence length, for example the Universal Sentence Encoder (USE) and Transformer-XL. However, note that you can also use a higher batch size with a smaller max_length, which makes training/fine-tuning faster and sometimes produces better results.
  5. The pretrained model was trained with a MAX_LEN of 512; its learned positional embeddings only cover 512 positions, so it is a limitation of the model rather than of the Hugging Face library.
  6. Specifically for BERT, as claimed by the paper, the embedding of the [CLS] token is sufficient for classification. Since it is an attention-based model, the [CLS] token captures the composition of the entire sentence and is thus sufficient. However, you can also average the embeddings of all the tokens. I have tried both; in most of my work, the average of all word-piece tokens has yielded higher performance. Some works even suggest taking the average of the embeddings from the last 4 layers. It is merely a design choice (see the mean-pooling sketch after this list).
  7. Using sentence embeddings is generally okay, but you need to verify against the literature; there can always be a better technique. There are also models specific to sentence embeddings (USE is one such model), which you can check out.
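Regarding point 6, here is a minimal sketch of mean-pooling all WordPiece embeddings of a (truncated) document into a single vector, assuming the bert-base-uncased checkpoint; the document_vector helper is my own illustration, and averaging the last four layers would instead require output_hidden_states=True:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumed checkpoint; the pooling works the same for any BERT-style encoder.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def document_vector(text, max_length=512):
        """Return one vector per document by mean-pooling the last hidden layer."""
        enc = tokenizer(text, max_length=max_length, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc)
        hidden = out.last_hidden_state                      # (1, seq_len, hidden_size)
        mask = enc["attention_mask"].unsqueeze(-1).float()  # ignore padding positions
        return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)

Stacking these vectors for all documents gives the matrix you would feed to a clustering algorithm.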

Other tips

Since many of your questions were already answered, I will only share my personal experience with your last question:

7) Is it a good idea to use BERT embeddings to get features for documents that can be clustered in order to find similar groups of documents? Or is there some other way that is better?

I think a good idea would be to start with simpler approaches. Especially when dealing with long documents, relying on vectorisers such as tf-idf may lead to better results while having the advantages of less complexity and usually more interpretability.

I just finished a clustering exercise for longer documents and went through a similar thought process and similar experiments. Eventually, I obtained the best results with tf-idf features. The pipeline I used consisted of the following steps (a sketch follows the list):

  1. Preprocess the data (stop-word removal, lemmatising, etc.)
  2. Fit a tf-idf vectorizer (alternatively, you may also try doc2vec).
  3. Run some sort of dimensionality reduction algorithm (PCA in my case).
  4. (K-means) clustering - evaluate the optimal number of clusters.
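A minimal sketch of that pipeline with scikit-learn; the toy documents, the number of PCA components, and the number of clusters are placeholders to be replaced by your own corpus and tuning:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Placeholder corpus; in practice, use your preprocessed documents
    # (stop words removed, lemmatised, etc.).
    documents = [
        "delegate convention republican party primary",
        "delegate primary vote republican candidate",
        "movie review film actor performance",
        "film director actor cinema release",
    ]

    # Steps 1-2: fit the tf-idf vectorizer on the preprocessed corpus.
    X = TfidfVectorizer().fit_transform(documents).toarray()

    # Step 3: dimensionality reduction (the number of components is a tuning choice).
    X_reduced = PCA(n_components=2).fit_transform(X)

    # Step 4: k-means clustering; try several values of k and evaluate them
    # (e.g. with silhouette scores) to pick the number of clusters.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
    print(labels)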

If you are eager to use BERT with long documents in your downstream task, you may look at these two main approaches:

Truncation methods

  • head-only (first 512 tokens)
  • tail-only (last 512 tokens)
  • head+tail

Depending on your domain, tail-only may improve results; for example, if each document is concluded with an executive summary.
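A minimal sketch of head+tail truncation at the token level; the head_tail_ids helper and the 128/382 split are my own illustration (they simply add up to 510, leaving room for [CLS] and [SEP]), assuming the bert-base-uncased tokenizer:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

    def head_tail_ids(text, head_tokens=128, tail_tokens=382, max_length=512):
        """Keep the first `head_tokens` and last `tail_tokens` WordPiece ids of a
        long document, then add [CLS]/[SEP] so the result fits in `max_length`."""
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        if len(ids) > max_length - 2:  # two positions reserved for [CLS] and [SEP]
            ids = ids[:head_tokens] + ids[-tail_tokens:]
        return tokenizer.build_inputs_with_special_tokens(ids)

Head-only and tail-only are the special cases of keeping only the first or only the last max_length - 2 tokens.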

Hierarchical methods

  • mean pooling
  • max pooling

As stated here, truncation methods apply to the input of the BERT model (the tokens), while hierarchical methods apply to the outputs of the BERT model (the embeddings).
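A minimal sketch of the hierarchical idea: split the document into BERT-sized chunks, embed each chunk, then mean- or max-pool the chunk vectors. The hierarchical_embedding helper is my own illustration, assuming the bert-base-uncased checkpoint:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def hierarchical_embedding(text, chunk_size=510, pooling="mean"):
        """Embed a long document by chunking it, embedding each chunk,
        and pooling the per-chunk vectors into one document vector."""
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        chunk_vectors = []
        for start in range(0, len(ids), chunk_size):
            chunk = tokenizer.build_inputs_with_special_tokens(ids[start:start + chunk_size])
            with torch.no_grad():
                out = model(input_ids=torch.tensor([chunk]))
            # Mean-pool the tokens within the chunk to get one vector per chunk.
            chunk_vectors.append(out.last_hidden_state.mean(dim=1))
        stacked = torch.cat(chunk_vectors, dim=0)  # (num_chunks, hidden_size)
        return stacked.mean(dim=0) if pooling == "mean" else stacked.max(dim=0).values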

Licensed under: CC-BY-SA with attribution