Question

I'm new to data mining. I want to detect topic transitions between consecutive sentences. For instance, I have a paragraph (this could be a collection of dozens of sentences, sometimes without transitional words) as follows:

As I really like Mickey Mouse, I was hoping to go to Florida. But my dad took me to Nevada. Obviously, Mickey Mouse was not there. But, I attended a camp with other children. And, I really enjoyed and learnt a lot from my camp.

Here, I want to automatically split this into the following sub-paragraphs:

  1. As I really like Mickey Mouse, I was hoping to go to Florida. But my dad took me to Nevada. Obviously, Mickey Mouse was not there.

  2. But, I attended a camp with other children. And, I really enjoyed and learnt a lot from my camp.

As far as I can tell, this is not simply a sentence-similarity measurement. What technique should be used here? Any example using Python or TensorFlow models would be greatly appreciated.

Solution

One solution could be to:

  1. Get sentence embeddings from FastText.
  2. Compute the Euclidean distance between each pair of consecutive sentences.
  3. If the distance between two consecutive sentences is large, they are probably about different topics. A distance close to 1 is a rough starting threshold; tune it on your own data.

See here how to compute sentence embeddings for the English language: https://github.com/facebookresearch/fastText/blob/5b5943c118b0ec5fb9cd8d20587de2b2d3966dfe/python/fasttext_module/fasttext/FastText.py#L127

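import fasttext
import fasttext.util
import numpy as np
import pandas as pd

fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')
fasttext.util.reduce_model(ft, 20)  # shrink vectors from 300 to 20 dimensions

def get_fasttext_sentence_embedding(sentence, ft):
    if pd.isna(sentence):
        return np.zeros(20)  # zero vector for missing text, matching the reduced dimension
    # get_sentence_vector raises on newlines, so replace them with spaces
    return ft.get_sentence_vector(sentence.replace('\n', ' '))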

Then, compute the Euclidean distance between the FastText embeddings of consecutive sentences.
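
The distance step could be sketched roughly as follows. This is only an illustration, not part of the original answer: `consecutive_distances`, `split_on_topic_shift`, `TOY_VECTORS` and the threshold of 0.5 are all hypothetical, and the hand-made 2-D vectors stand in for real embeddings so the snippet runs without downloading a model. With FastText, `embed` would be something like `lambda s: get_fasttext_sentence_embedding(s, ft)`.

```python
import math


def consecutive_distances(vectors):
    """Euclidean distance between each pair of neighbouring vectors."""
    return [math.dist(a, b) for a, b in zip(vectors, vectors[1:])]


def split_on_topic_shift(sentences, embed, threshold):
    """Group sentences, opening a new group whenever the distance to the
    previous sentence exceeds `threshold` (an assumed, tunable value)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for sentence, dist in zip(sentences[1:], consecutive_distances(vectors)):
        if dist > threshold:
            groups.append([sentence])    # large jump: start a new topic group
        else:
            groups[-1].append(sentence)  # small jump: same topic as before
    return groups


# Hand-made toy vectors (assumption: stand-ins for real sentence embeddings).
TOY_VECTORS = {
    "I wanted to see Mickey Mouse.": [0.9, 0.1],
    "Mickey Mouse was not there.": [0.8, 0.2],
    "I attended a camp.": [0.1, 0.9],
    "I enjoyed the camp.": [0.2, 0.8],
}

groups = split_on_topic_shift(list(TOY_VECTORS), TOY_VECTORS.get, threshold=0.5)
for i, group in enumerate(groups, start=1):
    print(i, " ".join(group))
```

On real text the threshold will need tuning. One option is to look at the distribution of consecutive distances over your corpus and pick a cut-off from there.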

The same can be done with LDA (a topic model), but that would require a lot of text to model the topics.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange