Text processing

https://datascience.stackexchange.com/questions/16713

16-10-2019

Question

I am completely new to clustering texts. I'm using the Goodreads API to get book synopses. My goal is to group similar books, for example:

Politics
Music
Biographies etc...

While Goodreads provides genres, I would like to use the synopsis text for this. Let's say I get N book synopses like this:

<description>
<![CDATA[
<b>Alternate cover edition can be found <a href="https://www.goodreads.com/book/show/10249685-dune" rel="nofollow">here</a>. </b> and <a href="https://www.goodreads.com/book/show/11273438-dune" rel="nofollow">here</a><br /><br />Here is the novel that will be forever considered a triumph of the imagination. Set on the desert planet Arrakis, <b>Dune</b> is the story of the boy Paul Atreides, who would become the mysterious man known as Muad'Dib. He would avenge the traitorous plot against his noble family--and would bring to fruition humankind's most ancient and unattainable dream.<br />A stunning blend of adventure and mysticism, environmentalism and politics, Dune won the first Nebula Award, shared the Hugo Award, and formed the basis of what it undoubtedly the grandest epic in science fiction.
]]>
</description>

I have read about cosine similarity and Google's new NLP tools, but I want to start with this:

Represent each book description as features (usually a bag of words with TF-IDF)
Calculate the similarity between two books (cosine similarity)

Questions:

What's the most efficient algorithm to create a matrix of cosine similarities between all N books?
How can I cluster books together based on that matrix?

Any other ideas will be great.

Solution

Since you are going to use TF-IDF representations, you already have a feature matrix. To calculate the cosine similarity between all vectors, you can use:

from sklearn.metrics.pairwise import cosine_similarity

# tfidfmat is your TF-IDF matrix, one row per book;
# a NumPy array or a SciPy sparse matrix both work
similarity = cosine_similarity(tfidfmat)  # (N, N) matrix of pairwise similarities
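For completeness, here is a minimal end-to-end sketch of how tfidfmat could be built from raw synopses. The three sample strings, the strip_html helper and the regex-based tag removal are assumptions made for illustration only; they are not part of the Goodreads API, and real descriptions may need a proper HTML parser.

```python
import html
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for synopses fetched from the Goodreads API.
synopses = [
    "<b>Dune</b> is set on the desert planet Arrakis.<br />A blend of adventure and politics.",
    "A desert planet, a noble family and a political plot.",
    "The biography of a jazz musician and the music of his era.",
]


def strip_html(text):
    """Remove tags and decode entities so markup does not become tokens."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(no_tags)


cleaned = [strip_html(s) for s in synopses]

# Rows are L2-normalized by default (norm="l2").
vectorizer = TfidfVectorizer(stop_words="english")
tfidfmat = vectorizer.fit_transform(cleaned)  # sparse matrix, one row per book

# Dense (N, N) matrix; entry [i, j] is the similarity of book i and book j.
similarity = cosine_similarity(tfidfmat)

# For large N, a sparse result avoids materializing all N * N values densely.
sparse_similarity = cosine_similarity(tfidfmat, dense_output=False)
```

The full matrix grows quadratically with N, so for a large catalogue it may be more practical to keep the sparse result or to compute similarities in batches of rows.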

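To begin clustering, you can start with the K-means algorithm, using cosine similarity as the distance metric. Note that scikit-learn's KMeans only supports Euclidean distance. However, TfidfVectorizer L2-normalizes each row by default, and for unit-length vectors the squared Euclidean distance equals 2 - 2 * (cosine similarity), so the rows closest in Euclidean distance are also the most cosine-similar. Here's an example from scikit-learn itself on clustering documents.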
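A minimal sketch of that clustering step follows. The four sample descriptions and the choice of two clusters are assumptions for illustration; on real data the number of clusters would have to be chosen by inspection or with a metric such as the silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already cleaned synopses: two about politics, two about music.
docs = [
    "Election campaign politics and government policy",
    "Government policy debate during the election",
    "Jazz music and the life of a guitar musician",
    "Rock music history and famous guitar bands",
]

# L2-normalized TF-IDF rows, so Euclidean K-means roughly tracks cosine similarity.
vectorizer = TfidfVectorizer(stop_words="english")
tfidfmat = vectorizer.fit_transform(docs)

# n_clusters=2 is an assumption for this toy data set.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidfmat)  # one cluster id per book

# Group the documents by their assigned cluster id.
clusters = {}
for doc, label in zip(docs, labels):
    clusters.setdefault(int(label), []).append(doc)
```

The cluster ids themselves are arbitrary; only which books share an id is meaningful. Inspecting the highest-weighted terms in each row of kmeans.cluster_centers_ is one way to give a cluster a readable name such as "Politics" or "Music".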

Further things to try: if the above methods do not work as well as you expect, look into word2vec and doc2vec and, instead of TF-IDF, which is a bag-of-words approach, use word vector representations. Here is a good blog explaining the concept.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange