Question

I am trying to classify texts using KMeans, TfidfVectorizer, and PCA. However, many texts do not seem to be clustered correctly: I have texts in Cluster 2 that should be in Cluster 0 or 1. My question is how to improve the results: should I increase the number of clusters, or add more constraints (such as specific words to look at when clustering the texts)?

Help and suggestions will be greatly appreciated.

Code:

import re
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocessing(line):
    # keep letters only, lower-case, tokenize, drop stopwords, lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed


# n = chosen number of clusters
vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['Text'])
kmeans = KMeans(n_clusters=n).fit(vectorized_text)

# assign every document to a cluster
cl = kmeans.predict(vectorized_text)
df['Predicted'] = pd.Series(cl, index=df.index)

df.groupby("Predicted").count()

kmeans_labels = KMeans(n_clusters=n).fit(vectorized_text).labels_

# project the TF-IDF matrix with PCA for a 2D plot
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(df['Text']).todense()
pca = PCA(n_components=n).fit(X)
data2D = pca.transform(X)

kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
cluster_name = ["Cluster" + str(i) for i in set(labels)]

Solution

You have tagged this as text-classification, but you describe your efforts to cluster the documents. My answer assumes that you don't yet know how the documents should be grouped; the first step would be to determine the clusters/classes/groups, and only then to classify new documents.

There are at least 2 things one can think of for this problem.

  1. Clustering Algorithm
  2. Data for the clustering algorithm

1. Clustering Algorithm

K-Means clustering is a simple and fast algorithm that produces adequate results, and it is almost always the first tool people reach for. It does have some limitations:

  a) the user must choose the number of clusters before running it
  b) the data must not be very sparse
  c) the data must be reasonably distributed around the centroids, i.e. roughly spherical clusters

See also: k-Means Advantages and Disadvantages.

Do you know a priori how many clusters you have? You could experiment by running the algorithm multiple times with different cluster counts and observing the silhouette score, then choose the one with the best score or the one that best fits your expectations.
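For example, a quick sketch of that experiment (assuming the vectorized_text TF-IDF matrix from your code; the range of k is arbitrary):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# score several cluster counts and keep the silhouette coefficient for each
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(vectorized_text)
    scores[k] = silhouette_score(vectorized_text, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)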

There are other algorithms that might suit your needs better, like DBSCAN or agglomerative clustering. They don't require you to choose the number of clusters up front, but they do require parameters that control cluster separation. I've used DBSCAN effectively, but don't have as much experience with agglomerative clustering. Once again, you may have to experiment with the min_samples and eps parameters for DBSCAN. scikit-learn has a good description in its Clustering User Guide.
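A minimal DBSCAN sketch on the same TF-IDF matrix (the eps and min_samples values below are placeholders you would have to tune; cosine distance usually behaves better than Euclidean on TF-IDF vectors):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.7, min_samples=5, metric="cosine").fit(vectorized_text)
labels = db.labels_                      # label -1 marks points treated as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)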

2. Data for clustering algorithm

TF-IDF is a good representation for text, and it is simple and quick to produce. It represents the different terms that occur in the documents and the weight associated with each. You have a few different options to try out:

  1. sklearn.TfidfVectorizer can take a custom vocabulary. This helps by reducing the vocabulary to the important terms. Also, as suggested by @Graph4Me, you should eliminate stopwords. Both of these techniques reduce the sparsity of the matrix, in turn making your clustering algorithm more efficient (see the first sketch after this list).

  2. Switch from the TF-IDF model to a word embedding. This is much more complex, but you could use GloVe or BERT to produce the vectors; spaCy can also produce an embedding vector. This gives you a contextual representation rather than just a count/density based on the presence of words (see the second sketch after this list).
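As a sketch of the first option (the vocabulary below is purely hypothetical; you would pass the domain terms that actually matter for your texts):

from sklearn.feature_extraction.text import TfidfVectorizer

domain_vocabulary = ["price", "delivery", "refund", "quality"]   # hypothetical example terms
vect = TfidfVectorizer(stop_words="english", vocabulary=domain_vocabulary)
X = vect.fit_transform(df['Text'])       # smaller, less sparse matrix restricted to those terms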
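And a sketch of the second option using spaCy document vectors (this assumes a model that ships with word vectors, e.g. en_core_web_md, is installed):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")       # medium English model with word vectors
doc_vectors = np.vstack([nlp(text).vector for text in df['Text']])
# doc_vectors can now replace the TF-IDF matrix as input to KMeans/DBSCAN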

OTHER TIPS

It can be improved in many ways. However, the best start would be to elaborate on your pipeline together with the chosen parameters, as well as your data.

Try using the concatenation of the average-pooled word2vec vector and the max-pooled word2vec vector; you may get better results than with TF-IDF and PCA. Remove stop words, and also try lemmatized words (a sketch follows below).
If you are okay with the computational cost, go for a transformer-based transfer-learning embedding like BERT, ALBERT, etc.
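A rough sketch of the average + max-pooled word2vec idea with gensim (the pretrained model name and the reuse of the preprocessing function from the question are assumptions on my part):

import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")     # any pretrained word-vector model works

def doc_vector(tokens):
    vecs = [w2v[w] for w in tokens if w in w2v]
    if not vecs:
        return np.zeros(2 * w2v.vector_size)
    vecs = np.array(vecs)
    # concatenate the average-pooled and max-pooled word vectors
    return np.concatenate([vecs.mean(axis=0), vecs.max(axis=0)])

doc_matrix = np.vstack([doc_vector(preprocessing(t)) for t in df['Text']])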

Different approaches can be applied. In addition to the other answers, you can check: http://brandonrose.org/clustering#Tf-idf-and-document-similarity

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange