Text mining on a huge list of strings

https://stackoverflow.com/questions/7302594

22-10-2019

Question

I have a list of strings: a fairly large set of IDs and strings spread across 4-5 big files, around 1 GB each. The strings are formatted like this:

1,Hi

2,Hi How r u?

2,How r u?

3,where r u?

3,what does this mean

3,what it means

Now I want to do text mining on these strings and prepare a dendrogram that displays them in the following way:

1-Hi

2-Hi How r u?

 ----How r u?

3-What does this mean?

 ----what it means?

3-Where are you?

This output is based on the similarity of the strings that follow the comma after an ID (suppose the ID identifies the person who used those strings), computed separately for each person. If another person used the same words, those strings should be grouped under that person, according to the strings he used.

This seems like a simple task, but I want to do it on Hadoop/Mahout or something else that can handle a huge data set on clustered Linux machines. How should I approach this problem? I have already tried different approaches in Mahout: I created a sequence file, generated sparse vectors with seq2sparse, and then tried clustering, but it didn't work for me. Any help or pointers in the right direction would be appreciated.

Thanks & Regards, Atul

Solution

I think what you really need is hierarchical clustering. One implementation was proposed for Mahout, and another is available in the Shogun Toolbox (which is also designed for large-scale computation). It is hard to guarantee that either will work, though, because this input looks like a hard case.
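To make the idea concrete, here is a minimal single-machine sketch in plain Python. It is not the Mahout or Shogun implementation. It assumes each line is "id,text", uses Jaccard distance over lowercase word sets as the similarity measure, and merges clusters by single linkage until the closest pair is farther apart than a cutoff. The function names and the 0.85 cutoff are illustrative assumptions to tune against your data. On a cluster, the grouping step would roughly correspond to a mapper emitting (id, text) pairs, with the per-ID clustering running in the reducer.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase word set; '?' is treated as whitespace (a simplifying assumption)."""
    return frozenset(text.lower().replace("?", " ").split())


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def group_by_id(lines):
    """Split 'id,text' lines on the first comma and collect the texts per id."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ident, _, text = line.partition(",")
        groups[ident].append(text)
    return groups


def cluster(strings, threshold=0.85):
    """Single-linkage agglomerative clustering, stopped at `threshold`.

    Naive and roughly cubic in the group size, so it is only suitable
    when each id has a modest number of strings.
    """
    clusters = [[s] for s in strings]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest members.
            d = min(jaccard_distance(a, b)
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    sample = ["1,Hi", "2,Hi How r u?", "2,How r u?",
              "3,where r u?", "3,what does this mean", "3,what it means"]
    for ident, texts in group_by_id(sample).items():
        for group in cluster(texts):
            # First string heads the group; the rest are listed beneath it.
            print(f"{ident}-{group[0]}")
            for other in group[1:]:
                print(f" ----{other}")
```

Word-set overlap is a crude measure for short chat-style strings like these ("r u" versus "are you" share nothing), so some normalisation step or a character-level distance may be needed before the clusters look sensible on real data.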

Licensed under: CC-BY-SA with attribution
