Question

I hope you can help me with this problem. What I intend to do: given a text, I want to count the frequency of every n-gram of stemmed tokens, computed after the stopwords have been removed.
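To make the goal concrete, here is a minimal plain-Java sketch of the intended result. It does not use Lucene: the class name `NgramCounter`, the method `count`, and the two-word stopword list are all hypothetical, and stemming is left out to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: counts n-grams over the tokens
// that remain once the stopwords have been dropped. Stemming is omitted.
class NgramCounter {

    // Returns the frequency of every n-gram of length 1..maxN.
    static Map<String, Integer> count(String text, Set<String> stopwords, int maxN) {
        // Tokenize on whitespace and drop stopwords before building n-grams.
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                tokens.add(t);
            }
        }

        // Slide a window of each size over the remaining tokens.
        Map<String, Integer> freq = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                String gram = String.join(" ", tokens.subList(i, i + n));
                freq.merge(gram, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("to", "by"); // assumed stopword list, for illustration only
        Map<String, Integer> freq = count("to i want to to i want to linked", stop, 2);
        freq.forEach((term, f) -> System.out.println("term:" + term + "|freq:" + f));
    }
}
```

Because the stopwords are dropped before the window slides, an n-gram here can bridge the position where a stopword used to be (for example "want i"). If that is not wanted, the token list would have to be split at each stopword instead.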

This is the situation: I am indexing some texts with IndexWriter, using ShingleAnalyzerWrapper + StandardAnalyzer as the analyzer. I add each document like this: indexwriter.addDocument(doc, analyzer); where analyzer is, again, ShingleAnalyzerWrapper + StandardAnalyzer.

But the problem is: when I retrieve the terms and their frequencies, the stopwords seem to have been replaced by underscores.

This is the input:
String text = "to i want to to i want to linked";
String text2 = "super by by hard easy ";

This is the output:
term:|freq:6
term:
_|freq:2
term:_ hard|freq:1
term:_ i|freq:2
term:_ link|freq:1
term:easy|freq:1
term:hard|freq:1
term:hard easy|freq:1
term:i|freq:2
term:i want|freq:2
term:link|freq:1
term:super|freq:1
term:super _|freq:1
term:want|freq:2
term:want _|freq:2
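The underscore in the output above looks like the filler token that the shingle filter inserts where a stopword left a position gap ("_" is Lucene's default filler). One possible workaround, sketched here in plain Java under the assumption that the term/frequency pairs have already been collected into a map, is to drop every term that is empty or contains the filler. The class `FillerFilter` and the method `dropFillerTerms` are hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing step, not part of Lucene.
class FillerFilter {

    // Keeps only the terms that are non-empty and contain no filler token.
    static Map<String, Integer> dropFillerTerms(Map<String, Integer> termFreqs, String filler) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String term = e.getKey().trim();
            // A shingle is a space-separated sequence of tokens; check each one.
            boolean hasFiller = Arrays.asList(term.split("\\s+")).contains(filler);
            if (!term.isEmpty() && !hasFiller) {
                kept.merge(term, e.getValue(), Integer::sum);
            }
        }
        return kept;
    }
}
```

Calling `FillerFilter.dropFillerTerms(termFreqs, "_")` on the map built from the index would leave entries such as "i want" and "hard easy" and discard "want _" or "_ i". This only hides the filler terms after the fact; it does not change what is stored in the index.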

If anything is unclear, please ask and I will try to clarify.

Thanks for the help

No correct solution

Licensed under: CC-BY-SA with attribution