StandardTokenizer should split tokens at newlines, spaces, and similar boundaries, and the stopword filter looks, at a glance, like it should be working correctly. You should probably put a LowerCaseFilter before your StopwordFilter, though, so that stop word matching is not case sensitive.
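To illustrate why the filter order matters, here is a minimal sketch in plain Python. It is a toy model of an analysis chain, not Lucene's API: `tokenize`, `lowercase_filter`, `stop_filter`, and the `STOP_WORDS` set are all hypothetical names made up for the example.

```python
import re

# Hypothetical stop set; like most stop lists, it is all lowercase.
STOP_WORDS = {"the", "a", "an"}

def tokenize(text):
    # Rough stand-in for a tokenizer: pull out runs of word characters.
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # Normalize every token to lowercase.
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # Case-sensitive membership test against the stop set.
    return [t for t in tokens if t not in STOP_WORDS]

text = "The cat sat on the mat"

# Stop filter alone: the capitalized "The" does not match "the" and survives.
stop_only = stop_filter(tokenize(text))

# Lowercasing first: both "The" and "the" are removed.
lower_then_stop = stop_filter(lowercase_filter(tokenize(text)))

print(stop_only)        # ['The', 'cat', 'sat', 'on', 'mat']
print(lower_then_stop)  # ['cat', 'sat', 'on', 'mat']
```

Without the lowercasing step, any stop word at the start of a sentence would slip through the filter.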
I wonder if a pertinent question might be: what do you mean by "removed"? Analysis only affects the indexed representation of the field. It does not affect the stored version that you retrieve from the index in any way. It is meant to facilitate searching, not to transform the stored text. If you remove the word "the" through the filter, you should no longer get any hits on "the" when searching, but you will still see it when you retrieve the document from the index.
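The split between indexed terms and stored text can be sketched with a toy index in plain Python. This is only a simplified model under my own assumptions, not how Lucene is implemented: `ToyIndex`, `analyze`, and the one-word stop set are hypothetical.

```python
import re

STOP_WORDS = {"the"}  # hypothetical stop set for the example

def analyze(text):
    # Toy analysis chain: tokenize, lowercase, then drop stop words.
    tokens = (w.lower() for w in re.findall(r"\w+", text))
    return [t for t in tokens if t not in STOP_WORDS]

class ToyIndex:
    def __init__(self):
        self.postings = {}  # analyzed term -> set of document ids (searchable)
        self.stored = {}    # document id -> original text (returned as-is)

    def add(self, doc_id, text):
        # The stored copy is kept untouched; only the postings see analysis.
        self.stored[doc_id] = text
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term, set()))

index = ToyIndex()
index.add(1, "The quick brown fox")

hits_the = index.search("the")  # [] because "the" was never indexed
hits_fox = index.search("fox")  # [1]
stored_text = index.stored[1]   # "The quick brown fox", stop word included

print(hits_the, hits_fox, stored_text)
```

So if "the" still shows up in the text you get back from a retrieved document, that alone does not mean the stop filter failed. Checking whether a search for "the" returns hits is the more telling test.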