What are your 'rules' for tokenization (splitting long text into individual tokens). The example above seem to be implying that sometimes you have single word tokens and sometimes a multi-word ("hi i"). The multi-word case is problematic here, but you might be able to do it by combining ShingleFilterFactory to give you multi-word tokens as well as the original ones and then you keep only the items you want.
I am not sure whether KeepWord filter deals correctly with multi-word strings. If it does not, you may want to have a special separator character during shingle process and then regex filter it back to space as the last step.