Finding unique words should be fairly easy: split on whitespace, add the strings to a Set, and whatever's in the Set at the end of the method will be the unique words in the file. This can be made arbitrarily complex though, depending on what defines a unique word and whether characters other than whitespace separate words.
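Here's a rough sketch of that idea, assuming the file's contents have already been read into a String. The class and method names (UniqueWords, uniqueWords) are made up for illustration, and "word" here means nothing more than a whitespace-separated token, so punctuation and case are not normalized:

```java
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    // Hypothetical helper: splits on runs of whitespace and collects the distinct tokens.
    // A LinkedHashSet keeps the words in first-seen order; a HashSet works too if order doesn't matter.
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            // Splitting an empty or all-whitespace string yields one empty token; skip it.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [the, quick, fox, jumps, over, lazy]
        System.out.println(uniqueWords("the quick fox jumps over the lazy fox"));
    }
}
```

If "Fox" and "fox." should count as the same word, you'd lowercase and strip punctuation before adding to the Set, which is where the extra complexity comes in.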
The byte position/distance question is a bit harder. If memory serves, String objects in Java are (at least conceptually) wrappers around char[] objects, and chars are 16-bit Unicode (UTF-16) code units in Java (http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html). So I'm guessing that, in memory, byte distance is just a linear function of the character position (two bytes per char)?
If you're working with other encodings though, the getBytes() method might be useful.
http://docs.oracle.com/javase/tutorial/i18n/text/string.html
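To illustrate why the encoding matters, here's a small sketch using String.getBytes(Charset) with a few of the constants from StandardCharsets. The EncodedLength class and encodedLength method are just illustrative names, not anything from the question:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLength {
    // Number of bytes the whole string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "hello" with an e-acute, written as an escape so the source file's own encoding doesn't matter.
        String s = "h\u00e9llo";
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 6
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // 10
    }
}
```

Same five characters, three different byte counts, so a byte position only means something relative to a particular encoding.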
So for something like that, a naive solution would be to determine the number of bytes each character takes up in the target encoding; summing those makes calculating byte positions/distances really easy, though it probably isn't that efficient. It should, however, give correct results as long as the same encoding is used throughout.
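A minimal sketch of that naive approach might look like the following. The names (ByteOffsets, byteOffsets) are hypothetical, and it assumes a charset that encodes each character independently and without a byte-order mark, such as UTF-8. With StandardCharsets.UTF_16, for instance, every getBytes() call would prepend a BOM and throw the sums off. It walks the string by code point so that surrogate pairs are encoded as a unit:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteOffsets {
    // Hypothetical helper: offsets[i] is the byte offset at which char index i starts
    // in the encoded text; offsets[text.length()] is the total encoded length.
    static int[] byteOffsets(String text, Charset charset) {
        int[] offsets = new int[text.length() + 1];
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            // A code point is one char, or two chars for a surrogate pair.
            int charCount = Character.charCount(text.codePointAt(i));
            offsets[i] = bytes;
            if (charCount == 2) {
                offsets[i + 1] = bytes; // the low surrogate shares its code point's offset
            }
            // Encode just this code point and count the bytes it produces.
            bytes += text.substring(i, i + charCount).getBytes(charset).length;
            i += charCount;
        }
        offsets[text.length()] = bytes;
        return offsets;
    }

    public static void main(String[] args) {
        // 'a', e-acute, 'b': the e-acute takes two bytes in UTF-8.
        int[] offsets = byteOffsets("a\u00e9b", StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(offsets)); // [0, 1, 3, 4]
        System.out.println(offsets[2] - offsets[0]);  // byte distance from 'a' to 'b': 3
    }
}
```

Once you have the offsets array, the byte distance between any two character positions is a single subtraction. The cost is one small allocation per character, which is the inefficiency mentioned above.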