Byte position in Java

Question 1

Finding unique words should be fairly easy: split on whitespace, add the strings to a Set, and whatever's in the Set at the end of the method will be the unique words in the file. This can be made arbitrarily complex though, depending on what defines a unique word (case, punctuation), and on whether characters other than whitespace separate words.
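
A minimal sketch of the Set approach, assuming words are separated only by whitespace and that case and punctuation matter (so "Dog" and "dog," would count as different words). The class and method names are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    // Splits on runs of whitespace and keeps each distinct token once.
    // LinkedHashSet preserves first-seen order; a plain HashSet also works.
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty input yields one empty token
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("the quick the lazy  dog"));
        // prints [the, quick, lazy, dog]
    }
}
```

If a different definition of "word" is needed, only the split pattern and perhaps a normalization step such as toLowerCase() would have to change.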

The byte position/distance question is a bit harder. If memory serves, String objects in Java are conceptually wrappers around char[] arrays, and a char in Java is a 16-bit UTF-16 code unit (http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html), so characters outside the Basic Multilingual Plane take two chars.

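So I'm guessing that, for the in-memory UTF-16 form, byte distance is just a linear function of the char position (two bytes per char)? For the bytes in a file that only holds if the file's encoding uses a fixed number of bytes per character.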

If you're working with other encodings, though, the getBytes() method might be useful, since it returns the bytes of a String in a given charset.

http://docs.oracle.com/javase/tutorial/i18n/text/string.html

So for something like that, a naive solution would be to determine the number of bytes each character takes in the target encoding. Summing those counts gives byte positions/distances directly, but encoding every character separately probably isn't that efficient. It should, however, yield correct results if done correctly.
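
The naive per-character idea above could be sketched as follows. This assumes a charset that writes no byte order mark per call (UTF-8, ISO-8859-1, UTF-16LE; not plain UTF-16) and input without unpaired surrogates. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteOffsets {
    // For each code point in s, returns the byte offset at which it starts
    // when s is encoded with cs. Encodes one code point at a time, so it is
    // simple rather than fast.
    static int[] byteOffsets(String s, Charset cs) {
        int[] offsets = new int[s.codePointCount(0, s.length())];
        int bytePos = 0;
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            offsets[n++] = bytePos;
            bytePos += new String(Character.toChars(cp)).getBytes(cs).length;
            i += Character.charCount(cp); // 2 for supplementary characters
        }
        return offsets;
    }

    public static void main(String[] args) {
        // "na\u00EFve fox": the i with diaeresis takes two bytes in UTF-8
        int[] utf8 = byteOffsets("na\u00EFve fox", StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8));
        // prints [0, 1, 2, 4, 5, 6, 7, 8, 9]
    }
}
```

When only one position is needed, s.substring(0, index).getBytes(cs).length gives the same offset for a char index without building the whole array.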

Question 2

Positions are counted from 0, not 1. So "quick" would have character position 4, which for US-ASCII is also the byte position. Maybe character positions suffice.

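import java.nio.charset.StandardCharsets;

String s = "The quick brown fox jumps over the lazy dog";
int charsIndex = s.indexOf("quick"); // 4: char index where "quick" starts
int charsLength = "The ".length(); // 4 chars precede it
// getBytes(Charset) throws no checked exception, unlike getBytes("UTF-8")
int bytesLength = "The ".getBytes(StandardCharsets.UTF_8).length; // 4
char ch = s.charAt(4); // 'q'
int c = s.codePointAt(4); // 113, i.e. (int) 'q'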

In Java, text (String) is always Unicode, hence all characters are possible and can be combined in one string. Bytes (byte[]) are always in some encoding, and the bytes for the same text may vary per encoding.
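
To illustrate how the byte count varies per encoding while the String stays the same, here is a small sketch. The helper names are made up, and the sample word is just an example.

```java
import java.nio.charset.StandardCharsets;

class EncodingLengths {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    static int utf16leLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // "cafe" with an acute accent on the e
        System.out.println(s.length());       // prints 4 (chars)
        System.out.println(utf8Length(s));    // prints 5 (the accented e takes 2 bytes)
        System.out.println(latin1Length(s));  // prints 4 (1 byte per char)
        System.out.println(utf16leLength(s)); // prints 8 (2 bytes per char)
    }
}
```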