Question

I have searched for some time on this matter and didn't find a proper answer anywhere.

Let's say I have a string:

"The quick brown fox jumps over the lazy dog"

I need to find the unique words in this string, their byte positions, and the byte distance between occurrences of the same word.

OK, I can manage to find the words, but what is their byte position, and are there any ideas on how to track the distance in bytes? For example: is 5 the position of the string "quick", and how is that converted to bytes?

I hope this doesn't sound too stupid (I am fairly new to Java).


Solution

Finding unique words should be fairly easy: split on whitespace, add the strings to a Set, and whatever is in the Set at the end of the method will be the unique words in the string. This can be made arbitrarily complex though, depending on what defines a unique word (case, punctuation) and on whether characters other than whitespace separate words.
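The split-and-collect idea above can be sketched as follows. This is only a minimal illustration: the class and method names (UniqueWords, uniqueWords) are made up for this example, words are assumed to be separated by whitespace only, and the comparison is case-sensitive, so "The" and "the" count as two different words.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    // Splits on runs of whitespace and keeps each distinct word once,
    // in the order it was first seen. Hypothetical helper for illustration.
    static Set<String> uniqueWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("The quick brown fox jumps over the lazy dog"));
    }
}
```

If "The" and "the" should be the same word, lower-casing the text before splitting would be one way to do that.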

The byte position/distance question is a bit harder. If memory serves, a Java String is exposed as a sequence of char values, and a char in Java is a 16-bit UTF-16 code unit (http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html). Most characters take one char; characters outside the Basic Multilingual Plane take two.

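So I'm guessing that, in terms of UTF-16 code units, byte distance is just a linear function of the char position (two bytes per char)? That only holds for fixed-width encodings, though; in a variable-width encoding such as UTF-8 the number of bytes per character varies.

If you're working with other encodings, the getBytes() method might be useful.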

http://docs.oracle.com/javase/tutorial/i18n/text/string.html

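So for a variable-width encoding, a naive solution would be to determine the number of bytes each character encodes to and keep a running total. The byte position of a word is then the total at its first character, and the byte distance between two occurrences is the difference between their byte positions. Determining the byte count character by character probably isn't that efficient, but it should yield correct results if done correctly.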
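A rough sketch of that running-total idea is below. Instead of encoding one character at a time, it encodes the stretch of text between the start of one word and the start of the next, which gives the same totals. The names (BytePositions, byteOffsets, gaps) are invented for this example. It assumes words are runs of non-whitespace and that the charset is stateless and writes no byte order mark per call (UTF-8, US-ASCII, UTF-16LE and similar); the plain "UTF-16" charset would not work here because getBytes adds a BOM each time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BytePositions {
    // Maps each word to the byte offsets of all its occurrences in the given encoding.
    static Map<String, List<Integer>> byteOffsets(String text, Charset charset) {
        Map<String, List<Integer>> offsets = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int charPos = 0; // char index up to which bytes have been counted
        int bytePos = 0; // number of encoded bytes for text[0, charPos)
        while (m.find()) {
            // Encode only the text between the previous word start and this one.
            bytePos += text.substring(charPos, m.start()).getBytes(charset).length;
            charPos = m.start();
            offsets.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(bytePos);
        }
        return offsets;
    }

    // Byte distances between consecutive occurrences of one word.
    static List<Integer> gaps(List<Integer> positions) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < positions.size(); i++) {
            result.add(positions.get(i) - positions.get(i - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "The quick brown fox jumps over the lazy dog";
        byteOffsets(s, StandardCharsets.UTF_8)
                .forEach((word, pos) -> System.out.println(word + " " + pos + " " + gaps(pos)));
    }
}
```

For the sentence from the question, "quick" ends up at byte offset 4 and "the" at 31 in UTF-8, the same as their char positions because the text is pure ASCII. With non-ASCII text the two diverge: in "n\u00e9 the n\u00e9" the second "n\u00e9" starts at char 7 but at byte 8 in UTF-8.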

OTHER TIPS

Positions are counted from 0, not 1. So "quick" would have character position 4, which for US-ASCII is also the byte position. Maybe character positions suffice.

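import java.nio.charset.StandardCharsets;

String s = "The quick brown fox jumps over the lazy dog";
int charsIndex = s.indexOf("quick"); // 4 (zero-based char index)
int charsLength = "The ".length(); // 4 chars before "quick"
int bytesLength = "The ".getBytes(StandardCharsets.UTF_8).length; // 4 bytes before "quick" in UTF-8
char ch = s.charAt(4); // 'q'
int c = s.codePointAt(4); // 113, the code point of 'q'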

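In Java, text (String) is always Unicode, so characters from any script are possible and can be combined in one string. Bytes (byte[]) are always in some encoding, and the same text can yield different bytes, and a different number of bytes, per encoding.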
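To make the difference between chars and bytes concrete, here is a small sketch that measures the same text in a few encodings. The class and method names (EncodingSizes, byteLength) are made up for this example; the characters are written as Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the string occupies in the given encoding.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";      // e with acute accent, one char
        String emoji = "\uD83D\uDE00"; // one code point outside the BMP, two chars

        System.out.println(byteLength(eAcute, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteLength(eAcute, StandardCharsets.UTF_8));      // 2
        System.out.println(byteLength(eAcute, StandardCharsets.UTF_16LE));   // 2

        System.out.println(emoji.length());                                  // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));         // 1
        System.out.println(byteLength(emoji, StandardCharsets.UTF_8));       // 4
    }
}
```

So a "byte position" only has a meaning once the encoding is fixed, for example the encoding of the file the text was read from.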

Licensed under: CC-BY-SA with attribution