Question

Anyone have an idea about what could be going on here?

The first block shows what I would expect to see: the first character of each string is at index '0'. Here the 'problem' string is commented out and replaced by a freshly typed copy of the exact same text, which has never been run before.

public void finderTest(){
    String theDoc = "Hello, I want this to work, and work well! Do you think it will work, and if not, why not?";
    //String wordOne = "‭abc"; // old, pre-used string, used to hold a comma.
    String wordOne = "abc";// new, never run before with a comma
    String wordTwo = "and";
    System.out.println("Type of character at index '0' in theDoc: "+Character.getType(theDoc.charAt(0)));
    System.out.println("Character at index '0' in theDoc: "+theDoc.charAt(0));
    System.out.println();
    System.out.println("All of wordOne: "+"'"+wordOne+"'");
    System.out.println("Type of character at index '0' in wordOne: "+Character.getType(wordOne.charAt(0)));
    System.out.println("Character at index '0' in wordOne: "+wordOne.charAt(0));
    System.out.println();
    System.out.println("Type of Character at index '0' in wordTwo: "+Character.getType(wordTwo.charAt(0)));
    System.out.println("Character at index '0' in wordTwo: "+wordTwo.charAt(0));
}

Which gives output:

/*
Type of character at index '0' in theDoc: 1
Character at index '0' in theDoc: H

All of wordOne: 'abc'
Type of character at index '0' in wordOne: 2 // okay
Character at index '0' in wordOne: a // okay

Type of Character at index '0' in wordTwo: 2
Character at index '0' in wordTwo: a
*/

The second block has the 'new' string commented out, and the first character of 'wordOne' appears to be nothing. It isn't a null character or a newline. I had been using that variable to find commas in 'theDoc', but when I ran it, index '0' held nothing and index '1' had the comma in it. If I copy and paste the string, the problem remains. However, commenting it out or deleting it gets rid of the issue.

public void finderTest(){
    String theDoc = "Hello, I want this to work, and work well! Do you think it will work, and if not, why not?";
    String wordOne = "‭abc"; // now running old string, used to hold comma
    //String wordOne = "abc"; 
    String wordTwo = "and";
    System.out.println("Type of character at index '0' in theDoc: "+Character.getType(theDoc.charAt(0)));
    System.out.println("Character at index '0' in theDoc: "+theDoc.charAt(0));
    System.out.println();
    System.out.println("All of wordOne: "+"'"+wordOne+"'");
    System.out.println("Type of character at index '0' in wordOne: "+Character.getType(wordOne.charAt(0)));
    System.out.println("Character at index '0' in wordOne: "+wordOne.charAt(0));
    System.out.println();
    System.out.println("Type of Character at index '0' in wordTwo: "+Character.getType(wordTwo.charAt(0)));
    System.out.println("Character at index '0' in wordTwo: "+wordTwo.charAt(0));
}

Which gives output:

/*  
    Type of character at index '0' in theDoc: 1
    Character at index '0' in theDoc: H

    All of wordOne: '‭abc'
    Type of character at index '0' in wordOne: 16 // What does this mean?
    Character at index '0' in wordOne: ‭   // where is the a? (well, its in wordOne index '1'... but why??)

    Type of Character at index '0' in wordTwo: 2
    Character at index '0' in wordTwo: a
*/

Is there something about commas or symbols in Java that would cause an issue like this? I tried using character arrays and cleaning the workspace to rebuild everything, and nothing has changed this. That is a huge problem for finding indices of 'ngrams' within sentences, when some grams are things like ", and". At one point last night it was working, and then all of a sudden it stopped working. I'm quite confused.

Any ideas?

Thanks,

Andrew

Solution

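Character.getType returns the Unicode general category, and 16 is Character.FORMAT (category Cf), i.e. an invisible formatting character. The directionality constant DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING (the directionality of U+202B) also happens to have the value 16, but that value comes from Character.getDirectionality, not getType. Either way, it's an unprintable character; you can print its hex value to confirm exactly which one it is.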
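
A minimal sketch of that hex check. It assumes the hidden character is U+202B, written as an explicit escape so it stays visible in the source; the code point actually present in the question's string may be a different format character. The class name HiddenCharDemo and the helper describe are made up for illustration.

```java
class HiddenCharDemo {
    // Hypothetical helper: formats a char as its code point in hex plus its general category.
    static String describe(char c) {
        return String.format("U+%04X type=%d", (int) c, Character.getType(c));
    }

    public static void main(String[] args) {
        // Assumption: U+202B stands in for the invisible character from the question.
        String wordOne = "\u202Babc";
        // Print the first two characters so the invisible one shows up as a code point.
        System.out.println(describe(wordOne.charAt(0)));
        System.out.println(describe(wordOne.charAt(1)));
    }
}
```

Running the same describe call on the real string from the question would show which code point is actually sitting at index 0.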

OTHER TIPS

I tried pasting your example into Eclipse and it told me this:

Some characters cannot be mapped using "Cp1252" character encoding.

and pointed me to the first character in the string:

String wordOne = "abc";

It appears there is a hidden (non-printable) character between the " and the a.

Your string contains a character you're having trouble seeing (before the 'a'). There are dozens of characters in the Unicode set which have no meaningful visual representation - this is probably one of them.
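
If retyping every literal is not practical, one possible workaround is to strip format characters before searching. This is only a sketch: the class name InvisibleCharScan and the helper stripFormatChars are hypothetical, and U+202B is an assumed stand-in for whatever character was pasted in.

```java
class InvisibleCharScan {
    // Hypothetical helper: removes every character in Unicode general category Cf (format).
    static String stripFormatChars(String s) {
        return s.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        String theDoc = "Hello, I want this to work, and work well!";
        // Assumption: the n-gram picked up an invisible U+202B in front of the comma.
        String gram = "\u202B, and";
        // The search fails while the hidden character is part of the n-gram.
        System.out.println(theDoc.indexOf(gram));
        // After stripping format characters the n-gram can be found again.
        System.out.println(theDoc.indexOf(stripFormatChars(gram)));
    }
}
```

Note that category Cf also covers characters such as the zero-width joiner, so stripping all of them may not be appropriate for text in scripts that rely on them.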

The '16' is the character type, for example:

COMBINING_SPACING_MARK, CONNECTOR_PUNCTUATION, CONTROL, CURRENCY_SYMBOL, DASH_PUNCTUATION, DECIMAL_DIGIT_NUMBER, ENCLOSING_MARK, END_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, FORMAT, INITIAL_QUOTE_PUNCTUATION, LETTER_NUMBER, LINE_SEPARATOR, LOWERCASE_LETTER, MATH_SYMBOL, MODIFIER_LETTER, MODIFIER_SYMBOL, NON_SPACING_MARK, OTHER_LETTER, OTHER_NUMBER, OTHER_PUNCTUATION, OTHER_SYMBOL, PARAGRAPH_SEPARATOR, PRIVATE_USE, SPACE_SEPARATOR, START_PUNCTUATION, SURROGATE, TITLECASE_LETTER, UNASSIGNED, UPPERCASE_LETTER

All of which are defined as constants in the Character class. The documented value 16 is Character.FORMAT, but rather than relying on the number you should compare against those constants. Or, better yet, use Character.getName to find the human-readable name of the character.
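
A short sketch of both suggestions, comparing against the Character constants and asking for the character's name. The class name CharTypeLookup and the helper nameOf are hypothetical, and U+202B is again only an assumed sample character.

```java
class CharTypeLookup {
    // Hypothetical helper: returns the Unicode name of a char via Character.getName (Java 7+).
    static String nameOf(char c) {
        return Character.getName(c);
    }

    public static void main(String[] args) {
        // Assumption: U+202B is used as a sample invisible character.
        char c = '\u202B';
        // Compare against the named constant instead of the bare number 16.
        System.out.println(Character.getType(c) == Character.FORMAT);
        // Directionality is a separate lookup with its own set of constants.
        System.out.println(Character.getDirectionality(c) == Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING);
        System.out.println(nameOf(c));
    }
}
```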

Licensed under: CC-BY-SA with attribution