Parse both en_US and en-US as locale in Java

https://softwareengineering.stackexchange.com/questions/325458

22-12-2020

Question

I am writing an API (using Java) that takes locale as a parameter. We want the clients to be able to specify "en-US" or "en_US" as they both seem to be widely used across all languages.

I did go through these links

Java documentation (source 3 from above) states "Well-formed variant values have the form SUBTAG (('_'|'-') SUBTAG)* where SUBTAG = [0-9][0-9a-zA-Z]{3} | [0-9a-zA-Z]{5,8}. (Note: BCP 47 only uses hyphen ('-') as a delimiter, this is more lenient)."

The way I understood this, both "_" and "-" are supported, but the code supports only "-". See my sample unit test below, which fails but passes if I use "en-US".

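@Test
public void testLocale() {
    // "en_US" is not a well-formed BCP 47 tag, so the parsed language comes back empty.
    Locale locale = Locale.forLanguageTag("en_US");
    assertThat(locale.getLanguage(), equalTo("en")); // fails here; passes with "en-US"
}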

Is there a way to parse a locale from both forms of the string, "en_US" and "en-US"? What is the recommended approach here?

Solution

Replace every occurrence of "_" with "-" in the locale string when you receive the parameter, and let the rest of the logic expect only "-".
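A minimal sketch of that normalization, assuming the parameter arrives as a plain string. The class name LocaleParamSketch and the helper parseLocale are hypothetical names used only for illustration; Locale.forLanguageTag is the JDK method from the question.

```java
import java.util.Locale;

class LocaleParamSketch {

    // Hypothetical helper: accept "en_US" or "en-US" and parse it as a BCP 47 tag.
    static Locale parseLocale(String raw) {
        // Normalize the separator first, then let the JDK parse the tag.
        String tag = raw.trim().replace('_', '-');
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        System.out.println(parseLocale("en_US").toLanguageTag()); // en-US
        System.out.println(parseLocale("en-US").toLanguageTag()); // en-US
    }
}
```

This only covers simple language/region strings; inputs that are not plain language tags with a swapped separator need more care.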

Other tips

Rather than replacing characters without understanding why, I'd like to explain why there appear to be two forms, one with underscores and one with hyphens, and why one should care.

tl;dr: it is not as simple as a single-character replacement.

1. specifications

There are several specifications at play here:

ICU: the International Components for Unicode, which has a section on how to encode and represent locales (language, country, script, variant, etc.). This project is mostly used from native languages like C. Its locale identifiers should not be confused with the locales of a POSIX system.
Unicode CLDR: interestingly, it offers a simple page on the equivalence with language tags. This representation allows both hyphens and underscores, while preferring the hyphen. It also differs from ICU.
BCP 47 (RFC 5646, which obsoletes RFC 4646): defines the language tags used in HTTP headers (RFC 3282), especially Content-Language and Accept-Language (the latter takes language ranges).

So the standard on the web is to use language tags in certain headers. That means:

If the server application relies on standard HTTP headers, it should handle language tags (more specifically, language ranges for Accept-Language).
If the server application uses a custom header like Company-MyApp-Other-Language, then there is no problem; the format is specific to that ecosystem.
If client applications use standard HTTP headers but with a bad format, then the server application should indeed try to handle those.
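For the first case, here is a minimal sketch of handling a standard Accept-Language value with the JDK's language range support (Locale.LanguageRange and Locale.lookup, available since Java 8). The class name, the SUPPORTED list, and the fallback to English are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AcceptLanguageSketch {

    // Hypothetical set of locales this server can actually serve.
    static final List<Locale> SUPPORTED = Arrays.asList(Locale.ENGLISH, Locale.FRENCH);

    // Pick the best supported locale for a well-formed Accept-Language value.
    static Locale pick(String acceptLanguage) {
        // Parses the ranges and orders them by descending weight (q value).
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        // Lookup returns null when nothing matches.
        Locale best = Locale.lookup(ranges, SUPPORTED);
        return best != null ? best : Locale.ENGLISH; // assumed default
    }

    public static void main(String[] args) {
        System.out.println(pick("fr-CH, fr;q=0.9, en;q=0.8").toLanguageTag());
    }
}
```

A header in a bad format would have to be detected and repaired before it reaches pick; this sketch only covers the well-formed case.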

2. some implementation details

The nice thing about PHP's Locale is that it handles both the ICU format and language tags.

For Java, however, the story is a bit more complicated. The Locale class is pretty old (JDK 1.1, circa 1997) and could not parse or format (toString) any of the standards mentioned above; it used the underscore as a separator, but it is not ICU compliant. In JDK 1.7 (circa 2011), forLanguageTag and toLanguageTag were added to support BCP 47 (RFC 5646); the old methods kept their legacy behaviour for backward compatibility. As of this writing (March 2018), the ICU / CLDR locale format is not supported by the JDK's Locale.
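To make the two Java representations concrete, here is a small sketch comparing the legacy toString output with the BCP 47 output of toLanguageTag for the same Locale. The class name is hypothetical; the Locale methods are the JDK ones named above.

```java
import java.util.Locale;

class LegacyVersusTagSketch {
    public static void main(String[] args) {
        // Build en/US explicitly so the example does not depend on any parsing.
        Locale us = new Locale.Builder().setLanguage("en").setRegion("US").build();

        System.out.println(us.toString());      // legacy form: en_US
        System.out.println(us.toLanguageTag()); // BCP 47 form: en-US

        // forLanguageTag only understands the hyphenated form.
        System.out.println(Locale.forLanguageTag("en-US").equals(us)); // true
    }
}
```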

Indeed, the naive approach of replacing underscores with hyphens works in most situations (99.999%). Some rare failures can appear when the string contains only an extension.

However, this does not work for every ICU representation, especially when there are ICU keywords like @currency=...; those cannot be converted via a single-character replacement.

sr_Latn_RS_REVISED@currency=USD
en_IE@currency=IEP
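The following sketch shows what happens to such identifiers with the naive replacement, next to one possible and deliberately lossy workaround that drops everything from the "@" onward before parsing. Both helper names and the class name are hypothetical, and discarding the keywords is an assumption about what a server might tolerate, not a conversion defined by ICU or the JDK.

```java
import java.util.Locale;

class IcuKeywordSketch {

    // Naive approach: swap the separator and hope for the best.
    static Locale naive(String id) {
        return Locale.forLanguageTag(id.replace('_', '-'));
    }

    // Lossy sketch: discard ICU keywords ("@currency=...") before parsing.
    static Locale withoutKeywords(String icuId) {
        int at = icuId.indexOf('@');
        String base = at >= 0 ? icuId.substring(0, at) : icuId;
        return Locale.forLanguageTag(base.replace('_', '-'));
    }

    public static void main(String[] args) {
        // The subtag carrying the keyword is ill-formed, so the region is lost.
        System.out.println(naive("en_IE@currency=IEP").getCountry().isEmpty()); // true

        // After stripping the keyword the region survives, but the currency is gone.
        System.out.println(withoutKeywords("en_IE@currency=IEP").getCountry()); // IE

        Locale sr = withoutKeywords("sr_Latn_RS_REVISED@currency=USD");
        System.out.println(sr.getScript() + " " + sr.getCountry()); // Latn RS
    }
}
```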

This also won't work for Java clients that send the toString of a Java Locale. That is more likely to cause problems for countries where the script is an important part of the locale, e.g. for Serbia:

Locale.forLanguageTag("sr-Latn-RS").toString() ⇒ sr_RS_#Latn. This legacy toString output cannot be interpreted as a valid language tag; in fact, the parser dismisses the script completely.
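A short sketch of that round trip, assuming a client sends the toString output and the server applies the underscore-to-hyphen replacement before parsing. The class name is hypothetical.

```java
import java.util.Locale;

class ToStringRoundTripSketch {
    public static void main(String[] args) {
        Locale serbianLatin = Locale.forLanguageTag("sr-Latn-RS");

        // What a client relying on toString would send.
        String legacy = serbianLatin.toString();
        System.out.println(legacy); // sr_RS_#Latn

        // Server side: naive replacement, then parse as a language tag.
        Locale reparsed = Locale.forLanguageTag(legacy.replace('_', '-'));
        System.out.println(reparsed.getCountry());          // RS
        System.out.println(reparsed.getScript().isEmpty()); // true: the script is gone

        // toLanguageTag is the form that survives a round trip.
        System.out.println(serbianLatin.toLanguageTag()); // sr-Latn-RS
    }
}
```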

Just reading the toString javadoc should raise an eyebrow:

Returns: A string representation of the Locale, for debugging.

(Emphasis is mine.)

So handling every possible bad behaviour in a standard header is tricky. The best option is to require clients to follow the language tag RFC, as browsers do. If that is not possible, it is best to identify which application is responsible for the misbehaviour and which format it uses. This avoids having to handle every possible format.

3. post scriptum

Besides, the javadoc passage quoted in the question applies only to the variant part of a locale, not to the language and country:

Well-formed variant values have the form SUBTAG (('_'|'-') SUBTAG)* where SUBTAG = [0-9][0-9a-zA-Z]{3} | [0-9a-zA-Z]{5,8}. (Note: BCP 47 only uses hyphen ('-') as a delimiter, this is more lenient).

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange