How to compare different language String values in JAVA?

https://stackoverflow.com/questions/4287694

28-09-2019
|

Question

In my web application I am using two different Languages namely English and Arabic.

I have a search box in my web application in which if we search by name or part of the name then it will retrieve the values from DB by comparing the "Hometown" of the user

Explanation:

Like if a user belongs to hometown "California" and he searches a name say "Victor" then my query will first see the people who are having the same hometown "California" and in the list of people who have "California" as hometown the "Victor" *name* will be searched and it retrieve the users having "California" as their hometown and "victor" in their name or part of the name.

The problem is if the hometown "California" is saved in English it will compare and retrieve the values. But "California" will be saved as "كاليفورنيا" in Arabic. In this case the hometown comparison fails and it cant retrieve the values.

I wish that my query should find both are same hometown and retrieve the values. Is it possible?

What alternate I should think of for this logic for comparison. I am confused. Any suggestion please?

EDIT: *I have an Idea such that if the hometown is got then is it possible to use Google translator or transliterator and change the hometown to another language. if it is in english then to arabic or if it is in english then to arabic and give the search results joining both. Any suggestion?*

Solution

Transliterate all names into the same language (e.g. English) for searching, and use Levenstein edit distance to compute the similarity between the phonetic representations of the names. This will be slow if you simply compare your query with every name, but if you pre-index all of the place names in your database into a Burkhard-Keller tree, then they can be efficiently searched by edit distance from the query term.

This technique allows you to sort names by how close they actually match. You're probably more likely to find a match this way than using metaphone or double-metaphone, though this is more difficult to implement.

OTHER TIPS

The problem you encounter is that you want / need information in 2 or more languages and you want the user of your application to be able to use both languages. One possible approach is to keep multiple records per item and including a language code as part of the primary key, for instance if your record is

id   hometown   name
001  California Victor

you could introduce a language code and store

id   lang hometown   name
001  en   California Victor
001  ar   كاليفورنيا Victor

then your search would match either "California" or "كاليفورنيا" giving you the id 001, which you can then use to load all translations of your data (or just the data in the current output language.) This sceme can be used with any number of languages and has the added advantage that you don't need to prefill the table. You can add new translations for records when they become known.

(Caveat: I just repeated your arabic string, I can't read it, also 'ar' most likely isn't the correct language code for aribic but you get the idea.)

Does the Arabic sound like "California"? If so you will need to compare on a "sounds-like"-basis which will most likely result in a phoneme conversion.

Your Google suggestion sounds like it might also be a good one, but you should play around with it, and be sure that you're happy with its accuracy. In testing how it worked going between Hebrew and English, I noticed that sometimes Google just leaves English place names in English letters when translating to Hebrew.

How about you use some localization on client side to display values. Or create a wrapper class for hometown that will override equal(Object) in the manner the instance for California will return true for both "California" and "كاليفورنيا" (sorry if I made mistake here, just copy-pasted from above).

This sounds like a classic encoding problem. Whenever you transfer non-ascii character you need to make sure you're encoding it right. For Arabic and English I suspect you can use UTF-8 (but I don't know arabic, so it may be wrong).

In your setup you will probably have the following points:

Browser <-> Servlet container <-> Database
                   |
                System.out

In any of the system interfaces where chars (16-bit) are converted to byte (8-bit) you will need to make sure the encoding is correct.

Browser to Servlet container

When you do GET or POST requests from a web-page, the browser will look at 1) The HTTP headers from the server, especially the Content-Type: text/html; charset=UTF-8, which if present, will override the HTML meta header <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">.

On the servlet container side, the HttpServletRequest.getParameter(), will have an encoding that you most likely need to set in the server settings.

Example tomcat's server.xml

<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
           maxThreads="2000"                
           connectionTimeout="20000" 
           redirectPort="8443" />

Servlet container to Database

The database needs to have the correct encodings, or sorting etc will not be right.

Example my.cnf for MySQL

[mysqld] 
 ....
init_connect=''SET collation_connection = utf8_general_ci'' 
init_connect='SET NAMES utf8' 
default-character-set=utf8 
character-set-server = utf8 
collation-server = utf8_general_ci 

[mysql] 
 ....
default-character-set=utf8

Then the JDBC-driver needs to be set for UTF-8.

Example JDBC connect string

jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8

System.out

System.out.printnln() can not be relied upon to verify things. First it depends on the java vm default encoding, set using System.property -Dfile.encoding=UTF-8, secondly the terminal in which you do the System.out, will need to be set to and support UTF-8. Don't trust System.out!

Once a String in the VM is a proper character, it will not be affected by encoding. In memory every char in a string is 16-bit, which (almost) covers all the chars that utf-8 can encode. You can write the string to a file and investigate the file to really know if you got correct chars in your VM.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow