Apache UIMA to parse multi-lingual content

https://stackoverflow.com/questions/20539845

31-08-2022

Question

I am trying to parse content in non-English languages such as Korean and Chinese. Does UIMA have any built-in support for this? I could not find much information on the Apache UIMA portal. All I could think of was writing Unicode regex patterns, but for some reason even those don't seem to work: my regex pattern containing a Unicode character does not annotate the word I need.

I am using JDK 1.7 and UIMA 2.4.2.

Any help or suggestion is greatly appreciated.

Below is an example of what I am trying:

Text : Numéro de réservation 445566553 Code [This text is in a file which I am reading using FileSystemCollectionReader and I have set the encoding to UTF-8]

My RegEx (?<=Num\u00E9ro\sde\sr\u00E9servation\s)(.*?)(?=\sCode)

Expected Output : 445566553

Solution

I'm not sure whether the problem you're having is with UIMA, but the regular expression you posted works fine for me in plain Java (I'm using Java 1.7.0_45). I modified it slightly to allow multiple whitespace characters around the number. Here is an SSCCE that prints '445566553' when run:

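import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        // Extra spaces and a tab around the number exercise the \s+ parts of the pattern.
        String test = "Numéro de réservation   445566553 \tCode";
        // Lookbehind anchors on the label, group 1 captures the number, lookahead anchors on "Code".
        Pattern pattern = Pattern.compile("(?<=Num\\u00E9ro\\sde\\sr\\u00E9servation)\\s+(.*?)\\s+(?=Code)");
        Matcher matcher = pattern.matcher(test);
        // Print every captured value in quotes so stray whitespace would be visible.
        while (matcher.find()) {
            System.out.println("'" + matcher.group(1) + "'");
        }
    }
}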

If this isn't what you're looking for, providing an SSCCE that contains the test case you would like to fix would be helpful.
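Since the text in the question comes from a file, one way to narrow things down outside UIMA is to check that the pattern still matches after the text has been written to disk and read back. The following is only a sketch under assumptions: it uses plain Java 7 with no UIMA classes, a temporary file stands in for whatever FileSystemCollectionReader reads, and the class and method names (Utf8RegexCheck, extract, roundTrip) are made up for illustration. It applies the regex from the question unchanged and shows that decoding the same bytes with the wrong charset is enough to make the match fail.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of UIMA.
public class Utf8RegexCheck {
    // The regex from the question, written as a Java string literal
    // (backslashes doubled so the regex engine sees \u00E9 and \s).
    private static final Pattern RESERVATION = Pattern.compile(
            "(?<=Num\\u00E9ro\\sde\\sr\\u00E9servation\\s)(.*?)(?=\\sCode)");

    // Returns the first captured value, or null when the pattern does not match.
    public static String extract(String text) {
        Matcher matcher = RESERVATION.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Writes the text to a temporary file as UTF-8, then decodes the bytes
    // with the given charset (assumption: this mimics reading an input file).
    public static String roundTrip(String text, Charset readAs) throws IOException {
        Path file = Files.createTempFile("reservation", ".txt");
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(file), readAs);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the sample independent of the source file encoding.
        String text = "Num\u00E9ro de r\u00E9servation 445566553 Code";
        // Decoded as UTF-8: the pattern matches and prints 445566553.
        System.out.println(extract(roundTrip(text, StandardCharsets.UTF_8)));
        // Decoded as ISO-8859-1: the accented letters are garbled, so this prints null.
        System.out.println(extract(roundTrip(text, StandardCharsets.ISO_8859_1)));
    }
}
```

If the UTF-8 case matches here but the annotator still finds nothing, the encoding of the input file or the way the pattern is supplied to the annotator would be the next things to check.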

Licensed under: CC-BY-SA with attribution
