How to start with UIMA and simple NLP tasks?

https://stackoverflow.com/questions/19840576

29-07-2022

Question

I've recently found out about UIMA (http://uima.apache.org/). It looks promising for simple NLP tasks, such as tokenizing, sentence splitting, part-of-speech tagging etc.

I've managed to get my hands on an already configured minimal Java sample that uses OpenNLP components for its pipeline.

The code looks like this:

public void ApplyPipeline() throws IOException, InvalidXMLException,
        ResourceInitializationException, AnalysisEngineProcessException {

    XMLInputSource in = new XMLInputSource(
            "opennlp/OpenNlpTextAnalyzer.xml");
    ResourceSpecifier specifier = UIMAFramework.getXMLParser()
            .parseResourceSpecifier(in);

    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);

    JCas jcas = ae.newJCas();
    jcas.setDocumentText("This is my text.");

    ae.process(jcas);
    this.doSomethingWithResults(jcas);

    jcas.reset();
    ae.destroy();
}

private void doSomethingWithResults(JCas jcas) {
    AnnotationIndex<Annotation> idx = jcas.getAnnotationIndex();
    FSIterator<Annotation> it = idx.iterator();

    while (it.hasNext()) {
        System.out.println(it.next().toString());
    }

}

Excerpt from OpenNlpTextAnalyzer.xml:

<delegateAnalysisEngine key="SentenceDetector">
    <import location="SentenceDetector.xml" />
</delegateAnalysisEngine>
<delegateAnalysisEngine key="Tokenizer">
    <import location="Tokenizer.xml" />
</delegateAnalysisEngine>

The Java code produces output like this:

Token
   sofa: _InitialView
   begin: 426
   end: 435
   pos: "NNP"

I'm trying to get the same information from each Annotation object that the toString() method uses. I've already looked into UIMA's source code to understand where the values are coming from. My attempts to retrieve them sort of work, but they aren't smart in any way.

I'm struggling to find easy examples that extract information out of the JCas objects.

I'm looking for a way to get, for instance, all Annotations produced by my PosTagger or by the SentenceSplitter for further use.

I guess

List<Feature> feats = it.next().getType().getFeatures();

is a start for getting the values, but because UIMA has its own classes for primitive types, even the source code of the toString() method in the Annotation class reads like a slap in the face.

Where can I find Java code that uses basic UIMA functionality, and where are good tutorials (other than the framework's own Javadoc)?

Solution

Generate JCas wrapper classes for your annotation types (you can do this using the type system editor UIMA plugin for Eclipse that comes with UIMA). This will provide you with Java classes that you can use to access the annotations - these offer getters and setters for features.
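For comparison, the values that toString() prints can also be read without generated classes, through the generic feature API. The sketch below is only an illustration under stated assumptions: uima-core is on the classpath, the class name FeatureDump and its helper methods are made up for this example, and it uses the built-in Annotation type with an empty type system instead of the OpenNLP pipeline from the question. With a generated wrapper class, the by-name lookup would be replaced by a typed getter for each feature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

// Hypothetical demo class; not part of UIMA or of the question's project.
public class FeatureDump {

    /** Collects every primitive-valued feature of one annotation as name -> string value. */
    static Map<String, String> primitiveFeatures(Annotation annotation) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Feature feature : annotation.getType().getFeatures()) {
            // Skip features whose value is another feature structure (e.g. the sofa reference).
            if (feature.getRange().isPrimitive()) {
                values.put(feature.getShortName(), annotation.getFeatureValueAsString(feature));
            }
        }
        return values;
    }

    /** Creates a JCas that only knows the built-in types (assumption: no custom type system). */
    static JCas newJCas(String text) throws Exception {
        TypeSystemDescription tsd =
                UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
        CAS cas = CasCreationUtils.createCas(tsd, null, null);
        JCas jcas = cas.getJCas();
        jcas.setDocumentText(text);
        return jcas;
    }

    public static void main(String[] args) throws Exception {
        JCas jcas = newJCas("This is my text.");
        // Stand-in for an annotation that a tokenizer would normally create.
        new Annotation(jcas, 0, 4).addToIndexes();

        for (Annotation a : jcas.getAnnotationIndex()) {
            System.out.println(a.getType().getName() + " \"" + a.getCoveredText() + "\" "
                    + primitiveFeatures(a));
        }
    }
}
```

In the pipeline from the question, the same primitiveFeatures helper could be called on each annotation returned by jcas.getAnnotationIndex(); a String feature such as pos would then appear in the map next to begin and end.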

You should have a look at uimaFIT, which provides a more convenient API including convenience methods to retrieve annotations from the JCas, e.g. select(jcas, Token.class) (where Token.class is one of the classes you generated with the type system editor).
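A minimal sketch of the select call, assuming uimafit-core (and with it uima-core) is on the classpath. The class name SelectDemo and the helper coveredTexts are invented for illustration, and the built-in Annotation class stands in for a generated type such as Token, so the example does not depend on any particular type system.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// Hypothetical demo class; not part of uimaFIT.
public class SelectDemo {

    /** Returns "begin-end: covered text" for every indexed annotation of the given type. */
    static <T extends Annotation> List<String> coveredTexts(JCas jcas, Class<T> type) {
        List<String> texts = new ArrayList<>();
        // select() returns all indexed annotations of the given type, including subtypes.
        for (T annotation : JCasUtil.select(jcas, type)) {
            texts.add(annotation.getBegin() + "-" + annotation.getEnd() + ": "
                    + annotation.getCoveredText());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // uimaFIT builds the JCas from whatever type systems it detects on the classpath.
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("This is my text.");

        // Stand-ins for annotations that pipeline components would normally add.
        new Annotation(jcas, 0, 4).addToIndexes();
        new Annotation(jcas, 5, 7).addToIndexes();

        coveredTexts(jcas, Annotation.class).forEach(System.out::println);
    }
}
```

Passing a generated class instead of Annotation.class narrows the result to that type, which is how the annotations of a single component (for example, only tokens or only sentences) can be retrieved.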

You can find some quick-start Groovy scripts and a collection of UIMA components on the DKPro Core page.

There is material from the UIMA@GSCL 2013 tutorial (slides and sample code) which might be useful for you. Go here and scroll down to "Tutorial".

Disclosure: I'm a developer on UIMA, uimaFIT, and DKPro Core, and a co-organizer of the UIMA@GSCL 2013 workshop.

Licensed under: CC-BY-SA with attribution
