Question

I am building a text extractor for XML using UIMA. As I am a total beginner to the UIMA framework, I want to know how to go about it.

I understand that UIMA can annotate specific parts of the file, but how do I extract the information efficiently? Any help is appreciated.

Thanks, Jatin

Was it helpful?

Solution

In the limited perspective of a developer of UIMA Ruta, I use the HtmlAnnotator of UIMA Ruta for these use cases. This is certainly not the most efficient approach. The analysis engine won't use separate types for the elements as it knows only the most common html tags, but I perform the conversion to a predefined type system in UIMA Ruta if needed. At the backend, the htmlparser is applied.

OTHER TIPS

Here is a collection reader to get you started with:

import static com.google.common.base.Preconditions.checkArgument;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.fit.descriptor.TypeCapability;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.Text;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;



@TypeCapability(outputs = "xxx")
public class XmlCollectionReader extends JCasCollectionReader_ImplBase {
    private static Logger LOG = LoggerFactory.getLogger(XmlCollectionReader.class);

    private SAXBuilder builder;
    private XMLOutputter xo;
    private XPathExpression<Object> sentenceXPath;

    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        try {
            File corpusDir = new File(inputDir);
            checkArgument(corpusDir.exists());
            fileIterator = DirectoryIterator.get(directoryIterator, corpusDir, "xml", false);
            builder = new SAXBuilder();
            xo = new XMLOutputter();
            xo.setFormat(Format.getRawFormat());
            sentenceXPath = XPathFactory.instance().compile("//S");
        } catch (Exception e) {
            throw new ResourceInitializationException(
                    ResourceInitializationException.NO_RESOURCE_FOR_PARAMETERS,
                    new Object[] { inputDir });
        }
    }

    public void getNext(JCas jcas) throws IOException, CollectionException {

        File file = fileIterator.next();
        try {
            LOG.debug("reading {}", file.getName());
            Document doc = builder.build(new FileInputStream(file));
            Element rootNode = doc.getRootElement();

            String title = xo.outputString(rootNode.getChild("Title").getContent());

            for (Object sentence : sentenceXPath.evaluate(rootNode)) {
                Element sentenceE = (Element) sentence;
                    ...
                }
            }

            jcas.setDocumentText(...);

        } catch (JDOMException e) {
            throw new CollectionException(e);
        }
    }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top