Question

RUTA newbie here. I'm processing a document with RUTA and have a lot of normalization to do before I can start annotating. I'm looking for the best way to do a find-and-replace on a sequence of characters in the original document, using regular expressions and groups. In essence, I want something similar to String.replaceAll in RUTA.

For example, in Java,

inputString = inputString.replaceAll( "(?i)7\\s*\\(SEVEN\\)", "7");

But I can't figure out a simple way to achieve this in RUTA.

Thanks


Solution

It's not simple in general because you cannot change the document text in a CAS.

There is some functionality in UIMA Ruta to modify the document, but the result needs to be stored in another CAS view or in an additional file. A few general comments:

  • Simple regular expressions can be applied to match patterns like the one in your question: http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.regexprule
  • The REPLACE action records modifications; it does not change the document text itself.
  • The Modifier analysis engine performs the modifications and can store the changed document in an additional CAS view and in an additional HTML file (HTML because the Modifier can also add colored spans).
  • The ViewWriter analysis engine can copy one view to another, in case you want to work with "_initialView".
  • If your document contains annotations and these annotations should still be valid after the replacement, then different functionality is needed. The HTMLConverter has some parameters for replacements, but I do not know whether it solves your problem in this use case without more information or further testing.

Here's the script for the example in your question:

ENGINE utils.Modifier;
ENGINE utils.ViewWriter;
TYPESYSTEM utils.SourceDocumentInformation;
DECLARE ToReplace;

// just create an annotation
"(?i)7\\s*\\(SEVEN\\)" -> ToReplace;

// replace the text covered by all annotations with the string "7"
ToReplace{-> REPLACE("7")};
// ... the annotation should be removed again with UNMARK before different replacements are performed ...
// it is also possible to do this in a more generic way with features and variables

// ... either store the changed text in the "modified" view and in an additional html file
Document{-> CONFIGURE(Modifier, "outputLocation" = "D:/modified/"), EXEC(Modifier)};

// ... or store the changed text in the "modified" view and in an additional xmiCAS
Document{-> EXEC(Modifier), CONFIGURE(ViewWriter, "inputView" = "modified", "output" = "../modified/"), EXEC(ViewWriter)};

Just to mention: the Modifier has a small bug that results in doubled whitespace.

A more generic way to model the replacements could be:

DECLARE Annotation ToReplace(STRING r);
"(?i)(7)\\s*\\(SEVEN\\)" -> ToReplace ("r" = 1);
ToReplace{-> REPLACE(ToReplace.r)};

The ToReplace annotations now have an additional string feature that stores the value that should replace the covered text of the annotation. The regular expression has an additional capturing group, which supplies the string for the feature (the value is assigned by the number of the capturing group). The rule with REPLACE is now more generic, since the actual value does not need to be given in the action; the value of the feature is applied instead. The last rule can therefore be used for any replacements specified by other rules.
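For comparison with the Java call in the question, the same capturing-group idea can be sketched in plain Java. This is only an analogy to the generic Ruta rule, not part of the Ruta script; the class and method names (GroupReplaceSketch, normalize) are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: plain-Java analogue of the generic Ruta rule.
class GroupReplaceSketch {
    // Same pattern as in the Ruta rule: group 1 captures the text to keep.
    private static final Pattern SEVEN =
            Pattern.compile("(?i)(7)\\s*\\(SEVEN\\)");

    static String normalize(String input) {
        // "$1" inserts the text matched by capturing group 1,
        // so the replacement value comes from the match itself.
        return SEVEN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("within 7 (seven) days")); // within 7 days
    }
}
```

As in the Ruta version, the replacement value is taken from the capturing group rather than being hard-coded in the replacing step.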

Consecutive replacements that operate on the changed text generally need to be specified in a pipeline with sofa mappings, since later rules need to operate on different views. In the UIMA Ruta Workbench, one could define each find/replace in a separate script file and then use one launch configuration per script file. The launch configurations can specify the input and output folders. Combined with the ViewWriter, this lets the user build a chain of script files, each operating on the output folder of the previous one.

Consecutive replacements can also be done in one script file, but with some restrictions. The REPLACE action actually stores the new text in the replacement feature of each RutaBasic annotation. The first RutaBasic gets the complete new string and the other RutaBasic annotations are set to the empty string. When the Modifier creates the new text, the covered text of each RutaBasic annotation is replaced by the value of its feature, so the first token is replaced by the complete replacement string and the other tokens are deleted. Knowing this procedure, rules can depend on previous replacements and change the respective feature values. Overall, consecutive replacements are possible, but not straightforward.
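The procedure described above can be modelled with a small plain-Java sketch. It does not use the UIMA or Ruta API; the Basic class, its replacement field, and the replace/render methods are hypothetical stand-ins for RutaBasic annotations, the REPLACE action and the Modifier, and the tokenization is assumed for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only, not the actual Ruta implementation.
class ReplacementFeatureSketch {
    // Stand-in for a RutaBasic: covered text plus an optional replacement
    // value (null means "keep the covered text").
    static final class Basic {
        final String coveredText;
        String replacement;
        Basic(String coveredText) { this.coveredText = coveredText; }
    }

    static List<Basic> basics(String... texts) {
        List<Basic> result = new ArrayList<>();
        for (String t : texts) result.add(new Basic(t));
        return result;
    }

    // Mimics REPLACE on the basics in [from, to): the first one gets the
    // complete new string, the others are set to the empty string.
    static void replace(List<Basic> basics, int from, int to, String value) {
        for (int i = from; i < to; i++) {
            basics.get(i).replacement = (i == from) ? value : "";
        }
    }

    // Mimics the Modifier: use the replacement value where one is set,
    // otherwise the original covered text.
    static String render(List<Basic> basics) {
        StringBuilder sb = new StringBuilder();
        for (Basic b : basics) {
            sb.append(b.replacement != null ? b.replacement : b.coveredText);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Basic> bs = basics("within", " ", "7", " ", "(", "SEVEN", ")", " ", "days");
        replace(bs, 2, 7, "7");           // covers "7 (SEVEN)"
        System.out.println(render(bs));   // within 7 days
        bs.get(2).replacement = "seven";  // a later step changes the stored value
        System.out.println(render(bs));   // within seven days
    }
}
```

The second change in main corresponds to a later rule that depends on an earlier replacement: it has to touch the stored feature value, because the covered text itself never changes.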

Licensed under: CC-BY-SA with attribution