How to get the name of the document the pipeline is currently working on?

StackOverflow https://stackoverflow.com/questions/16512518

  •  21-04-2022

Question

Let's say a corpus has 1,000 documents and is processed by a pipeline.
At some point, the pipeline gets stuck, throws an exception, or behaves strangely. Such problems are very likely tied to a specific document.
So it would be nice to know which document is currently being processed in the pipeline, for example by printing the document name from a JAPE transducer.


Solution

To print the name of the document being processed, you can write a simple JAPE rule like:

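Phase: DocName
Input: Token
// "once" stops the phase after the first match, so the name is printed once per document
Options: control = once

Rule: DocName
(
 {Token}
)
-->
{
  // "doc" is the document currently being processed
  System.out.println(doc.getName());
}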

Put this grammar as early as possible in your pipeline. Note that it matches on Token annotations, so it must run after the tokeniser, and it only fires if the document contains at least one Token.
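
If the pipeline is driven from your own Java code rather than from JAPE alone, the same idea can be sketched as a loop that logs each document name before processing it and records which documents fail. This is only an illustration under that assumption: `DocTracker`, `run`, and the `step` callback are hypothetical names and are not part of GATE; in a real setup the callback would be whatever code runs your pipeline on one document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of GATE: shows the "log the name first,
// then process" pattern on plain document names.
public class DocTracker {

    // Runs `step` on each document name and returns the names whose
    // processing threw a RuntimeException.
    public static List<String> run(List<String> docNames, Consumer<String> step) {
        List<String> failed = new ArrayList<>();
        for (String name : docNames) {
            // Printed before the step, so a hang can also be traced to a document.
            System.out.println("Processing: " + name);
            try {
                step.accept(name);
            } catch (RuntimeException e) {
                System.err.println("Failed on " + name + ": " + e);
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Simulated pipeline step that fails on one document.
        List<String> failed = run(List.of("doc1.txt", "bad.txt", "doc3.txt"), name -> {
            if (name.startsWith("bad")) {
                throw new IllegalStateException("simulated failure");
            }
        });
        System.out.println("Failed documents: " + failed);
    }
}
```

Logging before the step rather than after it is the point of the design: if processing never returns, the last printed name is the document to inspect.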

License: CC-BY-SA with attribution