The goal is syntactic parsing of scientific texts, and the first step is part-of-speech tagging of their sentences. The texts come from arxiv.org, so they are originally in LaTeX. When extracting text from LaTeX documents, math expressions can be converted into MathML (or some other format, but I prefer MathML because this work is for a specific web app, and MathML is convenient for that).

The only idea I have is to substitute mathematical expressions with natural-language phrases and then run an existing POS-tagging implementation. So the question is how to implement these substitutions or, more generally, how to POS-tag text that contains mathematics.

Solution

I have implemented a formula substitution algorithm on top of the Stanford tagger, and it works quite nicely. The way to go is, as abecadel has written, to replace every formula with a unique new word; I used a combination of a word and a hash, e.g. 'formula-duwkziah'.
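
The substitution step could look roughly like the sketch below. It is not the code I actually used, only a minimal illustration under some assumptions: formulas appear inline as `<math>...</math>` elements, the placeholder is the word "formula-" plus eight letters derived from a SHA-1 hash of the formula, and the tagger's tokenizer keeps hyphenated words as one token. The function names (`placeholder`, `substitute_formulas`, `restore_formulas`) are made up for this example.

```python
import hashlib
import re

# Assumption: every formula is an inline MathML <math>...</math> element.
MATH_RE = re.compile(r"<math\b.*?</math>", re.DOTALL)


def placeholder(formula, length=8):
    """Build a word like 'formula-abcdefgh' from a hash of the formula."""
    digest = hashlib.sha1(formula.encode("utf-8")).digest()
    # Map the first `length` hash bytes to lowercase letters.
    letters = "".join(chr(ord("a") + b % 26) for b in digest[:length])
    return "formula-" + letters


def substitute_formulas(text):
    """Replace each formula with its placeholder; return text and mapping."""
    mapping = {}

    def repl(match):
        token = placeholder(match.group(0))
        mapping[token] = match.group(0)
        return token

    return MATH_RE.sub(repl, text), mapping


def restore_formulas(tagged, mapping):
    """Put the original formulas back into a list of (word, tag) pairs."""
    return [(mapping.get(word, word), tag) for word, tag in tagged]
```

The text returned by `substitute_formulas` can be passed to any POS tagger, and `restore_formulas` maps the placeholders in the tagged output back to the original MathML. Because the placeholder is derived from the formula itself, identical formulas get the same word; hash collisions are ignored in this sketch.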

Other tips

Replacing each mathematical formula with a single, unique word seems to be the way to go.

Licensed under: CC-BY-SA with attribution