How to do part-of-speech tagging of texts containing mathematical expressions?

https://stackoverflow.com/questions/15687440

30-03-2022

Question

The goal is syntactic parsing of scientific texts, and the first step is part-of-speech tagging of their sentences. The texts come from arxiv.org, so they are originally in LaTeX. When extracting text from LaTeX documents, math expressions can be converted into MathML (or maybe some other format, but I prefer MathML because this work is being done for a specific web app, and MathML is a convenient tool for that).

The only idea I have is to substitute mathematical expressions with natural-language phrases and then use an existing POS-tagging implementation. So the question is: how can these substitutions be implemented or, more generally, how can POS tagging be done on texts that contain mathematics?

Solution

I have implemented a formula substitution algorithm on top of the Stanford tagger, and it works quite well. The way to go is, as abecadel has written, to replace every formula with a new, unique word. I used a combination of a word and a hash, for example 'formula-duwkziah'.
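The substitution step might look roughly like the sketch below. It is not the code from this answer: the function names, the regular expression, and the way the hash suffix is built are all assumptions made for illustration. It assumes formulas appear either as MathML `<math>...</math>` elements or as inline LaTeX `$...$`, and it leaves the actual tagging to whatever tagger you use.

```python
import hashlib
import re

# Assumption: formulas are MathML <math>...</math> elements or inline LaTeX $...$.
FORMULA_RE = re.compile(r"<math\b.*?</math>|\$[^$]+\$", re.DOTALL)


def placeholder_for(formula):
    """Build a word-like token such as 'formula-abcdefgh' from a hash of the formula."""
    digest = hashlib.sha1(formula.encode("utf-8")).digest()
    # Map the first 8 digest bytes to lowercase letters so the token reads as a word.
    suffix = "".join(chr(ord("a") + b % 26) for b in digest[:8])
    return "formula-" + suffix


def substitute_formulas(text):
    """Replace each formula with its placeholder; return the new text and a token-to-formula map."""
    mapping = {}

    def repl(match):
        token = placeholder_for(match.group(0))
        mapping[token] = match.group(0)
        return token

    return FORMULA_RE.sub(repl, text), mapping


def restore_formulas(tagged, mapping):
    """Swap placeholders back for the original formulas in a list of (token, tag) pairs."""
    return [(mapping.get(token, token), tag) for token, tag in tagged]


if __name__ == "__main__":
    sample = "Let $x^2$ be positive and <math><mi>y</mi></math> be negative."
    clean, formulas = substitute_formulas(sample)
    print(clean)     # plain sentence with placeholder words, ready for a POS tagger
    print(formulas)  # placeholder -> original formula
```

Because the placeholder is derived from a hash, the same formula always maps to the same token, and the mapping lets you put the original formulas back once the tagger has returned its (token, tag) pairs. Whether a given tokenizer keeps a hyphenated placeholder as one token is worth checking for your tagger.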

OTHER TIPS

Replacing all of the mathematical formulae with a single, unique word seems to be the way to go.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow