How do I extract significant text content from a LaTeX document

https://stackoverflow.com/questions/4837177

27-10-2019

Question

I need to extract text-only content from my thesis document written in LaTeX for an automated anti-plagiarism check. I know only about the "draft" option and it's not enough.

I am supposed to omit:

images,
tables and other figures,
equations,
captions and footnotes.

It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.

Is there any straightforward way to do this? I don't really fancy copying it manually page-by-page.

Solution

You could try using the comment package (or one of a dozen alternatives) to turn equation, figure, table, etc. into comment environments, and \renewcommand\footnote[1]{} to remove footnotes. \pagestyle{empty} should remove page headings and the like, so running pdftotext on the resulting PDF should come close to what you want.
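As a rough sketch of that approach, the preamble additions might look like the following. The list of environments is an assumption (a real thesis may also use align, figure*, longtable and so on), and this is untested against any particular document class, so treat it as a starting point rather than a finished recipe.

```latex
% Preamble additions for a text-only build (sketch, not a tested recipe).
\usepackage{comment}

% Turn these environments into comment environments so their bodies are skipped.
% Assumption: the thesis only uses these three; add any others it relies on.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}

% Discard footnote text. This simple redefinition does not handle
% the optional-argument form \footnote[n]{...}.
\renewcommand\footnote[1]{}

% No running heads or page numbers in the output.
\pagestyle{empty}
```

Captions that live inside figure and table environments disappear along with them. The comment package expects the \begin{...} and \end{...} lines of a comment environment to stand on lines of their own, and inline math ($...$) is not affected by any of this. After compiling, something like pdftotext -enc UTF-8 thesis.pdf thesis.txt (file names are placeholders) would produce the plain text file.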

OTHER TIPS

Yes: untex, a small program written in C. You can also look at detex.

You could use a document converter like pandoc, or convert the output PDF to plain text with something like Calibre.

Usually you want some LaTeX processing done on the text. Say you have

\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}

...

I spend a lot of time on \SO, blah blah ....

Simply filtering the paragraph text out of the source will not give the intended result when the text contains macros: here \SO would be left unexpanded or dropped instead of becoming "StackOverflow".

Extracting text directly from the *.tex file therefore usually leaves a lot to be desired, and it is typically better to work on the output of LaTeX processing. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual clean-up, but the result should be relatively close.

detex has already been mentioned, but there is another project that aims to improve on it, called opendetex. Give it a look!

Licensed under: CC-BY-SA with attribution
