How do I extract significant text content from a LaTeX document?
27-10-2019
Question
I need to extract text-only content from my thesis document written in LaTeX for an automated anti-plagiarism check. I know only about the "draft" option and it's not enough.
I am supposed to omit:
- images,
- tables and other figures,
- equations,
- captions and footnotes.
It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.
Is there any straightforward way to do this? I don't really fancy copying it manually page-by-page.
Solution
You could try the comment package (or one of its many alternatives) to turn equation, figure, table, etc. into comment environments, and \renewcommand\footnote[1]{} to remove footnotes. \pagestyle{empty} should remove page headers and the like, so running pdftotext on the resulting PDF should come close to what you want.
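A minimal sketch of that setup, assuming the thesis uses the standard figure, table and equation environments. The body text is made up for illustration, and other environments (align, tabular outside a float, and so on) would need their own \excludecomment lines.

```latex
\documentclass{article}
\usepackage{comment}

% Load these lines AFTER any package that (re)defines the environments,
% e.g. amsmath, otherwise the redefinition may be overwritten.
% Each environment becomes a comment: its body is skipped entirely.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}

% Discard footnote text.
% Assumption: no \footnote[num]{...} with an optional argument is used.
\renewcommand\footnote[1]{}

% No running heads, footers, or page numbers.
\pagestyle{empty}

\begin{document}
Body text that should survive.\footnote{This footnote is dropped.}
\begin{equation}
E = mc^2
\end{equation}
More body text that should survive.
\end{document}
```

The comment package scans line by line, so the \end{...} line of each excluded environment should sit on a line of its own. After compiling, something like pdftotext -enc UTF-8 thesis.pdf thesis.txt (thesis.pdf being a placeholder name) should produce the plain-text file.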
OTHER TIPS
Usually you want some LaTeX processing done on the text. Say you have
\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}
...
I spend a lot of time on \SO, blah blah ....
Just filtering out the text paragraph here will not give the intended result, because the paragraph contains a macro that has to be expanded first.
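To make the point concrete, here is the same snippet as a small compilable file. It is only an illustration; the document class and the second comment describing a source filter are assumptions, not part of the original example.

```latex
\documentclass{article}
\usepackage{xspace}

% \index is a no-op here because \makeindex is not issued.
\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}

\begin{document}
I spend a lot of time on \SO, blah blah.
% Typeset result: "I spend a lot of time on StackOverflow, blah blah."
% A filter working on the .tex source only ever sees the token \SO.
\end{document}
```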
Therefore, extracting text directly from the *.tex file usually leaves much to be desired. It is typically better to work on the output of LaTeX processing. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual clean-up, but the result should be relatively close.
detex has been mentioned, but there is another project aimed at improving it, called opendetex. Give it a look!