How do I extract significant text content from a LaTeX document
27-10-2019
Question
I need to extract text-only content from my thesis document written in LaTeX for an automated anti-plagiarism check. I know only about the "draft" option and it's not enough.
I am supposed to omit:
- images,
- tables and other figures,
- equations,
- captions and footnotes.
It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.
Is there any straightforward way to do this? I don't really fancy copying it manually page-by-page.
Solution
You could try using the comment package (or one of a dozen alternatives) to turn equation, figure, table, etc. into comment environments, and \renewcommand\footnote[1]{} to remove footnotes. \pagestyle{empty} should remove page headers and the like, so running pdftotext on the resulting PDF should come close to what you want.
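As a rough illustration of the idea above, here is a minimal sketch of such a document. The body text is invented for the example, and it assumes that each \begin{...} and \end{...} of an excluded environment sits on its own line, which the comment package generally expects. Other display-math environments (align, starred variants, and so on) would each need their own \excludecomment line, and inline math is not affected at all.

```latex
\documentclass{article}
\usepackage{comment}

% Turn these environments into comment environments, so everything
% between \begin{...} and \end{...} is skipped (captions included,
% since they live inside figure and table).
% Assumption: these lines come late in the preamble, after any package
% that (re)defines the environments, e.g. amsmath for equation.
\excludecomment{equation}
\excludecomment{figure}
\excludecomment{table}

% Drop footnotes. This simple form, taken from the answer, does not
% handle the optional argument of \footnote[...]{...}.
\renewcommand\footnote[1]{}

% No headers, footers, or page numbers in the output.
\pagestyle{empty}

\begin{document}
Some body text.\footnote{This footnote disappears.}
\begin{equation}
E = mc^2
\end{equation}
More body text.
\end{document}
```

After compiling this to PDF, something like pdftotext thesis.pdf thesis.txt (file names are placeholders) should yield the remaining text. Expect to check the result by hand, since hyphenation and line breaks from the PDF will carry over.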
Other tips
Usually you want some LaTeX processing done on the text. Say you have:
\newcommand*{\SO}{StackOverflow\index{StackOverflow}\xspace}
...
I spend a lot of time on \SO, blah blah ....
Just filtering out the text paragraphs here will not give the intended result when they contain macros, because a macro such as \SO is never expanded to "StackOverflow".
Extracting things directly from the *.tex file therefore usually leaves much to be desired, and it is typically better to work on the output of LaTeX processing. I would recommend converting LaTeX to HTML and then HTML to text. You will probably need some manual clean-up, but the result should be relatively close.
detex has been mentioned, but there is another project that aims to improve on it. It is called opendetex; give it a look!