Question

I need a plain text representation of an arbitrary HTML file (e.g., a blog post). That part is not a problem; there are dozens of HTML-to-text converters. However, the text in paragraphs (i.e., p elements) should be justified in the plain text output to a given number of columns and, if possible, hyphenated for better readability. The resulting text file must also be UTF-8 or UTF-16.

Simple plain text conversion I can do with XSLT; that is close to trivial. Justifying the text, however, is beyond what XSLT can practically do (not strictly true, since XSLT is Turing complete, but close enough to reality).

FOP and XSL-FO don't work either. They do what is requested, but FOP's plain text output is horrible (the developers say that it is not intended for such usage).

I also experimented with HTML -> XSLT -> Roff, but I'm stuck with groff, and its Unicode support is far from optimal. Since the text contains characters such as ellipses ("...") and typographically correct quotation marks, it is quite cumbersome to spell out groff escape sequences for dozens of Unicode characters in the XSLT stylesheet.

Another option could be converting to TeX and outputting plain text, but I have never tried this with (La)TeX.

Perhaps I have missed something really simple. Does anyone have an idea how I could achieve the above? By the way, a solution should preferably work without root rights to install, using PHP, Python, Perl, XSLT or any program found in a half-decent Linux distro.

Solution

Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text.
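
As a rough sketch of that pipeline: the example below pulls the text out of every p element and wraps it with textwrap. To keep it free of third-party dependencies it uses the standard library's html.parser as a stand-in for BeautifulSoup, which is an assumption of this sketch and not what the answer recommends. The names ParagraphCollector and html_to_wrapped_text are hypothetical.

```python
import textwrap
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &hellip; into characters.
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._chunks = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> ends where the next one starts.
            self.flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def flush_paragraph(self):
        # Collapse all runs of whitespace into single spaces.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []
        self._in_p = False


def html_to_wrapped_text(html, width=72):
    """Return the <p> contents of `html`, wrapped to `width` columns."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # in case the last <p> was never closed
    return "\n\n".join(textwrap.fill(p, width=width) for p in parser.paragraphs)
```

Writing the result with open(path, "w", encoding="utf-8") would cover the UTF-8 requirement from the question. Real-world HTML is messier than this sketch assumes, which is where a more forgiving parser such as BeautifulSoup helps.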

There are two features missing, though. To justify the text, you'll need to add spaces to each line, but that shouldn't be a big issue (see this code example).
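
One way the space-padding step could look is sketched below. It is not the linked code example, only a minimal illustration: textwrap.wrap breaks the paragraph into lines, and each line except the last gets extra spaces distributed across its gaps, leftmost gaps first. The function names justify_line and justify_paragraph are made up for this sketch.

```python
import textwrap


def justify_line(line, width):
    """Pad the gaps between words so that `line` is exactly `width` wide."""
    words = line.split()
    gaps = len(words) - 1
    if gaps <= 0:
        return line  # a single word cannot be justified
    spaces = width - sum(len(word) for word in words)
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The first `extra` gaps receive one additional space.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


def justify_paragraph(text, width=72):
    """Wrap `text` to `width` columns and justify all but the last line."""
    lines = textwrap.wrap(text, width=width)
    justified = [justify_line(line, width) for line in lines[:-1]]
    return "\n".join(justified + lines[-1:])
```

The last line of a paragraph is left ragged, as is conventional for justified text. Always padding the leftmost gaps first is the simplest choice; alternating sides from line to line would spread the extra whitespace more evenly.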

For hyphenation, try this project.

OTHER TIPS

If you are familiar with Emacs, you can open the HTML file in Emacs-W3M (i.e. M-x w3m-find-file foo.html), save the rendered page as a plain text file, and then call M-x set-justification-full on it.

You can even write a small function to do the job:

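(require 'w3m) ; emacs-w3m provides `w3m-rendering-buffer'

(defun my-html-to-justified-text (html-file text-file)
  "Render HTML-FILE with emacs-w3m and save it, fully justified, as TEXT-FILE."
  (find-file html-file)
  ;; Render the buffer's HTML through w3m.
  (w3m-rendering-buffer)
  ;; Justify the whole buffer to `fill-column'.
  (set-justification-full (point-min) (point-max))
  (write-file text-file))

(my-html-to-justified-text "~/tmp/2.html" "~/tmp/2.txt")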

Links or lynx might be worth a try; see the -dump switch. You can then easily handle the encoding separately using iconv or something similar.

Licensed under: CC-BY-SA with attribution