Question

I am processing badly formed HTML pages and therefore need to do some cleaning up. The Tidy function at http://validator.w3.org/ produces exactly the output I want. However, I would like to clean up the HTML files as part of a larger Python script. I tried:

from tidylib import tidy_document
tidy, errors = tidy_document(html)

Although tidylib works fine, the output isn't quite as "beautiful" as the W3C's. I also found a library for the W3C Markup Validation Service, but I didn't find a method in it for getting cleaned-up HTML. My question is: what's the best way to clean up HTML from a Python script (calling an outside program or web service is fine), where "best" means matching the output produced by the W3C validator? Should I pass additional options to tidylib, is there a suitable method in the W3C Markup Validation Service library, or should I try something else? Pointers and code snippets much appreciated.


Solution

You can set Tidy options globally through tidylib.BASE_OPTIONS, or per call through the options argument of tidy_document.

PyTidy example

Tidy options quick ref
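As a rough illustration of passing options to tidylib, here is a minimal sketch. The option names come from the Tidy options quick reference, but the particular values are only a guess at something close to the W3C output, and PRETTY_OPTIONS, clean_html and set_default_options are hypothetical names, not part of tidylib. It assumes pytidylib and the underlying libtidy are installed.

```python
# Assumed option values; adjust them against the Tidy options quick reference.
PRETTY_OPTIONS = {
    "indent": "auto",     # indent block-level content where it helps readability
    "indent-spaces": 2,   # spaces per indentation level
    "wrap": 72,           # wrap lines at this column (0 disables wrapping)
    "tidy-mark": 0,       # do not add a "generator" meta tag
}


def clean_html(html, extra_options=None):
    """Return (cleaned_html, errors) using per-call Tidy options."""
    # Imported here so the options above can be inspected without libtidy.
    from tidylib import tidy_document

    options = dict(PRETTY_OPTIONS)
    if extra_options:
        options.update(extra_options)
    return tidy_document(html, options=options)


def set_default_options():
    """Alternative: change the defaults used by every later tidylib call."""
    import tidylib

    tidylib.BASE_OPTIONS.update(PRETTY_OPTIONS)
```

Calling clean_html("<p>some <b>broken html") should return the repaired document together with Tidy's error report. Per-call options leave the rest of a larger script unaffected, whereas updating BASE_OPTIONS changes the defaults for every later call.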

Licensed under: CC-BY-SA with attribution