Question

I'm using google docs, and some templates we are using were created using MS-Office.
The resulting HTML is fat and ugly, and the 500KB per doc limitation on google makes some cleanup mandatory. I was able to find redundant "style" attributes and move them to some CSS class, and rename the most redundant classes names to shorter ones, which makes me save about 50% of the original size.
Are you aware of some existing tools/scripts/lib which could do this painful job for me, or at least help me to write this magic tool ?

Thanks in advance !

EDIT: I gave a try to both tidy, demoronizer and "manual rewrite":
- Input : 140Kb
- Tidy'ed : 110Kb
- Demoronized : 135Kb

So my favorite answer will be "rewrite it!"

Thanks !

Was it helpful?

Solution

MS-Office makes crappy HTML, period. You're better of spending time rebuilding the HTML from the original text than trying to walk through that minefield.

I made a few macros that do some search/replace functions on Word to do basic things like wrap <p> tags around paragraphs and stuff like that, then re-markup the whole thing from scratch.

OTHER TIPS

You could try tidy it will clean up many things.

Without commenting on its name, I could mention demoronizer, which the author describes as:

...a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications.

YMMV.

One of my favourite utilties now is actually Windows Live Writer - it does a neat job of stripping rubbish out of Word doc files. Some might disagree but I use it quite often!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top