Question

Ugh. Word is notorious for its bloated, convoluted, non-standards-compliant, non-semantic HTML. Unfortunately, I have a professor who is requiring us to generate an outline to very exacting standards. I'd rather not hand-write it, so I decided to make something that would be useful for my classmates as well. I created the outline using a simple numbered list in NeoOffice on my Mac, exported it as HTML, and wrote quite a bit of CSS to style it. Then I got someone to create an ordered list in Word for Windows, export it as HTML, and send it to me to check compatibility. After scrolling miles down the page, trying to repress a shudder, I saw a problem. Word did not use <ol> and <li>. It used mountains of nested <span>s with classes out the wazoo. I hate to see all my work go to waste, but this content is impossible to work with: I'd have to style on a document-to-document basis, rather than with a universal stylesheet.

Ideally, Word would generate HTML using standard tags so that I could style it just like any other list, but this doesn't seem to be the case. How can I make it generate lists that actually use <ul> and <li> rather than <span>, or at least modify something in my code to work with the weird way it does create lists?


Solution 8

After some research, it appears that converting the document to HTML isn't practical. Word is simply too variable in how it saves files and generates HTML, even for a single document, not to mention the differences among versions of Word. Similar to Wyatt's suggestion, there may be ways to clean up the code, but none of them are perfect. Digging around the API may provide a way to parse this more easily, but in practice that may turn out to be just as convoluted. Using Word as a list-generation tool is simply unrealistic.

OTHER TIPS

The guys who wrote Winword and its HTML generation are smart guys. If it was easy to use HTML features in a purist way they would have done so.

Word is about creating paper-optimised layouts. It supports concepts such as tab-stops and multi-level numbering that HTML doesn't support, or is only just starting to. As a result, the HTML version of a Word document is not 'nice' HTML, but an attempt to retain the features of the Word document accurately.

When Word re-opens an HTML file it has saved, it does some clever reverse-engineering on the document, so that it renders in Word looking pretty much like it started. Equally, if you insert the HTML as a snippet into a web page, retaining Word's CSS, the results are pretty faithful. In this case there is a culture clash between the underlying CSS of the web page and Word's CSS, and some effort is required to make the best of a bad job. The Word HTML doesn't use UTF-8 either, which needs some handling.
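As a rough illustration of the encoding handling mentioned above, here is a minimal Python sketch. It assumes the saved file declares its encoding in a `charset=` meta attribute and falls back to windows-1252 when it does not. Both the fallback and the function name `word_html_to_utf8` are assumptions made for this example, not something Word guarantees.

```python
import re

# Matches a charset declaration such as: charset=windows-1252
CHARSET = re.compile(r'(charset\s*=\s*["\']?)([\w-]+)', re.IGNORECASE)

def word_html_to_utf8(raw: bytes, fallback: str = "windows-1252") -> bytes:
    """Re-encode Word-exported HTML bytes as UTF-8 (hypothetical helper)."""
    # The declaration itself is ASCII, so a lossy ASCII view of the head is enough.
    head = raw[:4096].decode("ascii", errors="ignore")
    match = CHARSET.search(head)
    declared = match.group(2) if match else fallback
    try:
        text = raw.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: decode with the fallback, never crash.
        text = raw.decode(fallback, errors="replace")
    # Update the first declaration so it matches the new encoding.
    text = CHARSET.sub(r"\g<1>utf-8", text, count=1)
    return text.encode("utf-8")
```

Typical use would be to read the exported file in binary mode, pass the bytes through this function, and write the result back out before merging it into a UTF-8 page.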

HTMLTidy can be used to rip out Word mark-up, but some more massaging is required after this for good rendering within a webpage. I have worked on a product for 15 years which does this mixing of Word and web pages, and the results can be quite good if you fine tune the CSS.
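One way to script the HTML Tidy step is sketched below. It assumes the `tidy` command-line tool is installed and on the PATH, and it uses Tidy's `word-2000` and `bare` options, which target Word's export markup. The function names and file paths are placeholders for this example.

```python
import subprocess

def tidy_command(src: str, dst: str) -> list[str]:
    """Build an HTML Tidy command line for cleaning Word-exported HTML."""
    return [
        "tidy",
        "-q",                   # suppress non-essential output
        "-utf8",                # read and write UTF-8
        "--word-2000", "yes",   # strip surplus Word-specific markup
        "--bare", "yes",        # strip Microsoft-specific HTML from Word documents
        "-o", dst,              # write the cleaned file here
        src,
    ]

def run_tidy(src: str, dst: str) -> None:
    """Run Tidy; it exits with 1 on warnings and 2 on errors."""
    result = subprocess.run(tidy_command(src, dst), check=False)
    if result.returncode >= 2:
        raise RuntimeError(f"tidy reported errors for {src}")
```

Because `-utf8` makes Tidy treat its input as UTF-8, this sketch assumes the file has already been re-encoded. The output still tends to need CSS fine-tuning afterwards, as described above.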

We used Word because we are creating paper-versions, and importing text from reports written in Word, not because we couldn't find a dedicated HTML editor.

I would not recommend using Word to create tidy purist HTML. You wouldn't use a can-opener to open a bottle of wine, would you?

Life would be much simpler if: a) Microsoft re-engineered the myriad options in its highly confusing 'Bullets and Numbering' feature, b) HTML provided native, properly featured multi-level numbering support, instead of the afterthought approaches currently available. The weakness of HTML in this area can be seen in the flimsy numbering options available in Google Docs.

So much has improved with HTML 5, maybe we can hope that HTML 6 will help bridge the word processor / HTML editor divide.

Use this resource http://word2cleanhtml.com/ to convert Word documents to clean HTML. Very useful, in my opinion.

If you can get your hands on a Windows PC, use Notepad++ (http://notepad-plus-plus.org/) to paste the code, and then select the plugin to format the code.

Use a WYSIWYG editor as the list generator. This would remove the need for the users to deal with raw CSS, at the cost of taking them out of the comfort zone of Microsoft Word.

Creative use of Word's Find and Replace might also work. For example, open the HTML file with Notepad, then copy and paste the text back into a Word document. Open Find and Replace. Suppose the HTML looks like this, with "This is the first line of text" being the first list item:

<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span...(Cut due to brevity)...
-height:115%'>This is the first line of text<o:p></o:p></span></p>

Then, with "Use wildcards" turned on, search for \<p*line-height:115%'\ and replace it with nothing. It may take a series of find-and-replace passes. The HTML markup is copious but, all else being equal, at least it is consistent.
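The same idea can be scripted instead of done by hand. The Python sketch below is only a rough illustration. It assumes every list item is a `<p>` whose style contains `mso-list:lN levelM lfoK`, as in the sample above, and that the generated number or bullet sits inside a `<![if !supportLists]>...<![endif]>` block. The function names are made up for this example. Regular expressions are not a real HTML parser, so expect to adjust the patterns for each Word version.

```python
import html
import re

# A list paragraph: <p ... style='...mso-list:l0 level1 lfo1...'> body </p>
LIST_PARA = re.compile(
    r"<p\b[^>]*mso-list:\s*l\d+\s+level(\d+)\s+lfo\d+[^>]*>(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)
# Word's hard-coded number or bullet, which real <ol>/<li> tags make redundant.
MARKER = re.compile(r"<!\[if !supportLists\]>.*?<!\[endif\]>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def extract_items(word_html):
    """Return (level, text) pairs for each Word list paragraph."""
    items = []
    for level, body in LIST_PARA.findall(word_html):
        body = MARKER.sub("", body)               # drop the generated marker
        text = html.unescape(TAG.sub("", body))   # drop spans, <o:p>, etc.
        items.append((int(level), " ".join(text.split())))
    return items

def to_ordered_list(items):
    """Build nested <ol>/<li> markup from (level, text) pairs."""
    out, depth = [], 0
    for level, text in items:
        if level > depth:
            while depth < level:
                out.append("<ol>")
                depth += 1
                if depth < level:      # a skipped level needs a wrapper item
                    out.append("<li>")
        else:
            while depth > level:
                out.append("</li></ol>")
                depth -= 1
            out.append("</li>")        # close the previous sibling item
        out.append("<li>" + html.escape(text))
    while depth > 0:
        out.append("</li></ol>")
        depth -= 1
    return "".join(out)
```

Calling `to_ordered_list(extract_items(word_html))` yields plain nested lists that a single shared stylesheet can target. Inline formatting such as bold or italics is discarded by this sketch.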

If you've got Dreamweaver handy, there is a magic "clean up Word HTML" button that does wonders in this scenario.

MSWord is only as smart as the author: an ordered list is converted into an HTML ordered list only if it was created as one in MSWord. This means the list must be built with MSWord's own list features, not merely made to look like a list on the page. Many people create lists that "appear" to be ordered or unordered using tabs and other formatting rather than MSWord's list functions. Saving to HTML tries to preserve the document as it was written, not as it was displayed.

You can link an external stylesheet to an HTML document in Word under the Developer tab -> Document Template -> Linked CSS. You can then use this to override almost any style generated by Word.

Credit: https://superuser.com/questions/65107/how-to-apply-external-css-stylesheet-to-document-in-microsoft-word/65144#65144

Note: I did this using Word 2013, but it is not a new feature.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow