Convert style-laden HTML tables to PDF, in .NET 1.1

https://stackoverflow.com/questions/375120

22-08-2019

Question

I have colleagues working on a .NET 1.1 project, where they obtain XML files from an external party and programmatically instruct iTextSharp to generate PDF content based on the XML data.

The tricky part is that the XML contains segments of arbitrary HTML, which users copied and pasted from their Office applications. The HTML still looks fine in a web browser, but when it is fed into iTextSharp's HTMLWorker object to be parsed and converted into PDF objects, the formatting and alignment end up all over the place in the generated PDF document. For example:

<span id="mceBoundaryType" class="portrait"></span>
<table border="0" cellspacing="0" cellpadding="0" width="636" class="MsoNormalTable"
    style="margin: auto auto auto 4.65pt; width: 477pt; border-collapse: collapse">
    <tbody>
        <tr style="height: 15.75pt">
            <td width="468" valign="bottom" style="padding-right: 5.4pt; padding-left: 5.4pt;
                padding-bottom: 0in; width: 351pt; padding-top: 0in; height: 15.75pt; background-color: transparent;
                border: #ece9d8">
                <p style="margin: 0in 0in 0pt" class="MsoNormal">
                    <font face="Times New Roman">&nbsp;</font></p>
            </td>
            <td colspan="3" width="168" valign="bottom" style="padding-right: 5.4pt; padding-left: 5.4pt;
                padding-bottom: 0in; width: 1.75in; padding-top: 0in; height: 15.75pt; background-color: transparent;
                border: #ece9d8">
                <p style="margin: 0in 0in 0pt; text-align: center" class="MsoNormal" align="center">
                    <u><font face="Times New Roman">Group</font></u></p>
            </td>
        </tr>

The tags are full of style attributes, and iTextSharp does not support CSS or interpret that attribute. What alternatives have other iTextSharp users tried to work around this, and are there other feasible HTML-to-PDF components?

Solution

I have found that .NET 2.0-based components such as ExpertPDF and ABCpdf do a fairly good job of interpreting the CSS styles and aligning the tables properly in the PDF. For now I am suggesting that my colleagues use a separate .NET 2.0 web service that hosts such a component. The ASP.NET 1.1 web application would tell that service to go ahead and scrape a generated web page that is essentially the report in HTML view.

UPDATE:

I am marking this as the answer because it is the approach that was recommended to the application team.

OTHER TIPS

I don't have any solid answers, but I'll give you two directions to explore, both of which I have used before.

1 - Use something like HtmlAgilityPack to cleanse your HTML: you can traverse the DOM and remove styles and classes, although doing so may disturb the layout to some degree. It is not clear to me whether you need to retain this styling. You could then use iTextSharp, or an alternative program such as HtmlDoc (which also does not support CSS), to render the PDF. We wrote a simple wrapper with a method that takes a URL and then calls HtmlDoc to generate the PDF.

2 - Render the HTML server-side using a WebBrowser control, generate an image from that, then convert the image to PDF using PDFsharp or the library of your choice. The resulting PDFs will not contain text that you can search or copy. There is some pretty good sample code here for converting the rendered page to an image (note: you can get full-height images, not just what you can see without scrolling).

Edit: I don't think the WebBrowser control is available in .NET 1.1.
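
To make the first direction (cleansing the HTML before handing it to the PDF renderer) more concrete, here is a rough, language-neutral sketch in Python using only the standard library. It is not HtmlAgilityPack code and not what we actually ran; the names `STRIP_ATTRS`, `AttributeStripper` and `strip_styling` are made up for illustration. It walks the markup, drops the `style` and `class` attributes, and leaves legacy attributes such as `width`, `valign` and `align` in place. Whether the renderer then lays the table out acceptably is something you would have to test.

```python
from html import escape
from html.parser import HTMLParser

# Attributes to drop. This set is an assumption for the sketch; adjust as needed.
STRIP_ATTRS = {"style", "class"}


class AttributeStripper(HTMLParser):
    """Re-emits the HTML it is fed, minus the attributes in STRIP_ATTRS."""

    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if name in STRIP_ATTRS:
                continue
            if value is None:
                parts.append(name)  # bare attribute, e.g. "nowrap"
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def strip_styling(html_text):
    """Return html_text with style/class attributes removed from every tag."""
    parser = AttributeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    # Fragment modelled on the Office-generated markup in the question.
    sample = (
        '<td colspan="3" width="168" valign="bottom" '
        'style="padding-right: 5.4pt; width: 1.75in">'
        '<p style="margin: 0in 0in 0pt; text-align: center" '
        'class="MsoNormal" align="center"><u>Group</u></p></td>'
    )
    print(strip_styling(sample))
```

With HtmlAgilityPack the idea is the same: load the fragment, visit each element, and remove the unwanted attributes before passing the serialized result to iTextSharp or HtmlDoc.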

Licensed under: CC-BY-SA with attribution
