How do I convert a .docx to html using asp.net?

https://stackoverflow.com/questions/55113

09-06-2019
|

Question

Word 2007 saves its documents in .docx format which is really a zip file with a bunch of stuff in it including an xml file with the document.

I want to be able to take a .docx file and drop it into a folder in my asp.net web app and have the code open the .docx file and render the (xml part of the) document as a web page.

I've been searching the web for more information on this but so far haven't found much. My questions are:

Would you (a) use XSLT to transform the XML to HTML, or (b) use xml manipulation libraries in .net (such as XDocument and XElement in 3.5) to convert to HTML or (c) other?
Do you know of any open source libraries/projects that have done this that I could use as a starting point?

Thanks!

Solution

Try this post? I don't know but might be what you are looking for.

OTHER TIPS

I wrote mammoth.js, which is a JavaScript library that converts docx files to HTML. If you want to do the rendering server-side in .NET, there is also a .NET version of Mammoth available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.

Word 2007 has an API that you can use to convert to HTML. Here's a post that talks about it http://msdn.microsoft.com/en-us/magazine/cc163526.aspx. You can find documentation around the API, but I remember that there is a convert to HTML function in the API.

This code will helps to convert .docx file to text

function read_file_docx($filename){

    $striped_content = '';
    $content = '';

    if(!$filename || !file_exists($filename)) { echo "sucess";}else{ echo "not sucess";}

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);     

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
     //header("Content-Type: plain/text");


    $striped_content = strip_tags($content);


      $striped_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$striped_content);

    echo nl2br($striped_content); 
}

I'm using Interop. It is somewhat problamatic but works fine in most of the case.

using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

This one returns the list of html converted documents' path

public List<string> GetHelpDocuments()
    {

        List<string> lstHtmlDocuments = new List<string>();
        foreach (string _sourceFilePath in Directory.GetFiles(""))
        {
            string[] validextentions = { ".doc", ".docx" };
            if (validextentions.Contains(System.IO.Path.GetExtension(_sourceFilePath)))
            {
                sourceFilePath = _sourceFilePath;
                destinationFilePath = _sourceFilePath.Replace(System.IO.Path.GetExtension(_sourceFilePath), ".html");
                if (System.IO.File.Exists(sourceFilePath))
                {
                    //checking if the HTML format of the file already exists. if it does then is it the latest one?
                    if (System.IO.File.Exists(destinationFilePath))
                    {
                        if (System.IO.File.GetCreationTime(destinationFilePath) != System.IO.File.GetCreationTime(sourceFilePath))
                        {
                            System.IO.File.Delete(destinationFilePath);
                            ConvertToHTML();
                        }
                    }
                    else
                    {
                        ConvertToHTML();
                    }

                    lstHtmlDocuments.Add(destinationFilePath);
                }
            }


        }
        return lstHtmlDocuments;
    }

And this one to convert doc to html.

private void ConvertToHtml()
    {
        IsError = false;
        if (System.IO.File.Exists(sourceFilePath))
        {
            Microsoft.Office.Interop.Word.Application docApp = null;
            string strExtension = System.IO.Path.GetExtension(sourceFilePath);
            try
            {
                docApp = new Microsoft.Office.Interop.Word.Application();
                docApp.Visible = true;

                docApp.DisplayAlerts = WdAlertLevel.wdAlertsNone;
                object fileFormat = WdSaveFormat.wdFormatHTML;
                docApp.Application.Visible = true;
                var doc = docApp.Documents.Open(sourceFilePath);
                doc.SaveAs2(destinationFilePath, fileFormat);
            }
            catch
            {
                IsError = true;
            }
            finally
            {
                try
                {
                    docApp.Quit(SaveChanges: false);

                }
                catch { }
                finally
                {
                    Process[] wProcess = Process.GetProcessesByName("WINWORD");
                    foreach (Process p in wProcess)
                    {
                        p.Kill();
                    }
                }
                Marshal.ReleaseComObject(docApp);
                docApp = null;
                GC.Collect();
            }
        }
    }

The killing of the word is not fun, but can't let it hanging there and block others, right?

In the web/html i render html to a iframe.

There is a dropdown which contains the list of help documents. Value is the path to the html version of it and text is name of the document.

private void BindHelpContents()
    {
        List<string> lstHelpDocuments = new List<string>();
        HelpDocuments hDoc = new HelpDocuments(Server.MapPath("~/HelpDocx/docx/"));
        lstHelpDocuments = hDoc.GetHelpDocuments();
        int index = 1;
        ddlHelpDocuments.Items.Insert(0, new ListItem { Value = "0", Text = "---Select Document---", Selected = true });

        foreach (string strHelpDocument in lstHelpDocuments)
        {
            ddlHelpDocuments.Items.Insert(index, new ListItem { Value = strHelpDocument, Text = strHelpDocument.Split('\\')[strHelpDocument.Split('\\').Length - 1].Replace(".html", "") });
            index++;
        }
        FetchDocuments();

    }

on selected index changed, it is renedred to frame

    protected void RenderHelpContents(object sender, EventArgs e)
    {
        try
        {
            if (ddlHelpDocuments.SelectedValue == "0") return;
            string strHtml = ddlHelpDocuments.SelectedValue;
            string newaspxpage = strHtml.Replace(Server.MapPath("~/"), "~/");
            string pageVirtualPath = VirtualPathUtility.ToAbsolute(newaspxpage);// 
            documentholder.Attributes["src"] = pageVirtualPath;
        }
        catch
        {
            lblGError.Text = "Selected document doesn't exist, please refresh the page and try again. If that doesn't help, please contact Support";
        }
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow