How do I convert a .docx to HTML using ASP.NET?

https://stackoverflow.com/questions/55113

09-06-2019

Question

Word 2007 saves its documents in .docx format, which is really a zip file with a bunch of stuff in it, including an XML file with the document.

I want to be able to take a .docx file, drop it into a folder in my ASP.NET web app, and have the code open the .docx file and render the document (the XML part of it) as a web page.

I've been searching the web for more information on this but so far haven't found much. My questions are:

Would you (a) use XSLT to transform the XML to HTML, (b) use XML manipulation libraries in .NET (such as XDocument and XElement in 3.5) to convert to HTML, or (c) something else?
Do you know of any open source libraries/projects that have done this and that I could use as a starting point?

Thanks!

Solution

Try this post? I don't know, but it might be what you are looking for.

Other tips

I wrote mammoth.js, which is a JavaScript library that converts docx files to HTML. If you want to do the rendering server-side in .NET, there is also a .NET version of Mammoth available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information, for example by mapping paragraph styles in Word (such as Heading 1) to appropriate tags and styles in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that is already well structured and want to convert it to tidy HTML, Mammoth might do the trick.

Word 2007 has an API that you can use to convert to HTML. Here's a post that talks about it: http://msdn.microsoft.com/en-us/magazine/cc163526.aspx. You can find documentation on the API, but I remember that there is a convert-to-HTML function in it.

This code will help convert a .docx file to text

function read_file_docx($filename) {

    // Bail out early when no usable file was given.
    if (!$filename || !file_exists($filename)) {
        return false;
    }

    // A .docx is a zip archive; the document body lives in word/document.xml.
    // ZipArchive is used here because zip_open()/zip_read() are deprecated as of PHP 8.0.
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }

    $content = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($content === false) {
        return false;
    }

    // Separate adjacent table cells with a space and end each paragraph with a line break.
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    // Drop the remaining XML markup.
    $stripped_content = strip_tags($content);

    // Keep only a conservative ASCII set; note that this also removes accented characters.
    $stripped_content = preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', '', $stripped_content);

    echo nl2br($stripped_content);
}
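
Since the question is about ASP.NET, the same idea as the PHP function above (open the zip, read word/document.xml, pull out the text) can be sketched in C# with XDocument, which is option (b) from the question. This is only a minimal sketch under stated assumptions: the class name DocxToHtmlSketch and the method ToHtml are hypothetical, it needs .NET 4.5 or later for ZipArchive, and it emits one <p> per Word paragraph while ignoring styles, tables, images, and lists. Paragraphs nested inside text boxes would show up twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class DocxToHtmlSketch
{
    // Main WordprocessingML namespace used by the elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads word/document.xml from a .docx stream and emits one <p> per w:p paragraph.
    public static string ToHtml(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            XDocument xml;
            using (Stream s = entry.Open())
                xml = XDocument.Load(s);

            var html = new StringBuilder();
            foreach (XElement p in xml.Descendants(W + "p"))
            {
                // Join the text runs (w:t) of the paragraph and HTML-encode the result.
                string text = string.Concat(p.Descendants(W + "t").Select(t => t.Value));
                html.Append("<p>").Append(WebUtility.HtmlEncode(text)).Append("</p>\n");
            }
            return html.ToString();
        }
    }

    public static void Main()
    {
        // Build a tiny in-memory .docx so the sketch runs without a real file.
        var ms = new MemoryStream();
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        using (var w = new StreamWriter(zip.CreateEntry("word/document.xml").Open()))
        {
            w.Write("<w:document xmlns:w=\"" + W.NamespaceName + "\"><w:body>" +
                    "<w:p><w:r><w:t>Hello &amp; welcome</w:t></w:r></w:p>" +
                    "</w:body></w:document>");
        }
        ms.Position = 0;
        Console.Write(ToHtml(ms)); // prints <p>Hello &amp; welcome</p>
    }
}
```

In a web app, the stream would come from the uploaded or stored file (for example File.OpenRead on a path under the application folder), and the returned string could be written into a placeholder control. For anything beyond plain paragraphs, a dedicated library is likely a better starting point than extending this by hand.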

I'm using Interop. It is somewhat problematic, but it works fine in most cases.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

This one returns the list of paths to the documents converted to HTML.

public List<string> GetHelpDocuments()
    {
        // sourceFilePath and destinationFilePath are fields of the containing class (declarations not shown in the post).
        List<string> lstHtmlDocuments = new List<string>();
        string[] validExtensions = { ".doc", ".docx" };

        // The folder argument is empty in the post as published; pass the folder that holds the Word files.
        foreach (string _sourceFilePath in Directory.GetFiles(""))
        {
            // Skip anything that is not a Word document (case-insensitive match).
            string extension = System.IO.Path.GetExtension(_sourceFilePath);
            if (!validExtensions.Contains(extension, StringComparer.OrdinalIgnoreCase)) continue;

            sourceFilePath = _sourceFilePath;
            // ChangeExtension only touches the extension, unlike a string Replace over the whole path.
            destinationFilePath = System.IO.Path.ChangeExtension(_sourceFilePath, ".html");

            // Reconvert when the HTML copy is missing or older than the Word source.
            if (!System.IO.File.Exists(destinationFilePath) ||
                System.IO.File.GetLastWriteTimeUtc(destinationFilePath) < System.IO.File.GetLastWriteTimeUtc(sourceFilePath))
            {
                System.IO.File.Delete(destinationFilePath); // no-op when the file does not exist
                ConvertToHtml();
            }

            lstHtmlDocuments.Add(destinationFilePath);
        }
        return lstHtmlDocuments;
    }

And this one converts the doc to HTML.

private void ConvertToHtml()
    {
        IsError = false;
        if (!System.IO.File.Exists(sourceFilePath)) return;

        Microsoft.Office.Interop.Word.Application docApp = null;
        try
        {
            // Start Word through COM automation and suppress its dialogs.
            docApp = new Microsoft.Office.Interop.Word.Application();
            docApp.Visible = true;
            docApp.DisplayAlerts = WdAlertLevel.wdAlertsNone;

            // Open the source document and save it next to the original as HTML.
            object fileFormat = WdSaveFormat.wdFormatHTML;
            var doc = docApp.Documents.Open(sourceFilePath);
            doc.SaveAs2(destinationFilePath, fileFormat);
        }
        catch
        {
            IsError = true;
        }
        finally
        {
            try
            {
                if (docApp != null) docApp.Quit(SaveChanges: false);
            }
            catch { }
            finally
            {
                // Last resort: kill every remaining WINWORD process, including ones this code did not start.
                foreach (Process p in Process.GetProcessesByName("WINWORD"))
                {
                    p.Kill();
                }
            }
            // Release the COM wrapper only if Word was actually started.
            if (docApp != null) Marshal.ReleaseComObject(docApp);
            docApp = null;
            GC.Collect();
        }
    }

Killing Word is no fun, but we can't let it hang there and block others, right?

On the web/HTML side I render the HTML in an iframe.

There is a dropdown that contains the list of help documents. The value is the path to the HTML version and the text is the name of the document.

private void BindHelpContents()
    {
        HelpDocuments hDoc = new HelpDocuments(Server.MapPath("~/HelpDocx/docx/"));
        List<string> lstHelpDocuments = hDoc.GetHelpDocuments();
        int index = 1;
        ddlHelpDocuments.Items.Insert(0, new ListItem { Value = "0", Text = "---Select Document---", Selected = true });

        foreach (string strHelpDocument in lstHelpDocuments)
        {
            // Value: physical path of the HTML file. Text: file name without its extension.
            ddlHelpDocuments.Items.Insert(index, new ListItem { Value = strHelpDocument, Text = System.IO.Path.GetFileNameWithoutExtension(strHelpDocument) });
            index++;
        }
        FetchDocuments();

    }

When the selected index changes, the document is re-rendered in the frame.

    protected void RenderHelpContents(object sender, EventArgs e)
    {
        try
        {
            if (ddlHelpDocuments.SelectedValue == "0") return;
            string strHtml = ddlHelpDocuments.SelectedValue;
            string newaspxpage = strHtml.Replace(Server.MapPath("~/"), "~/");
            string pageVirtualPath = VirtualPathUtility.ToAbsolute(newaspxpage);
            documentholder.Attributes["src"] = pageVirtualPath;
        }
        catch
        {
            lblGError.Text = "Selected document doesn't exist, please refresh the page and try again. If that doesn't help, please contact Support";
        }
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow