How can I convert a .docx to html using asp.net?

https://stackoverflow.com/questions/55113

09-06-2019

Question

Word 2007 saves its documents in the .docx format, which is really a zip file with a bunch of stuff in it, including an xml file with the document.

I want to be able to take a .docx file, drop it into a folder in my asp.net web app, and have the code open the .docx file and render the (xml part of the) document as a web page.

I've been searching the web for more information on this but so far haven't found much. My questions are:

Would you (a) use XSLT to transform the XML to HTML, or (b) use the xml manipulation libraries in .net (such as XDocument and XElement in 3.5) to convert to HTML, or (c) something else?
Do you know of any open source libraries/projects that have done this and that I could use as a starting point?
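
To make option (b) concrete, here is a minimal sketch, assuming .NET 4.5 or later (for `System.IO.Compression.ZipFile`; on 3.5 the zip would have to be opened some other way, for example with `System.IO.Packaging`). The class and method names (`DocxToHtmlSketch`, `BodyXmlToHtml`, `ConvertFile`) are made up for the example. It only emits one plain `<p>` per Word paragraph and ignores styles, run formatting, tables, and images, so it shows the shape of the approach rather than a usable converter.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

public static class DocxToHtmlSketch
{
    // Main WordprocessingML namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Emits one <p> per w:p element found in the main document part.
    public static string BodyXmlToHtml(XDocument documentXml)
    {
        var html = new StringBuilder();
        foreach (XElement paragraph in documentXml.Descendants(W + "p"))
        {
            // The visible text of a paragraph is spread over its w:t elements.
            string text = string.Concat(
                paragraph.Descendants(W + "t").Select(t => t.Value));
            html.Append("<p>").Append(WebUtility.HtmlEncode(text)).Append("</p>\n");
        }
        return html.ToString();
    }

    // Opens the .docx as a zip archive and converts word/document.xml.
    public static string ConvertFile(string docxPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");
            using (Stream stream = entry.Open())
            {
                return BodyXmlToHtml(XDocument.Load(stream));
            }
        }
    }
}
```

One known gap: paragraphs nested inside other paragraphs (text boxes, for instance) would be emitted twice by `Descendants`, so a real converter would need to walk the tree more carefully.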

Thanks!

Solution

Try this post? I don't know, but it might be what you are looking for.

Other suggestions

I wrote mammoth.js, which is a JavaScript library that converts docx files to HTML. If you want to do the rendering server-side in .NET, there is also a .NET version of Mammoth available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information, for instance by mapping paragraph styles in Word (such as Heading 1) to appropriate tags and styles in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, Mammoth probably isn't for you. If you have something that is already well structured and want to convert it to tidy HTML, Mammoth might do the trick.
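
For the .NET version, basic usage looks roughly like the sketch below. It assumes the Mammoth NuGet package is referenced; the wrapper class `MammothSketch`, the method `ToHtml`, and the Word style name "Section Title" are hypothetical and only there to show where a style mapping would go.

```csharp
using System;
using Mammoth;

public static class MammothSketch
{
    public static string ToHtml(string docxPath)
    {
        // Optional style map: render paragraphs that use the (hypothetical)
        // Word style "Section Title" as <h2> elements.
        var converter = new DocumentConverter()
            .AddStyleMap("p[style-name='Section Title'] => h2:fresh");

        var result = converter.ConvertToHtml(docxPath);

        // Warnings report things Mammoth could not map, such as unknown styles.
        foreach (string warning in result.Warnings)
        {
            Console.WriteLine(warning);
        }

        return result.Value; // the generated HTML fragment
    }
}
```

The returned HTML is a fragment, so it still has to be placed inside a page (or written to a file) by the calling code.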

Word 2007 has an API that you can use to convert to HTML. Here is a post that talks about it: http://msdn.microsoft.com/en-us/magazine/cc163526.aspx. You can find documentation on the API, and I remember that it includes a convert-to-HTML function.

This code will help convert a .docx file to text

function read_file_docx($filename){

    $striped_content = '';
    $content = '';

    // Nothing to read if the file is missing.
    if (!$filename || !file_exists($filename)) return false;

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        // Only the main document part holds the body text; skip the other
        // entries before opening them so no entry is left open.
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);     

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
     //header("Content-Type: plain/text");


    $striped_content = strip_tags($content);


      $striped_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$striped_content);

    echo nl2br($striped_content); 
}

I am using Interop. It is somewhat problematic but works well in most cases.

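using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;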

This one returns the list of paths of the documents converted to html

public List<string> GetHelpDocuments()
    {
        List<string> lstHtmlDocuments = new List<string>();
        string[] validExtensions = { ".doc", ".docx" };

        // The directory argument is empty in the original; it needs to be the
        // folder that holds the Word documents.
        foreach (string _sourceFilePath in Directory.GetFiles(""))
        {
            // Compare extensions case-insensitively so ".DOCX" is accepted too.
            string extension = System.IO.Path.GetExtension(_sourceFilePath).ToLowerInvariant();
            if (!validExtensions.Contains(extension)) continue;

            // sourceFilePath and destinationFilePath are fields read by ConvertToHtml().
            sourceFilePath = _sourceFilePath;
            destinationFilePath = System.IO.Path.ChangeExtension(_sourceFilePath, ".html");

            if (!System.IO.File.Exists(destinationFilePath))
            {
                ConvertToHtml();
            }
            // The HTML version already exists: reconvert only if the source is newer.
            else if (System.IO.File.GetLastWriteTime(destinationFilePath) < System.IO.File.GetLastWriteTime(sourceFilePath))
            {
                System.IO.File.Delete(destinationFilePath);
                ConvertToHtml();
            }

            lstHtmlDocuments.Add(destinationFilePath);
        }
        return lstHtmlDocuments;
    }

And this one converts a doc to html.

private void ConvertToHtml()
    {
        IsError = false;
        if (!System.IO.File.Exists(sourceFilePath)) return;

        Microsoft.Office.Interop.Word.Application docApp = null;
        try
        {
            docApp = new Microsoft.Office.Interop.Word.Application();
            docApp.Visible = true;
            docApp.DisplayAlerts = WdAlertLevel.wdAlertsNone;

            // Open the source document and save it next to it as HTML.
            object fileFormat = WdSaveFormat.wdFormatHTML;
            var doc = docApp.Documents.Open(sourceFilePath);
            doc.SaveAs2(destinationFilePath, fileFormat);
        }
        catch
        {
            IsError = true;
        }
        finally
        {
            try
            {
                // docApp is still null if Word failed to start.
                if (docApp != null) docApp.Quit(SaveChanges: false);
            }
            catch { }
            finally
            {
                // Last resort: this kills every WINWORD process on the machine,
                // not only the one started above.
                foreach (Process p in Process.GetProcessesByName("WINWORD"))
                {
                    // The process may already have exited after Quit().
                    try { p.Kill(); } catch { }
                }
            }
            if (docApp != null) Marshal.ReleaseComObject(docApp);
            docApp = null;
            GC.Collect();
        }
    }

Killing Word is not fun, but we can't leave it hanging there and blocking others, right?

In the web/html I render the html into an iframe.

There is a dropdown that contains the list of help documents. The value is the path to the html version of the document and the text is the name of the document.

private void BindHelpContents()
    {
        List<string> lstHelpDocuments = new List<string>();
        HelpDocuments hDoc = new HelpDocuments(Server.MapPath("~/HelpDocx/docx/"));
        lstHelpDocuments = hDoc.GetHelpDocuments();
        int index = 1;
        ddlHelpDocuments.Items.Insert(0, new ListItem { Value = "0", Text = "---Select Document---", Selected = true });

        foreach (string strHelpDocument in lstHelpDocuments)
        {
            ddlHelpDocuments.Items.Insert(index, new ListItem { Value = strHelpDocument, Text = System.IO.Path.GetFileNameWithoutExtension(strHelpDocument) });
            index++;
        }
        FetchDocuments();

    }

Once the selected index changes, the document is rendered into the frame

    protected void RenderHelpContents(object sender, EventArgs e)
    {
        try
        {
            if (ddlHelpDocuments.SelectedValue == "0") return;
            string strHtml = ddlHelpDocuments.SelectedValue;
            string newaspxpage = strHtml.Replace(Server.MapPath("~/"), "~/");
            string pageVirtualPath = VirtualPathUtility.ToAbsolute(newaspxpage);// 
            documentholder.Attributes["src"] = pageVirtualPath;
        }
        catch
        {
            lblGError.Text = "Selected document doesn't exist, please refresh the page and try again. If that doesn't help, please contact Support";
        }
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow