How do I convert a .docx file to HTML using asp.net?

https://stackoverflow.com/questions/55113

09-06-2019

Question

Word 2007 saves its documents in .docx format, which is really a ZIP file containing a number of parts, including an XML file with the document itself.

I want to be able to take a .docx file, drop it into a folder in my asp.net web app, and have the code open the .docx file and render the (XML part of the) document as a web page.

I have been searching the web for more information on this, but so far I haven't found much. My questions are:

Would you (a) use XSLT to transform the XML to HTML, or (b) use XML manipulation libraries in .net (such as XDocument and XElement in 3.5) to convert to HTML, or (c) something else?
Do you know of any open source libraries/projects that have done this and that I could use as a starting point?

Thanks!

Solution

Try this post? I'm not sure, but it may be what you are looking for.

Other tips

I wrote mammoth.js, which is a JavaScript library that converts docx files to HTML. If you want to do the rendering server-side in .NET, there is also a .NET version of Mammoth available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information, for instance by mapping paragraph styles in Word (such as Heading 1) to appropriate tags and styles in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have a document that is already well structured and want to convert it to tidy HTML, Mammoth might do the trick.
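A minimal sketch of what the .NET route might look like. It assumes the Mammoth NuGet package is installed and that its `DocumentConverter` class exposes `AddStyleMap`, `ConvertToHtml`, `Value` and `Warnings` as described in the project's README; check the current README before relying on it. The class name `DocxRenderer` and the Word style name "Section Title" are hypothetical and only serve as an illustration.

```csharp
using System.Diagnostics;
using Mammoth;

public static class DocxRenderer
{
    // Converts the .docx file at docxPath to an HTML fragment.
    public static string ToHtml(string docxPath)
    {
        var converter = new DocumentConverter()
            // Optional, hypothetical mapping: paragraphs that use the Word
            // style "Section Title" are emitted as <h1> elements.
            .AddStyleMap("p[style-name='Section Title'] => h1:fresh");

        var result = converter.ConvertToHtml(docxPath);

        // Warnings report things that could not be mapped, such as unknown styles.
        foreach (var warning in result.Warnings)
        {
            Trace.WriteLine(warning);
        }

        return result.Value;
    }
}
```

The returned string is an HTML fragment rather than a full page, so in a web app it would typically be written into an existing page or control.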

Word 2007 has an API that you can use to convert to HTML. Here is a post that talks about it: http://msdn.microsoft.com/en-us/magazine/cc163526.aspx. You can find documentation on the API elsewhere, but I remember that there is a convert-to-HTML function in it.

This code helps convert a .docx file to text:

function read_file_docx($filename) {
    // Bail out early if the path is empty or the file does not exist.
    if (!$filename || !file_exists($filename)) {
        return false;
    }

    // A .docx file is a ZIP archive; the body text lives in word/document.xml.
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }
    $content = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($content === false) {
        return false;
    }

    // Turn table-cell boundaries into spaces and paragraph ends into line breaks.
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    // Drop the remaining XML tags, then every character outside a small
    // ASCII whitelist (non-ASCII letters are removed as well).
    $stripped_content = strip_tags($content);
    $stripped_content = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $stripped_content);

    echo nl2br($stripped_content);
}
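Since the question is about asp.net, here is a rough C# sketch of the same idea as the PHP function above, using option (b) from the question: open the ZIP, load word/document.xml with XDocument, and emit one <p> per Word paragraph. It assumes .NET 4.5 or later (for System.IO.Compression.ZipArchive); the class name `DocxParagraphs` is made up for this example. Formatting, images, lists and tables are ignored, and paragraphs nested inside text boxes would appear twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Text;
using System.Xml.Linq;

public static class DocxParagraphs
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one <p> element per Word paragraph, or null when the
    // archive has no word/document.xml entry.
    public static string ToHtml(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null) return null;

            XDocument xml;
            using (Stream entryStream = entry.Open())
            {
                xml = XDocument.Load(entryStream);
            }

            var html = new StringBuilder();
            foreach (XElement paragraph in xml.Descendants(W + "p"))
            {
                // Concatenate the text runs (w:t) that belong to this paragraph.
                string text = string.Concat(
                    paragraph.Descendants(W + "t").Select(t => t.Value));

                // HTML-encode the text so characters like < and & stay literal.
                html.Append("<p>")
                    .Append(WebUtility.HtmlEncode(text))
                    .Append("</p>\n");
            }
            return html.ToString();
        }
    }
}
```

In a page this could be called as `DocxParagraphs.ToHtml(File.OpenRead(path))`, with the stream wrapped in a `using` block, and the result written into a literal control. An XSLT stylesheet applied to the same XML would be the equivalent of option (a).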

I use Interop. It is somewhat problematic, but it works fine in most cases.

using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

This returns the list of paths of the documents converted to HTML:

public List<string> GetHelpDocuments()
    {
        List<string> lstHtmlDocuments = new List<string>();
        string[] validExtensions = { ".doc", ".docx" };

        // Pass the folder to scan here; Directory.GetFiles throws on an empty path.
        foreach (string _sourceFilePath in Directory.GetFiles(""))
        {
            // Compare extensions case-insensitively so that ".DOCX" is accepted too.
            string extension = System.IO.Path.GetExtension(_sourceFilePath).ToLowerInvariant();
            if (!validExtensions.Contains(extension)) continue;

            // sourceFilePath and destinationFilePath are class fields shared with ConvertToHtml().
            sourceFilePath = _sourceFilePath;
            destinationFilePath = System.IO.Path.ChangeExtension(_sourceFilePath, ".html");

            // Convert when no HTML version exists yet or when it is older than the Word file.
            bool isStale = !System.IO.File.Exists(destinationFilePath)
                || System.IO.File.GetLastWriteTimeUtc(destinationFilePath) < System.IO.File.GetLastWriteTimeUtc(sourceFilePath);
            if (isStale)
            {
                System.IO.File.Delete(destinationFilePath);
                ConvertToHtml();
            }

            lstHtmlDocuments.Add(destinationFilePath);
        }
        return lstHtmlDocuments;
    }

And this converts the doc to HTML:

private void ConvertToHtml()
    {
        IsError = false;
        if (!System.IO.File.Exists(sourceFilePath)) return;

        Microsoft.Office.Interop.Word.Application docApp = null;
        try
        {
            docApp = new Microsoft.Office.Interop.Word.Application();
            docApp.Visible = true;
            docApp.DisplayAlerts = WdAlertLevel.wdAlertsNone;

            // Open the Word document and save it again in HTML format.
            object fileFormat = WdSaveFormat.wdFormatHTML;
            var doc = docApp.Documents.Open(sourceFilePath);
            doc.SaveAs2(destinationFilePath, fileFormat);
            doc.Close(SaveChanges: false);
        }
        catch
        {
            IsError = true;
        }
        finally
        {
            try
            {
                if (docApp != null) docApp.Quit(SaveChanges: false);
            }
            catch { }
            finally
            {
                // Last resort: this kills every WINWORD process on the machine,
                // not only the instance started above.
                foreach (Process p in Process.GetProcessesByName("WINWORD"))
                {
                    try { p.Kill(); } catch { }
                }
            }
            // Release the COM wrapper only if Word was actually started.
            if (docApp != null) Marshal.ReleaseComObject(docApp);
            docApp = null;
            GC.Collect();
        }
    }

Killing Word is no fun, but you can't let it hang there and block others, can you?

In the web page I render the HTML in an iframe.

There is a drop-down that contains the list of help documents. The value is the path to the HTML version of the document, and the text is the name of the document.

private void BindHelpContents()
    {
        HelpDocuments hDoc = new HelpDocuments(Server.MapPath("~/HelpDocx/docx/"));
        List<string> lstHelpDocuments = hDoc.GetHelpDocuments();
        int index = 1;
        ddlHelpDocuments.Items.Insert(0, new ListItem { Value = "0", Text = "---Select Document---", Selected = true });

        foreach (string strHelpDocument in lstHelpDocuments)
        {
            // Value: physical path of the HTML file; Text: file name without the ".html" extension.
            ddlHelpDocuments.Items.Insert(index, new ListItem { Value = strHelpDocument, Text = System.IO.Path.GetFileNameWithoutExtension(strHelpDocument) });
            index++;
        }
        FetchDocuments();
    }

When the selected index changes, the document is rendered in the frame:

    protected void RenderHelpContents(object sender, EventArgs e)
    {
        try
        {
            if (ddlHelpDocuments.SelectedValue == "0") return;
            string strHtml = ddlHelpDocuments.SelectedValue;
            // Map the physical path back to an app-relative virtual path ("~/...").
            string newaspxpage = strHtml.Replace(Server.MapPath("~/"), "~/").Replace('\\', '/');
            string pageVirtualPath = VirtualPathUtility.ToAbsolute(newaspxpage);
            documentholder.Attributes["src"] = pageVirtualPath;
        }
        catch
        {
            lblGError.Text = "Selected document doesn't exist, please refresh the page and try again. If that doesn't help, please contact Support";
        }
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow