How can I convert HTML to RTF (rich text) in .NET without paying for a component?

https://stackoverflow.com/questions/150208

02-07-2019

Question

Is there a free third-party or .NET class that can convert HTML to RTF (for use in a rich-text-enabled Windows Forms control)?

The "free" requirement comes from the fact that I'm only working on a prototype: I can just load the BrowserControl and render HTML if need be (even if it is slow), and Developer Express is going to release its own such control soon-ish.

I don't want to learn to write RTF by hand, and I already know HTML, so I figure this is the quickest way to get some demonstrable code out the door.

Solution

Actually there is a simple and free solution: use your browser. This is the trick I used:

// Requires a reference to System.Windows.Forms and an STA thread.
var webBrowser = new WebBrowser();
webBrowser.CreateControl(); // only if needed
webBrowser.DocumentText = html; // html: your HTML string
// Pump messages until the control has loaded the document.
while (webBrowser.DocumentText != html)
    Application.DoEvents();
webBrowser.Document.ExecCommand("SelectAll", false, null);
webBrowser.Document.ExecCommand("Copy", false, null);
richTextBox.Paste(); // richTextBox: your RichTextBox control

This may be slower than other methods, but at least it's free and it works!

Other tips

Check out this CodeProject article on XHTML2RTF.

Expanding on Spartaco's answer, I implemented the following, which works great!

    Using reportWebBrowser As New WebBrowser
        reportWebBrowser.CreateControl()
        reportWebBrowser.DocumentText = sbHTMLDoc.ToString
        While reportWebBrowser.DocumentText <> sbHTMLDoc.ToString
            Application.DoEvents()
        End While
        reportWebBrowser.Document.ExecCommand("SelectAll", False, Nothing)
        reportWebBrowser.Document.ExecCommand("Copy", False, Nothing)

        Using reportRichTextBox As New RichTextBox
            reportRichTextBox.Paste()
            reportRichTextBox.SaveFile(DocumentFileName)
        End Using
    End Using

It's not perfect, of course, but here's the code I use to convert HTML to plain text.

(I was not the original author; I adapted it from code found on the web.)

public static string ConvertHtmlToText(string source) {

            string result;

            // Remove HTML Development formatting
            // Replace line breaks with space
            // because browsers inserts space
            result = source.Replace("\r", " ");
            // Replace line breaks with space
            // because browsers inserts space
            result = result.Replace("\n", " ");
            // Remove step-formatting
            result = result.Replace("\t", string.Empty);
            // Remove repeating spaces because browsers ignore them
            result = System.Text.RegularExpressions.Regex.Replace(result,
                                                                  @"( )+", " ");

            // Remove the header (prepare first by clearing attributes)
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*head([^>])*>", "<head>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<( )*(/)( )*head( )*>)", "</head>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(<head>).*(</head>)", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // remove all scripts (prepare first by clearing attributes)
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*script([^>])*>", "<script>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<( )*(/)( )*script( )*>)", "</script>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            //result = System.Text.RegularExpressions.Regex.Replace(result, 
            //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
            //         string.Empty, 
            //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<script>).*(</script>)", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // remove all styles (prepare first by clearing attributes)
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*style([^>])*>", "<style>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<( )*(/)( )*style( )*>)", "</style>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(<style>).*(</style>)", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // insert tabs in spaces of <td> tags
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*td([^>])*>", "\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // insert line breaks in places of <BR> and <LI> tags
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*br( )*>", "\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*li( )*>", "\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // insert line paragraphs (double line breaks) in place
            // if <P>, <DIV> and <TR> tags
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*div([^>])*>", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*tr([^>])*>", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*p([^>])*>", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // Remove remaining tags like <a>, links, images,
            // comments etc - anything thats enclosed inside < >
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<[^>]*>", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // replace special characters:
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&nbsp;", " ",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&bull;", " * ",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&lsaquo;", "<",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&rsaquo;", ">",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&trade;", "(tm)",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&frasl;", "/",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&lt;", "<",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&gt;", ">",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&copy;", "(c)",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&reg;", "(r)",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Remove all others. More can be added, see
            // http://hotwired.lycos.com/webmonkey/reference/special_characters/
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&(.{2,6});", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);


            // make line breaking consistent
            result = result.Replace("\n", "\r");

            // Remove extra line breaks and tabs:
            // replace over 2 breaks with 2 and over 4 tabs with 4. 
            // Prepare first to remove any whitespaces inbetween
            // the escaped characters and remove redundant tabs inbetween linebreaks
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)( )+(\r)", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\t)( )+(\t)", "\t\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\t)( )+(\r)", "\t\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)( )+(\t)", "\r\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Remove redundant tabs
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)(\t)+(\r)", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Replace multiple tabs following a line break with just one tab
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)(\t)+", "\r\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Initial replacement target string for linebreaks
            string breaks = "\r\r\r";
            // Initial replacement target string for tabs
            string tabs = "\t\t\t\t\t";
            for (int index = 0; index < result.Length; index++) {
                result = result.Replace(breaks, "\r\r");
                result = result.Replace(tabs, "\t\t\t\t");
                breaks = breaks + "\r";
                tabs = tabs + "\t";
            }

            // That's it.
            return result;

    }
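As a possible simplification of the entity handling above, here is a minimal sketch, assuming .NET 4.0 or later so that System.Net.WebUtility is available. Once the tags have been stripped, WebUtility.HtmlDecode can decode named and numeric entities in one call instead of one regex per entity. The names EntityDecoding and DecodeEntities are hypothetical. Unlike the code above, this yields the real characters (for example © rather than "(c)").

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Hypothetical helper: decodes HTML entities in text whose tags
    // have already been removed.
    public static string DecodeEntities(string text)
    {
        // HtmlDecode turns &lt; into <, &amp; into &, &#65; into A, and so on.
        string decoded = WebUtility.HtmlDecode(text);
        // &nbsp; decodes to U+00A0; map it to a plain space, as the code above does.
        return decoded.Replace('\u00A0', ' ');
    }
}
```

It would be called at the point where the per-entity replacements start, for example `result = EntityDecoding.DecodeEntities(result);`.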

Maybe what you need is an HTML editing control?

TL;DR: I recommend using the OpenXml format and the HtmlToOpenXml NuGet package if possible.

Microsoft Word COM

I didn't look into this much, because my use case is to run the functionality on a server, which makes COM components a poor choice.

XHTML2RTF

As JonathanParker mentioned, you can use this Code Project library.

The disadvantages are:

Limited HTML and CSS support
Not really .NET
...

Windows Forms WebBrowser

As Spartaco mentioned, you can use the Windows Forms WebBrowser control.

The disadvantages are:

Reference to System.Windows.Forms
Uses copy and paste (a problem with multithreading)
Works only on an STA thread

Unsupported features include:

Fonts
Colors
Numbered lists
Strikethrough (del element)
...

DevExpress

Code sample by "Paul V" from the DevExpress Support Center. (03.02.2015)

public String ConvertRTFToHTML(String RTF)
{   
    MemoryStream ms = new MemoryStream();
    StreamWriter writer = new StreamWriter(ms);
    writer.Write(RTF);
    writer.Flush();
    ms.Position = 0;
    String output = "";
    HtmlEditorExtension.Import(HtmlEditorImportFormat.Rtf, ms, (s, enumerable) => output = s);

    return output;
}

public String ConvertHTMLToRTF(String html)
{
    MemoryStream ms = new MemoryStream();
    var editor = new ASPxHtmlEditor { Html = html };

    editor.Export(HtmlEditorExportFormat.Rtf, ms);

    ms.Position = 0;
    StreamReader reader = new StreamReader(ms);

    return reader.ReadToEnd();
}

Or you can use the RichEditDocumentServer type, as shown in this example.

A DevExpress license can range from around USD 1,500 to USD 2,200.

It is unknown what is actually supported.

The disadvantages are:

Price
A lot of references for one small thing
More?

Unsupported features include:

Strikethrough (del element)

SautinSoft

public string ConvertHTMLToRTF(string html)
{
    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    return h.ConvertString(html);
}

public string ConvertRTFToHTML(string rtf)
{
    SautinSoft.RtfToHtml r = new SautinSoft.RtfToHtml();
    byte[] bytes = Encoding.ASCII.GetBytes(rtf);
    r.OpenDocx(bytes );
    return r.ToHtml();
}

More examples and configuration options can be found here and here.

A license for this component can cost from USD 400 to USD 2,000.

The following are supported:

HTML 3.2
HTML 4.01
HTML 5
CSS
XHTML

The disadvantages are:

Not sure how active development is
Price

Known issues:

Converting numbered lists from the Angular Trix editor breaks the indentation

DIY

If you only need to support limited functionality, you can write your own converter. I don't recommend this if the set of supported features is very large.

I have a small sample project here, but in its current state it is for educational purposes only.
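To give a rough idea of what such a hand-written converter involves, here is a minimal sketch. It is not the sample project mentioned above; the class TinyHtmlToRtf and its methods are hypothetical. It assumes only b/strong, i/em, u, br and p need support, drops every other tag, and does not handle comments, scripts or styles.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class TinyHtmlToRtf
{
    // Hypothetical, deliberately tiny converter for a handful of inline tags.
    public static string Convert(string html)
    {
        var rtf = new StringBuilder(@"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0 ");
        // Walk the input as a sequence of tags and the text runs between them.
        foreach (Match m in Regex.Matches(html, @"<(/?)(\w+)[^>]*>|[^<]+"))
        {
            if (m.Groups[2].Success)
                rtf.Append(MapTag(m.Groups[2].Value.ToLowerInvariant(), m.Groups[1].Value == "/"));
            else
                rtf.Append(Escape(WebUtility.HtmlDecode(m.Value)));
        }
        return rtf.Append('}').ToString();
    }

    // Maps a supported tag to its RTF control word; unsupported tags yield "".
    static string MapTag(string name, bool closing)
    {
        switch (name)
        {
            case "b": case "strong": return closing ? @"\b0 " : @"\b ";
            case "i": case "em": return closing ? @"\i0 " : @"\i ";
            case "u": return closing ? @"\ulnone " : @"\ul ";
            case "br": return @"\line ";
            case "p": return closing ? @"\par " : "";
            default: return "";
        }
    }

    // Escapes RTF special characters and writes non-ASCII as \uN? escapes.
    static string Escape(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}') sb.Append('\\').Append(c);
            else if (c > 127) sb.Append(@"\u").Append((short)c).Append('?');
            else sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Even this small subset needs tag mapping, entity decoding and RTF escaping, which is why the approach stops scaling once lists, tables or CSS come into play.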

OpenXml

If the OpenXml format is also fine for your use case, you can use the HtmlToOpenXml NuGet package. It's free and supports all the features I tested the other solutions against.

The project is based on the Open XML SDK by Microsoft and seems active.

public static byte[] ConvertHtmlToOpenXml(string html)
{
    using (var generatedDocument = new MemoryStream())
    {
        using (var package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
        {
            var mainPart = package.MainDocumentPart;
            if (mainPart == null)
            {
                mainPart = package.AddMainDocumentPart();
                new Document(new Body()).Save(mainPart);
            }

            var converter = new HtmlConverter(mainPart);
            converter.ParseHtml(html);

            mainPart.Document.Save();
        }

        return generatedDocument.ToArray();
    }
}

Link to an example gist

I recommend the console tool called Pandoc. It's not a component, but a huge conversion package. I use it to convert between HTML and LaTeX. It's great.

You can find the full list of supported formats on the program's page.

To convert an HTML document to RTF format, you type at the console:

pandoc filename.html -f html -t rtf -s -o filename.rtf
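Since the question is about .NET, the same command could be launched from code. The following is only a sketch, assuming a pandoc executable is available on the PATH; the class PandocRunner and its methods are hypothetical names, and the quoting is naive (paths containing double quotes are not handled).

```csharp
using System;
using System.Diagnostics;

public static class PandocRunner
{
    // Builds the same arguments as the console command above,
    // with both paths wrapped in double quotes.
    public static string BuildArguments(string inputPath, string outputPath)
    {
        return "\"" + inputPath + "\" -f html -t rtf -s -o \"" + outputPath + "\"";
    }

    // Assumption: "pandoc" resolves via PATH. Returns pandoc's exit code.
    public static int Convert(string inputPath, string outputPath)
    {
        var startInfo = new ProcessStartInfo("pandoc", BuildArguments(inputPath, outputPath))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call such as `PandocRunner.Convert("filename.html", "filename.rtf")` would then write the RTF file next to the input; a non-zero return value means pandoc reported an error.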

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow