Question

I tried to convert HTML to plain text with the following function, but the converted output is still wrong.

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<"; // matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)"; // match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>"; // matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    // Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text);
    // Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);
    text = text.Replace(">", "");

    return text;
}

When I debug the code, the plain-text output still contains \r and \r\n. The function does not convert the HTML to plain text properly. Can anyone suggest another conversion function?

Thanks


Solution

You can use the HtmlToText demo from HtmlAgilityPack, which is part of the library's sample code.

I had a look at the other answers, but they all suggest solutions built on regular expressions. I think HtmlAgilityPack hasn't received enough attention.

All you need to do is add the HtmlAgilityPack NuGet package to your project and follow the example.
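
For orientation, here is a minimal sketch of that approach. It is not the HtmlToText demo itself, only a rough approximation, and it assumes the HtmlAgilityPack NuGet package is referenced. The class name HtmlTextSketch and the helper HtmlToPlainText are hypothetical names chosen for illustration. The HtmlAgilityPack calls used are HtmlDocument.LoadHtml, SelectNodes, Remove, ReplaceChild, CreateTextNode, InnerText and HtmlEntity.DeEntitize.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlTextSketch // hypothetical name, not part of the library
{
    // Rough approximation of an HTML-to-text conversion; not the official demo.
    public static string HtmlToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // Drop nodes whose content should never appear in the text.
        // SelectNodes returns null (not an empty list) when nothing matches.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (hidden != null)
        {
            foreach (var node in hidden)
            {
                node.Remove();
            }
        }

        // Turn each <br> into a real line break before reading the text.
        var breaks = doc.DocumentNode.SelectNodes("//br");
        if (breaks != null)
        {
            foreach (var br in breaks)
            {
                br.ParentNode.ReplaceChild(doc.CreateTextNode(Environment.NewLine), br);
            }
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize converts entities such as &amp; back to characters.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        return text.Trim();
    }
}
```

A call such as HtmlTextSketch.HtmlToPlainText("<p>Hello<br/>world &amp; all</p>") should give the two words on separate lines with the entity decoded. Note that a debugger typically displays line breaks inside a string as the escape sequences \r\n, so seeing them there does not by itself mean the output is wrong. Printing the string or writing it to a file shows the actual line breaks.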

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow