
My question is sort of like this question but I have more constraints:

  • I know the documents are reasonably sane
  • they are very regular (they all came from the same source)
  • I want about 99% of the visible text
  • about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
  • I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?

I'm open to command line or batch processing tools as well as C/C#/D libraries.

You need to use the HTML Agility Pack.

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
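A minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced and that the text you care about sits under the body element. The names BodyTextExample and GetBodyText are made up for illustration and are not part of the library.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class BodyTextExample
{
    // Hypothetical helper: returns the text under <body>, or "" if there is no body.
    public static string GetBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants("body") enumerates matching elements; take the first one.
        HtmlNode body = doc.DocumentNode.Descendants("body").FirstOrDefault();
        if (body == null)
            return string.Empty;

        // InnerText can still contain entities such as &amp;, so decode them.
        // Note that it may also include script/style contents if the page has any.
        return HtmlEntity.DeEntitize(body.InnerText).Trim();
    }
}
```

Calling BodyTextExample.GetBodyText(File.ReadAllText(path)) per file should be enough for documents as regular as the ones described in the question.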


This code, which I hacked up today with HTML Agility Pack, extracts unformatted, trimmed text.

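using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static string ExtractText(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html); // parse the markup before walking the tree

    var chunks = new List<string>();
    foreach (var item in doc.DocumentNode.DescendantNodesAndSelf())
    {
        if (item.NodeType == HtmlNodeType.Text)
        {
            string text = item.InnerText.Trim();
            if (text != "")
                chunks.Add(text); // keep only non-empty text nodes
        }
    }
    return String.Join(" ", chunks);
}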

If you want to maintain some level of formatting you can build on the sample provided with the source.

using System.IO;
using HtmlAgilityPack;

public string Convert(string path)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(path); // load the document from a file

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    return sw.ToString();
}

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html); // load the document from a string

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    return sw.ToString();
}

public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;

        case HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode) node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;

        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}

private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
        ConvertTo(subnode, outText);
}

Here is the code I am using:

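using System.Text.RegularExpressions;
using System.Web;

public static string ExtractText(string html)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    string s = reg.Replace(html, " ");   // replace every tag with a space
    s = HttpUtility.HtmlDecode(s);       // turn entities such as &amp; into characters
    return s;
}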
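One caveat with plain tag stripping is that the contents of script and style elements, and of comments, survive as "text", and runs of whitespace are left in place. A hedged sketch of one way to cover those cases using only the base class library follows. The class name RegexTextExample and the exact patterns are my own assumptions rather than anything from the answer above, and regexes will still trip over unusual markup such as a ">" inside an attribute value.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTextExample
{
    // Removes <script> and <style> elements together with their contents.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes HTML comments.
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Matches any remaining tag.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Matches runs of whitespace so they can be collapsed.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ExtractText(string html)
    {
        string s = ScriptOrStyle.Replace(html, " ");
        s = Comment.Replace(s, " ");
        s = Tag.Replace(s, " ");
        s = WebUtility.HtmlDecode(s); // lives in System.Net, so no System.Web reference is needed
        return Whitespace.Replace(s, " ").Trim();
    }
}
```

The order matters here: script, style and comment blocks have to go before the generic tag pattern runs, otherwise only their delimiters are removed and their contents end up in the output.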

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regular expressions involved, just a pure recursive descent parser).

It's relatively simple if you load the HTML in C# using mshtml.dll or the WebBrowser control (C#/WinForms): you can then treat the entire HTML document as a tree and traverse it, capturing the InnerText of each element.

Alternatively, you can use document.all, which flattens the tree into a collection that you can iterate over, again capturing the InnerText.

Here's an example:

        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds all inner-text of a tag, including inner-text of sub-tags
             * ie. <html><body><a>test</a><b>test 2</b></body></html> would do:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
        };

Hope that helps!

Here you can download a tool, with source, that converts between HTML and XAML: XAML/HTML converter.

It contains an HTML parser (which obviously has to be much more tolerant than a standard XML parser), and you can traverse the HTML much like XML.

From the command line, you can use the Lynx text browser like this:

If you want to download a web page in formatted output (i.e., without HTML tags, but instead as it would appear in Lynx), then enter:

lynx -dump URL > filename

If there are any links on the page, the URLs for those links will be included at the end of the downloaded page.

You can disable the list of links with -nolist. For example:

lynx -dump -nolist URL > filename

Here is the best way:

  public static string StripHTML(string HTMLText)
  {
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
  }

Here's a class I developed to accomplish the same thing. All the available HTML parsing libraries were far too slow, and regex was far too slow as well. Functionality is explained in the code comments. From my benchmarks, this code is a little over 10X faster than HTML Agility Pack's equivalent code (included below) when tested on Amazon's landing page.

using System;

/// <summary>
/// The fast HTML text extractor class is designed to, as quickly and as ignorantly as possible,
/// extract text data from a given HTML character array. The class searches for and deletes
/// script and style tags in a first and second pass, with an optional third pass to do the same
/// to HTML comments, and then copies remaining non-whitespace character data to an output array.
/// All whitespace encountered is replaced with a single space to avoid multiple
/// whitespace in the output.
/// Note that the returned text content may still contain named and numbered character
/// references that, when decoded, may produce multiple whitespace.
/// Note also that the input array is overwritten in place.
/// </summary>
public class FastHtmlTextExtractor
{
    private readonly char[] SCRIPT_OPEN_TAG = new char[7] { '<', 's', 'c', 'r', 'i', 'p', 't' };
    private readonly char[] SCRIPT_CLOSE_TAG = new char[9] { '<', '/', 's', 'c', 'r', 'i', 'p', 't', '>' };

    private readonly char[] STYLE_OPEN_TAG = new char[6] { '<', 's', 't', 'y', 'l', 'e' };
    private readonly char[] STYLE_CLOSE_TAG = new char[8] { '<', '/', 's', 't', 'y', 'l', 'e', '>' };

    private readonly char[] COMMENT_OPEN_TAG = new char[3] { '<', '!', '-' };
    private readonly char[] COMMENT_CLOSE_TAG = new char[3] { '-', '-', '>' };

    // For each wiped range, holds the range length at the range's start index.
    private int[] m_deletionDictionary;

    public string Extract(char[] input, bool stripComments = false)
    {
        var len = input.Length;
        int next = 0;

        // Nothing to do for empty input (the search loops below assume at least one char).
        if(len == 0)
        {
            return string.Empty;
        }

        m_deletionDictionary = new int[len];

        // Wipe out all text content between style and script tags.
        FindAndWipe(SCRIPT_OPEN_TAG, SCRIPT_CLOSE_TAG, input);
        FindAndWipe(STYLE_OPEN_TAG, STYLE_CLOSE_TAG, input);

        if(stripComments)
        {
            // Wipe out everything between HTML comments.
            FindAndWipe(COMMENT_OPEN_TAG, COMMENT_CLOSE_TAG, input);
        }

        // Wipe all other tags now.
        while(next < len)
        {
            next = SkipUntil(next, '<', input);

            if(next < len)
            {
                var closeNext = SkipUntil(next, '>', input);

                if(closeNext < len)
                {
                    m_deletionDictionary[next] = (closeNext + 1) - next;
                    WipeRange(next, closeNext + 1, input);
                }

                next = closeNext + 1;
            }
        }

        // Collect all non-whitespace and non-null chars into a new
        // char array. All whitespace characters are skipped and replaced
        // with a single space char. Multiple whitespace is ignored.
        var lastSpace = true;
        var extractedPos = 0;
        var extracted = new char[len];

        for(next = 0; next < len; ++next)
        {
            if(m_deletionDictionary[next] > 0)
            {
                // Jump to the last char of the wiped range. It is '\0', so the check
                // below treats the whole range as one whitespace, and the loop
                // increment then lands on the first char after the range.
                next += m_deletionDictionary[next] - 1;
            }

            if(char.IsWhiteSpace(input[next]) || input[next] == '\0')
            {
                if(lastSpace)
                {
                    continue;
                }

                extracted[extractedPos++] = ' ';
                lastSpace = true;
            }
            else
            {
                lastSpace = false;
                extracted[extractedPos++] = input[next];
            }
        }

        return new string(extracted, 0, extractedPos);
    }

    /// <summary>
    /// Does a search in the input array for the characters in the supplied open and closing tag
    /// char arrays. Each match where both tag open and tag close are discovered causes the text
    /// in between the matches to be overwritten by Array.Clear().
    /// </summary>
    /// <param name="openingTag">
    /// The opening tag to search for.
    /// </param>
    /// <param name="closingTag">
    /// The closing tag to search for.
    /// </param>
    /// <param name="input">
    /// The input to search in.
    /// </param>
    private void FindAndWipe(char[] openingTag, char[] closingTag, char[] input)
    {
        int len = input.Length;
        int pos = 0;

        do
        {
            pos = FindNext(pos, openingTag, input);

            if(pos < len)
            {
                var closenext = FindNext(pos, closingTag, input);

                if(closenext < len)
                {
                    m_deletionDictionary[pos - openingTag.Length] = closenext - (pos - openingTag.Length);
                    WipeRange(pos - openingTag.Length, closenext, input);
                }

                if(closenext > pos)
                {
                    pos = closenext;
                }
            }
        }
        while(pos < len);
    }

    /// <summary>
    /// Skips as many characters as possible within the input array until the given char is
    /// found. The position of the first instance of the char is returned, or if not found, a
    /// position beyond the end of the input array is returned.
    /// </summary>
    /// <param name="pos">
    /// The starting position to search from within the input array.
    /// </param>
    /// <param name="c">
    /// The character to find.
    /// </param>
    /// <param name="input">
    /// The input to search within.
    /// </param>
    /// <returns>
    /// The position of the found character, or an index beyond the end of the input array.
    /// </returns>
    private int SkipUntil(int pos, char c, char[] input)
    {
        if(pos >= input.Length)
        {
            return pos;
        }

        do
        {
            if(input[pos] == c)
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Clears a given range in the input array.
    /// </summary>
    /// <param name="start">
    /// The start position from which the array will begin to be cleared.
    /// </param>
    /// <param name="end">
    /// The end position in the array, the position to clear up-until.
    /// </param>
    /// <param name="input">
    /// The source array wherein the supplied range will be cleared.
    /// </param>
    /// <remarks>
    /// Note that the second parameter is called end, not length. This parameter is meant to be
    /// a position in the array, not the amount of entries in the array to clear.
    /// </remarks>
    private void WipeRange(int start, int end, char[] input)
    {
        Array.Clear(input, start, end - start);
    }

    /// <summary>
    /// Finds the next occurrence of the supplied char array within the input array. This search
    /// ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position to start searching from.
    /// </param>
    /// <param name="what">
    /// The sequence of characters to find.
    /// </param>
    /// <param name="input">
    /// The input array to perform the search on.
    /// </param>
    /// <returns>
    /// The position of the end of the first matching occurrence. That is, the returned position
    /// points to the very end of the search criteria within the input array, not the start. If
    /// no match could be found, a position beyond the end of the input array will be returned.
    /// </returns>
    public int FindNext(int pos, char[] what, char[] input)
    {
        do
        {
            if(Next(ref pos, what, input))
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Probes the input array at the given position to determine if the next N characters
    /// match the supplied character sequence. This check ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position at which to check within the input array for a match to the supplied
    /// character sequence.
    /// </param>
    /// <param name="what">
    /// The character sequence to attempt to match. Note that whitespace between characters
    /// within the input array is acceptable.
    /// </param>
    /// <param name="input">
    /// The input array to check within.
    /// </param>
    /// <returns>
    /// True if the next N characters within the input array match the supplied search
    /// character sequence. Returns false otherwise.
    /// </returns>
    public bool Next(ref int pos, char[] what, char[] input)
    {
        int z = 0;

        do
        {
            // Whitespace and already-wiped chars never break a match.
            if(char.IsWhiteSpace(input[pos]) || input[pos] == '\0')
            {
                ++pos;
                continue;
            }

            if(input[pos] == what[z])
            {
                ++z;
                ++pos;
                continue;
            }

            return false;
        }
        while(pos < input.Length && z < what.Length);

        return z == what.Length;
    }
}

Equivalent in HtmlAgilityPack:

// Where m_whitespaceRegex is a Regex with [\s].
// Where sampleHtmlText is a raw HTML string.

var extractedSampleText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(sampleHtmlText);

if(doc != null && doc.DocumentNode != null)
{
    // Remove script and style elements so their contents are not treated as text.
    foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
        script.Remove();

    foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
        style.Remove();

    var allTextNodes = doc.DocumentNode.SelectNodes("//text()");
    if(allTextNodes != null && allTextNodes.Count > 0)
    {
        foreach(HtmlNode node in allTextNodes)
            extractedSampleText.Append(node.InnerText);
    }

    var finalText = m_whitespaceRegex.Replace(extractedSampleText.ToString(), " ");
}
