Question

Hey, I am attempting to use the Microsoft.MSHTML (Version 7.0.3300.0) library to extract the body text from an HTML string. I've abstracted this functionality into a single helper method GetBody(string).

When called in an infinite loop, the process eventually runs out of memory (confirmed by eyeballing Mem Usage in Task Manager). I suspect the problem is due to my incorrect cleanup of the MSHTML objects. What am I doing wrong?

My current definition of GetBody(string) is:

public static string GetBody(string html)
{
    mshtml.IHTMLDocument2 htmlDoc = null;
    mshtml.IHTMLElement bodyElement = null;
    string body;

    try
    {
        htmlDoc = new mshtml.HTMLDocumentClass();
        htmlDoc.write(html);
        bodyElement = htmlDoc.body;
        body = bodyElement.innerText;
    }
    catch (Exception ex)
    {
        Trace.TraceError("Failed to use MSHTML to parse HTML body: " + ex.Message);
        body = html; // fall back to the raw input; "email" is not in scope in this helper
    }
    finally
    {
        if (bodyElement != null)
            Marshal.ReleaseComObject(bodyElement);
        if (htmlDoc != null)
            Marshal.ReleaseComObject(htmlDoc);
    }

    return body;
}

Edit: the memory leak was traced to the code used to populate html — in this case, Outlook Redemption.


Solution

It has been a long time since I last used MSHTML, but doesn't the IHTMLDocument2 interface have a close method? Have you tried calling it?

How long did the loop run before the leak was obvious?

I will see if I can dig through some of the legacy MSHTML code we have here and see how the developers released the objects.

EDIT:

The old code we have here calls close on the IHTMLDocument2 and then calls ReleaseComObject, as you do.

One thing to note, though: ReleaseComObject is called in a loop until it returns zero. This ensures that all COM wrappers and the underlying object are fully released; there is a note about it here.
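Putting those two points together, the finally block might look something like the sketch below. This is untested and assumes the same htmlDoc/bodyElement variables as in the question; Marshal.FinalReleaseComObject is a built-in shorthand for the same release-until-zero loop:

    finally
    {
        if (bodyElement != null)
        {
            // Loop until the RCW reference count reaches zero.
            while (Marshal.ReleaseComObject(bodyElement) > 0) { }
        }
        if (htmlDoc != null)
        {
            htmlDoc.close();
            while (Marshal.ReleaseComObject(htmlDoc) > 0) { }
            // Alternatively: Marshal.FinalReleaseComObject(htmlDoc);
        }
    }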

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow