Question

I'm trying to call the HtmlTidy library DLL from C#. There are a few examples floating around on the net but nothing definitive, and I'm having no end of trouble. I'm pretty certain the problem is with the P/Invoke declaration, but danged if I know where I'm going wrong.

I got the libtidy.dll from http://www.paehl.com/open_source/?HTML_Tidy_for_Windows which seems to be a current version.

Here's a console app that demonstrates the problem I'm having:

using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;

namespace ConsoleApplication5
{
    class Program
    {
        [StructLayout(LayoutKind.Sequential)]
        public struct TidyBuffer
        {
            public IntPtr bp;         // Pointer to bytes
            public uint size;         // # bytes currently in use
            public uint allocated;    // # bytes allocated
            public uint next;         // Offset of current input position
        };

        [DllImport("libtidy.dll")]
        public static extern int tidyBufAlloc(ref TidyBuffer tidyBuffer, uint allocSize);


        static void Main(string[] args)
        {
            Console.WriteLine(CleanHtml("<html><body><p>Hello World!</p></body></html>"));
        }

        static string CleanHtml(string inputHtml)
        {
            byte[] inputArray = Encoding.UTF8.GetBytes(inputHtml);
            byte[] inputArray2 = Encoding.UTF8.GetBytes(inputHtml);

            TidyBuffer tidyBuffer2;
            tidyBuffer2.size = 0;
            tidyBuffer2.allocated = 0;
            tidyBuffer2.next = 0;
            tidyBuffer2.bp = IntPtr.Zero;

            //
            // tidyBufAlloc overwrites inputArray2... why? how? seems like
            // tidyBufAlloc is stomping on the stack a bit too much... but
            // how? I've tried changing the calling convention to cdecl and
            // stdcall but no change.
            //
            Console.WriteLine((inputArray2 == null ? "Array2 null" : "Array2 not null"));
            tidyBufAlloc(ref tidyBuffer2, 65535);
            Console.WriteLine((inputArray2 == null ? "Array2 null" : "Array2 not null"));
            return "did nothing";
        }
    }
}

All in all I'm a bit stumped. Any help would be appreciated!

Solution

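You are working with an old definition of the TidyBuffer structure. The current structure has an extra pointer field (allocator) at the start, so it is larger than the one you declared. When you call tidyBufAlloc, the native code fills in the full structure and writes past the end of your smaller managed struct, overwriting the stack location that holds inputArray2. The new definition is: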

    [StructLayout(LayoutKind.Sequential)]
    public struct TidyBuffer
    {
        public IntPtr allocator;  // Pointer to custom allocator
        public IntPtr bp;         // Pointer to bytes
        public uint size;         // # bytes currently in use
        public uint allocated;    // # bytes allocated
        public uint next;         // Offset of current input position
    };
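
To see the size mismatch without touching libtidy.dll at all, here is a minimal sketch that compares the marshalled sizes of the two layouts. The names OldTidyBuffer, NewTidyBuffer and TidyBufferLayoutCheck are made up for this illustration; only the field lists come from the question and the answer above.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of HTML Tidy: it only measures the two layouts.
static class TidyBufferLayoutCheck
{
    // Layout from the question (the old definition).
    [StructLayout(LayoutKind.Sequential)]
    struct OldTidyBuffer
    {
        public IntPtr bp;         // Pointer to bytes
        public uint size;         // # bytes currently in use
        public uint allocated;    // # bytes allocated
        public uint next;         // Offset of current input position
    }

    // Layout from the answer (the new definition, with the allocator pointer first).
    [StructLayout(LayoutKind.Sequential)]
    struct NewTidyBuffer
    {
        public IntPtr allocator;  // Pointer to custom allocator
        public IntPtr bp;         // Pointer to bytes
        public uint size;         // # bytes currently in use
        public uint allocated;    // # bytes allocated
        public uint next;         // Offset of current input position
    }

    static void Main()
    {
        int oldSize = Marshal.SizeOf(typeof(OldTidyBuffer));
        int newSize = Marshal.SizeOf(typeof(NewTidyBuffer));

        Console.WriteLine("Old layout: {0} bytes", oldSize);
        Console.WriteLine("New layout: {0} bytes", newSize);

        // Native code that expects the new layout writes this many bytes
        // beyond the end of a struct declared with the old layout.
        Console.WriteLine("Overrun:    {0} bytes", newSize - oldSize);
    }
}
```

The overrun should come out as one pointer width (IntPtr.Size), which is just enough to clobber an adjacent object reference on the stack, such as inputArray2 in the question.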

OTHER TIPS

For what it's worth, we tried Tidy at work and switched to HtmlAgilityPack.

Try changing your tidyBufAlloc declaration to:

[DllImport("libtidy.dll", CharSet = CharSet.Ansi)]
private static extern int tidyBufAlloc(ref TidyBuffer Buffer, int allocSize);

Note the CharSet.Ansi addition and the "int allocSize" (instead of uint).

Also, see this sample code for an example of using HTML Tidy in C#.

In your example, if inputHtml is large, say 50K, inputArray and inputArray2 will also be 50K each.

You are then also trying to allocate 65K in the tidyBufAlloc call.

If a pointer is not initialised correctly, it is quite possible that a random .NET heap address is being used, so part or all of a seemingly unrelated variable or buffer gets overwritten. It is probably just luck, or the fact that you have already allocated large buffers, that you are not overwriting a code block, which would likely cause an invalid memory access error.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow