How can you programmatically (or with a tool) convert .MHT mhtml files to regular HTML and CSS files?

StackOverflow https://stackoverflow.com/questions/16203002

  •  11-04-2022

Question

Many tools can export a page as a single .MHT file. I want to convert that single file into a collection of files (an HTML file plus the relevant images and CSS files) that I could then upload to a web host so that it is consumable by all browsers. Does anybody know of any tools, libraries, or algorithms to do this?


Solution

Well, you can open the .MHT file in IE and then save it as a web page. I tested this with this page, and even though it looked odd in IE (it's IE after all), it saved and then opened fine in Chrome (as in, it looked like it should).

Barring that method, looking at the file itself, text blocks are saved nearly as-is (typically as quoted-printable), and all other content is saved in Base64. Each item of content is preceded by:

[Boundary]
Content-Type: [Mime Type]
Content-Transfer-Encoding: [Encoding Type]
Content-Location: [Full path of content]

Where [Mime Type], [Encoding Type], and [Full path of content] are variable. [Encoding Type] appears to be either base64 or quoted-printable. [Boundary] is defined in the beginning of the .MHT file like so:

From: <Saved by WebKit>
Subject: converter - How can you programmatically (or with a tool) convert .MHT mhtml        files to regular HTML and CSS files? - Stack Overflow
Date: Fri, 9 May 2013 13:53:36 -0400
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----=_NextPart_000_0C08_58653ABB.B67612B7"

Using that, you could make your own file parser if needed.
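
As a rough illustration of such a parser, here is a minimal Python sketch. It does not split on the boundary by hand; it assumes the standard library `email` module can read the file, since the layout described above is ordinary MIME. The function name `extract_mht`, the `resourceNNN` file naming, and the `index.html` output name are made up for this example, and the URL rewriting is a plain byte substitution that only works when the HTML refers to each resource by exactly the URL given in its Content-Location header (and when the HTML is in an ASCII-compatible encoding).

```python
import email
import mimetypes
import os


def extract_mht(mht_bytes, out_dir):
    """Unpack an MHT document into out_dir.

    Returns a dict mapping each resource's Content-Location to the
    local file name it was written to. The first text/html part is
    written as index.html with those URLs rewritten to local names.
    """
    msg = email.message_from_bytes(mht_bytes)
    os.makedirs(out_dir, exist_ok=True)
    root_html = None   # decoded bytes of the first text/html part
    files = {}         # Content-Location -> local file name
    counter = 0
    for part in msg.walk():
        if part.is_multipart():
            continue   # the multipart container itself carries no content
        # decode=True reverses base64 or quoted-printable as declared
        # in the part's Content-Transfer-Encoding header.
        data = part.get_payload(decode=True)
        ctype = part.get_content_type()
        if ctype == "text/html" and root_html is None:
            root_html = data
            continue
        counter += 1
        ext = mimetypes.guess_extension(ctype) or ".bin"
        name = "resource%03d%s" % (counter, ext)
        with open(os.path.join(out_dir, name), "wb") as fh:
            fh.write(data)
        location = part.get("Content-Location")
        if location:
            files[str(location).strip()] = name
    if root_html is None:
        raise ValueError("no text/html part found")
    # Replace longer URLs first so a URL that is a prefix of another
    # one does not clobber it.
    for location in sorted(files, key=len, reverse=True):
        root_html = root_html.replace(
            location.encode("utf-8", "replace"), files[location].encode("ascii"))
    with open(os.path.join(out_dir, "index.html"), "wb") as fh:
        fh.write(root_html)
    return files
```

Calling `extract_mht(open("page.mht", "rb").read(), "page_files")` (both names are placeholders) should leave an `index.html` plus the extracted resources in `page_files`. URLs that appear in the HTML in an escaped or relative form would not be rewritten by this sketch.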

OTHER TIPS

Besides IE and MS Word, there's an open-source cross-platform program called 'mht2html' first written in 2007 and last updated in 2016. It has both a GUI and terminal interface.

I haven't tested it yet but it seems to have received good reviews.

An MHT file is essentially a MIME document, so it's possible to use Chilkat.Mime or the completely free System.Net.Mime components to access its internal structure. If, for example, the MHT contains images, they can be replaced with base64 data URIs in the output HTML.

Imports System.IO 'StringWriter'
Imports System.Linq 'ToArray on the query results'
Imports HtmlAgilityPack
Imports Fizzler.Systems.HtmlAgilityPack
Imports Chilkat
Public Function ConvertMhtToHtml(ByVal mhtFile As String) As String
    Dim chilkatWholeMime As New Chilkat.Mime
    'Load mime'
    chilkatWholeMime.LoadMimeFile(mhtFile)
    'Get html string, which is 1-st part of mime'
    Dim html As String = chilkatWholeMime.GetPart(0).GetBodyDecoded
    'Create collection for storing url of images and theirs base64 representations'
    Dim allImages As New Specialized.NameValueCollection
    'Iterate through mime parts'
    For i = 1 To chilkatWholeMime.NumParts - 1
        Dim m As Chilkat.Mime = chilkatWholeMime.GetPart(i)
        'See if it is image'
        If m.IsImage AndAlso m.Encoding = "base64" Then
            allImages.Add(m.GetHeaderField("Content-Location"), "data:" + m.ContentType + ";base64," + m.GetBodyEncoded)
        End If : m.Dispose()
    Next : chilkatWholeMime.Dispose()
    'Now it is time to replace the source attribute of all images in HTML with dataURI'
    Dim htmlDoc As New HtmlDocument : htmlDoc.LoadHtml(html) : Dim docNode As HtmlNode = htmlDoc.DocumentNode
    For i = 0 To allImages.Count - 1
        'Select all images, whose src attribute is equal to saved URL'
        Dim keyURL As String = allImages.GetKey(i) 'Saved url from MHT'
        Dim elementsWithPics() As HtmlNode = docNode.QuerySelectorAll("img[src='" + keyURL + "']").ToArray
        Dim imgsrc As String = allImages.GetValues(i)(0) 'dataURI as base64 string'
        For j = 0 To elementsWithPics.Length - 1
            elementsWithPics(j).SetAttributeValue("src", imgsrc)
        Next
        'Select all elements, whose style attribute contains saved URL (*= is a substring match; ~= only matches whole space-separated words)'
        elementsWithPics = docNode.QuerySelectorAll("[style*='" + keyURL + "']").ToArray
        For j = 0 To elementsWithPics.Length - 1
            'Get and modify style'
            Dim modStyle As String = Strings.Replace(elementsWithPics(j).GetAttributeValue("style", String.Empty), keyURL, imgsrc, 1, 1, 1)
            elementsWithPics(j).SetAttributeValue("style", modStyle)
        Next : Erase elementsWithPics
    Next
    'Get final html'
    Dim tw As New StringWriter()
    htmlDoc.Save(tw) : html = tw.ToString : tw.Close() : tw.Dispose()
    Return html
End Function
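
For comparison, the same idea (inlining the images as data URIs so that the result is one self-contained HTML file) can be sketched in Python with only the standard library. This is not a port of the Chilkat code above: the function name `mht_to_single_html` is invented for the example, and it uses plain string replacement instead of CSS selectors, which assumes the HTML refers to each image by exactly the URL stored in its Content-Location header.

```python
import base64
import email


def mht_to_single_html(mht_bytes):
    """Return the first HTML part with image URLs replaced by data URIs."""
    msg = email.message_from_bytes(mht_bytes)
    html = None
    images = {}  # Content-Location -> data URI
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart container
        ctype = part.get_content_type()
        # decode=True undoes the part's base64 / quoted-printable encoding
        data = part.get_payload(decode=True)
        if ctype == "text/html" and html is None:
            charset = part.get_content_charset() or "utf-8"
            html = data.decode(charset, errors="replace")
        elif part.get_content_maintype() == "image":
            location = part.get("Content-Location")
            if location:
                b64 = base64.b64encode(data).decode("ascii")
                images[str(location).strip()] = "data:%s;base64,%s" % (ctype, b64)
    if html is None:
        raise ValueError("no text/html part found")
    # Substring replacement covers src="..." as well as url(...) inside
    # style attributes. Longer URLs go first so that a URL which is a
    # prefix of another one does not clobber it.
    for location in sorted(images, key=len, reverse=True):
        html = html.replace(location, images[location])
    return html
```

The returned string can be written to a single .html file. Large images make the file grow accordingly, since base64 adds roughly a third to their size.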

I think that @XGundam05 is correct. Here is what I did to make it work.

I started with a Windows Forms project in Visual Studio, added a WebBrowser control to the form, and then added two buttons. Then I added this code:

    private void button1_Click(object sender, EventArgs e)
    {
        webBrowser1.ShowSaveAsDialog();
    }

    private void button2_Click(object sender, EventArgs e)
    {
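        // Uri needs an absolute path; a bare relative file name throws UriFormatException.
        webBrowser1.Url = new Uri(System.IO.Path.GetFullPath("localfile.mht"));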
    }

You should be able to take this code, add a list of files, and process each one with a foreach. The WebBrowser control has a method called ShowSaveAsDialog(), which lets you save the page as .mht, as just the HTML, or as the complete page.

EDIT: You could use the WebBrowser's Document and scrape the information at this point, by adding a richTextBox and a public property as per Microsoft's example here: http://msdn.microsoft.com/en-us/library/ms171713.aspx

    public string Code
    {
        get
        {
            if (richTextBox1.Text != null)
            {
                return (richTextBox1.Text);
            }
            else
            {
                return ("");
            }
        }
        set
        {
            richTextBox1.Text = value;
        }
    }


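    private void button2_Click(object sender, EventArgs e)
    {
        // Navigation is asynchronous, so the document is read in DocumentCompleted
        // rather than immediately after setting Url.
        webBrowser1.DocumentCompleted -= webBrowser1_DocumentCompleted;
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        // Uri needs an absolute path; a bare relative file name throws UriFormatException.
        webBrowser1.Url = new Uri(System.IO.Path.GetFullPath("localfile.mht"));
    }

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (webBrowser1.Document == null)
        {
            return;
        }

        HtmlElementCollection elems = webBrowser1.Document.GetElementsByTagName("HTML");
        if (elems.Count == 1)
        {
            // Show the full markup of the loaded page in the rich text box.
            Code = elems[0].OuterHtml;
            foreach (HtmlElement img in webBrowser1.Document.Images)
            {
                //look for pictures to save
            }
        }
    }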

Automating IE was difficult and not usable end to end, so I think writing some code that does the conversion is the way to go. On GitHub I found this Python tool, which may be good:

https://github.com/Modified/MHTifier http://decodecode.net/elitist/2013/01/mhtifier/

If I have time, I'll try to do something similar in PowerShell.

Firefox has an embedded tool for this. Go to the menu (press Alt if it is hidden) and choose File -> Convert saved pages.

Step 1: Open the .MHT / .MHTML file in a browser.

Step 2: Right-click the page and choose to view the source code.

Step 3: Copy the source code, paste it into a new .TXT file, and then change the file extension to .HTML.

Licensed under: CC-BY-SA with attribution