How can I embed a raw HTML string into a Text.XmlHtml node structure?

https://stackoverflow.com/questions/19190125

30-06-2022

Question

This is kind of a corner case. I'm running Haskell, Text.XmlHtml (version 0.2.3). I'm getting my source data from Pandoc (version 1.12). My source files are all in Markdown format.

The corner case arises when I have raw HTML directly in my Markdown file. This is, of course, supported by the Markdown format, and is sometimes the only way for me to get the kind of table layout that I want. Pandoc reads the file just fine, but when it gets to the HTML section, what it emits is roughly like this:

[ RawInline (Format "html") "<a href=\"abcdefg\">"
, RawInline (Format "html") "<img src=\"image.png\" />"
, RawInline (Format "html") "</a>" ]

So... converting this into a hierarchical tree could get very complicated. The desired result, in XmlHtml would be something like this:

Element "a" [("href", "abcdefg")] [Element "img" [("src", "image.png")]]

But that is very difficult to get when I'm dealing with a structure that was hierarchical (everything else Pandoc emits is nicely hierarchical) and suddenly is not. The non-hierarchical part can only be found by basically building an HTML parser, one that works across multiple strings that surround other structures.

Ideally, what I would like to emit is a simple TextNode:

TextNode "<a href=\"abcdefg\"><img src=\"image.png\" /></a>"

I could do that either by emitting a bunch of TextNodes, one for each RawInline, or by glomming together the RawInline elements. The point is that I want to emit a TextNode that has raw Html in it and have that ultimately rendered without any extra Html escaping.

My renderer is ultimately a Heist snippet, but that probably means it runs by way of Blaze.

My final alternative, which might work, is to go from Pandoc through the Blaze Html renderer and then through the XmlHtml parser to get something that I can embed into a Heist snippet. I'd just like to avoid that because it feels dirty.

(I think I would actually run into the same problem if I wanted to put JavaScript into my Markdown documents... which is technically allowed by the language but probably very evil.)

Is there a way to do this, or am I too limited by my tools?

Update

I tried the route of rendering from Pandoc to Blaze to XmlHtml. It turns out that I get the same result: the HTML is put into the final nodes in escaped form and thus appears as literal markup in the browser. Here is my function (which was much shorter and easier than the full implementation I'd done...)

pandocToHtml :: Pandoc.Pandoc -> [XmlHtml.Node]
pandocToHtml = Text.Blaze.Renderer.XmlHtml.renderHtmlNodes . Pandoc.writeHtml Pandoc.def

Pandoc.def includes all of the "allow_raw_*" extensions, including allow_raw_html.

The final thing I can think of is to write my own piecemeal HTML parser (and then maybe contribute it to Pandoc), which, in the end, couldn't be horribly hard.

Solution

The only ways to do this are to either construct the nodes yourself like this:

Element "a" [("href", "abcdefg")] [Element "img" [("src", "image.png")]]

...or run your markup through the parser. This is by design. The contents of a TextNode will always be escaped. XmlHtml is not designed for Pandoc-style Markdown. It is designed for XML and HTML, so you have to get your documents into that format first. It seems to me that you should be able to use Pandoc to render the Markdown to HTML and then run the XmlHtml parser on that.
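As a rough illustration of the "run it through the parser" route, the sketch below concatenates the raw fragments from the question and hands them to XmlHtml's parseHTML. The function name fragmentsToNodes and the source label "raw-inline" are made up for this example, and the input is assumed to already be available as Text and to be well-formed once concatenated.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Text.XmlHtml (Document (..), Node, parseHTML)

-- Hypothetical helper: glue raw HTML fragments together and parse the
-- result into XmlHtml nodes. The first argument to parseHTML is only a
-- label used in error messages.
fragmentsToNodes :: [T.Text] -> Either String [Node]
fragmentsToNodes =
  fmap docContent . parseHTML "raw-inline" . TE.encodeUtf8 . T.concat

main :: IO ()
main =
  -- The three strings are the RawInline payloads from the question.
  print $
    fragmentsToNodes
      [ "<a href=\"abcdefg\">"
      , "<img src=\"image.png\" />"
      , "</a>"
      ]
```

Because the fragments are parsed rather than wrapped in a TextNode, the result is real Element nodes, so nothing is escaped at render time. A Left value carries the parser's error message if the concatenated fragments do not form valid markup.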

XmlHtml does have a mechanism for interpreting certain portions of a document as raw text (as required by the HTML spec). You can see which tags are interpreted as raw text here. In 0.2.2 we updated XmlHtml to give the user more control over what gets treated as raw text. If you want a node to be treated as raw, just add the xmlhtmlRaw attribute to the tag. If you want a node that gets treated as raw by default to not be treated as raw, then add the xmlhtmlNotRaw attribute.
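A minimal sketch of the xmlhtmlRaw route follows, assuming the attribute behaves as described above: a TextNode inside an element carrying xmlhtmlRaw is written out without escaping. The wrapper tag span and the helper name rawSpan are arbitrary choices for illustration. The Builder import also assumes an xmlhtml version whose renderer returns a bytestring Builder; older releases use the Builder from blaze-builder, in which case its toLazyByteString would be needed instead.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Text (Text)
import Text.XmlHtml (Encoding (UTF8), Node (..), renderHtmlFragment)

-- Hypothetical helper: wrap raw HTML in an element marked xmlhtmlRaw so
-- that the renderer treats the enclosed text as raw rather than escaping it.
rawSpan :: Text -> Node
rawSpan html = Element "span" [("xmlhtmlRaw", "")] [TextNode html]

main :: IO ()
main =
  L.putStrLn . B.toLazyByteString $
    renderHtmlFragment
      UTF8
      [rawSpan "<a href=\"abcdefg\"><img src=\"image.png\" /></a>"]
```

The trade-off is an extra wrapper element in the output, and the wrapper tag should be one that does not occur in the raw text, since raw text cannot contain its own closing tag.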

I'm not sure why renderHtmlNodes . writeHtml def didn't work for you. It seems like that should work. If it didn't, I think Pandoc's writeHtml may be buggy. Since that didn't work, you might try parseHTML . writeHtmlString (pseudocode; the String that Pandoc produces has to be converted to a ByteString before XmlHtml can parse it).
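The pseudocode can be fleshed out roughly as follows. This is a sketch against the Pandoc 1.12 API mentioned in the question, where writeHtmlString is a pure function returning a String; later Pandoc releases changed the writer functions to return Text inside a monad, so the composition would need adjusting there. The name pandocToNodes and the label "pandoc-output" are made up for this example.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Text.Pandoc as Pandoc
import qualified Text.XmlHtml as X

-- Hypothetical helper: render the Pandoc document to an HTML string,
-- encode it as UTF-8, and parse it back into XmlHtml nodes.
-- Assumes Pandoc 1.12, where
--   writeHtmlString :: WriterOptions -> Pandoc -> String
pandocToNodes :: Pandoc.Pandoc -> Either String [X.Node]
pandocToNodes =
  fmap X.docContent
    . X.parseHTML "pandoc-output"
    . TE.encodeUtf8
    . T.pack
    . Pandoc.writeHtmlString Pandoc.def
```

Going through a string costs a render and a re-parse, but any raw HTML that Pandoc passes through arrives at the XmlHtml parser as ordinary markup and so ends up as real elements rather than escaped text.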

Licensed under: CC-BY-SA with attribution
