Question

In Haskell, its straightforward to create a datatype for for a recursive tree, like what we have with XML documents:

data XML = 
    Text String       -- Text of text node
  | Elem String [XML] -- Tagname; Child nodes

and its related folds:

-- Simple fold (Child trees don't see the surrounding context.)
foldt :: (String -> a) -> (String -> [a] -> a) -> XML -> a
foldt fT fE (Text text) = fT text
foldt fT fE (Elem tagname ts) = fE tagname (map (foldt fT fE) ts)

-- Threaded fold for streaming applications.
-- See http://okmij.org/ftp/papers/XML-parsing.ps.gz
foldts :: (a -> String -> a) -> (a -> String -> a) ->  (a -> a -> String -> a) -> a -> XML -> a
foldts fT fE_begin fE_end = go
  where
    go seed (Text text) = fT seed text
    go seed (Elem tag children) = 
        let seed' = fE_begin seed tag in
        let seed'' = foldl go seed' children in
        fE_end seed seed'' tag

My problem now is that I don't know how to add some extra restrictions to my tree datatype in order to model HTML. In HTML each element node can only appear in the correct contexts and each element corresponds to a different context for its children. For example:

  • Void elements like img have an empty context model and are not allowed to have any children.
  • Elements with a Text content model, such as title, can only have text nodes as children (no nested tags allowed)
  • div elements cannot appear in a Phrasing context and therefore are not allowed to be descendants of span elements.

So my questions are:

  1. What would I have to do to model these restrictions in Haskell98? (I ask this because I guess the Haskell98 model should translate better to other programming languages)

    I imagine we might have to create lots of different datatypes for the different contexts but I don't know how to do this in a principled and clear manner. How can I do this without getting lost and what would the fold functions end up looking like?

  2. What would the model look like if we are allowed to use modern GHC features such as GADTs?

    I have a hunch GADTs might help push the restrictions into the typechecker, keeping the fold functions simple but I'm don't have much experience with them...

I don't need a 100% functioning solution, since that would obviously be beyond the scope of a Stack Overflow discussion. I just need enough to be able to get a better grasp of GADTs and things like and to be able to do the rest by myself.

Was it helpful?

Solution 2

This has been done by the OCsigen project, a web framework implemented in OCaml, that seeks to provide strong typing guarantee.

You can have a look at their Html5 interface on this documentation page. See for example the type of the img smart constructor (it's a mouthful!):

val img : 
  src:Xml.uri ->
  alt:Html5_types.text ->
  ([< `Accesskey
    | `Class
    | `Contenteditable
    | `Contextmenu
    | `Dir
    | `Draggable
    | `Height
    | `Hidden
    | `Id
    | `Ismap
    | `OnAbort
    | `OnBlur
    | `OnCanPlay
    | `OnCanPlayThrough
    | `OnChange
    | `OnClick
    | `OnContextMenu
    | `OnDblClick
    | `OnDrag
    | `OnDragEnd
    | `OnDragEnter
    | `OnDragLeave
    | `OnDragOver
    | `OnDragStart
    | `OnDrop
    | `OnDurationChange
    | `OnEmptied
    | `OnEnded
    | `OnError
    | `OnFocus
    | `OnFormChange
    | `OnFormInput
    | `OnInput
    | `OnInvalid
    | `OnKeyDown
    | `OnKeyPress
    | `OnKeyUp
    | `OnLoad
    | `OnLoadStart
    | `OnLoadedData
    | `OnLoadedMetaData
    | `OnMouseDown
    | `OnMouseMove
    | `OnMouseOut
    | `OnMouseOver
    | `OnMouseUp
    | `OnMouseWheel
    | `OnPause
    | `OnPlay
    | `OnPlaying
    | `OnProgress
    | `OnRateChange
    | `OnReadyStateChange
    | `OnScroll
    | `OnSeeked
    | `OnSeeking
    | `OnSelect
    | `OnShow
    | `OnStalled
    | `OnSubmit
    | `OnSuspend
    | `OnTimeUpdate
    | `OnVolumeChange
    | `OnWaiting
    | `Spellcheck
    | `Style_Attr
    | `Tabindex
    | `Title
    | `User_data
    | `Width
    | `XML_lang
    | `XMLns ],
   [> `Img ])
  nullary

(In OCaml's standard syntax, a type t with three type parameters a, b and c is written (a,b,c) t rather than t a b c).

The fact that <img> may have no child is encoded by the use of the "nullary" type here. The rest of the static information encodes which kinds of attributes may be used on this node.

The weird `Foo | `Bar | `Baz stuff is a so-called "polymorphic variant" (presented in eg. this article), a kind of extensible structural sum type using row polymorphism that is more or less unique to OCaml (while they would be useful to any programming language, as extensible records generalize the usual nominal records of OCaml and Haskell). Here there are mostly used as a form of type-level lists.

Other than that, that is a relatively classic use of phantom types, only lead to extreme sizes because of the sheer number of cases you have in the HTML spec. Phantom types are the precursors of GADT to enforce additional type abstraction in the ML world. In Haskell98, you would probably try to encode the same kind of type-level information using type-classes rather than directly type abstractions.

OTHER TIPS

This doesn't need GADTs (at least not yet). You just have to teach the compiler more information about your tree type.

data HTML
    = HTML HTMLHeader HTMLBody

data HTMLHeader
    = Header String

data HTMLBody
    = Body [HTMLContent]

data HTMLContent
    = Text HTMLText
    | Title HTMLText
    | Img  String
    | Elem String [HTML]

data HTMLText
    = Literal String
    | Bold String
    | Italic String
    | TextElems [HTMLText]

Now you get some invariants:

-- Must have header and body. titles can't contain images.

x = HTML
        (Header "TEST") $ Body [
             Title (Literal "Title")
            ,Text (Bold "Content")
        ]

A principled way to derive this tree would be to take it from a particular HTML grammar - e.g. the XML EBNF perhaps - http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-well-formed .

With GADTs some things can be encoded more efficiently, and you can write functions over your data types that can enforce stronger invariants.

As you start making more and more properties statically verifiable, it may become more complex to encode the invariants. That's when GADTs, type families and other extensions can start to help.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top