Pattern Matching XML in F#
-
01-07-2021 - |
문제
New Library: XParsec
This question has lead to a stream-and-type-independent, non-linear, extensible parsec implementation in F# 3.0 - inspired by FParsec, freed from Chars and linear Streams and simplified: http://corsis.github.com/XParsec/
Pattern
1 = < font=?'Bold' bbox=F'l ..' s > ; < ~s >*
2 = < font=!'Bold' bbox=F'l ..' s=?'(' > | [ 1.l < 2.l ] ; < ~s >*
3 = < font=!'Bold' bbox=F'l ..' s=?')' > | [ 1.l < 3.l ]
where
element names are left unspecified
font, bbox and s are attributes
V = string, N = string
? :: V -> bool -- value contains string
! :: V -> bool = not . (?) -- value does not contain string
~ :: N -> bool -- value of attribute N is empty or whitespace
F :: V -> [(N, float)] -- extracts a list of named floats from value
RM :: V -> bool -- value matches regular expression
[] :: [bool] -- list of conditions
Code
open System.Xml.Linq
open System.Collections.Generic
let inline (-?-) a b = (a : string).Contains b
let inline (~~) s = s |> String.IsNullOrWhiteSpace
let inline (!>) x = ( ^a : (static member op_Implicit : ^b -> ^a) x )
let inline (@) (x : XElement) n = let a = x.Attribute(!> n) in if a <> null then a.Value else String.Empty
let inline (@<) (x : XElement) n v = x.SetAttributeValue(!> n, v)
type XE = XElement IEnumerator
let inline bbox e = (e @ "bbox") |> fun s -> s.Split [| ' ' |] |> Seq.map float |> Seq.toList
let inline left bbox = match bbox with l::_ -> l | _ -> nan
let mark n = let id = Guid.NewGuid() in Seq.iter <| fun e -> e @< "class-" + n <| id
let speaker (n : XE) =
let c1 = n.Current
if c1 @? "font" <| "Bold"
then let l1 = c1 |> bbox |> left
while n.MoveNext() && ~~(n.Current @ "s") do ()
let c2 = n.Current
if (c2 @ "font") -?- "Bold" |> not
then let l2 = c2 |> bbox |> left
if l1 < l2
then let s2 = c2 @ "s"
if s2 -?- "("
then if s2 -?- ")"
then [c1; c2] |> mark "speaker"
while n.MoveNext() && ~~(n.Current @ "s") do ()
let c3 = n.Current
if (c3 @ "font") -?- "Bold" |> not
then let l3 = c3 |> bbox |> left
if l1 < l3
then if (c3 @ "s") -?- ")"
then [c1; c2; c3] |> mark "speaker"
let test (x : XElement) =
let spans = x.Descendants(!> "span") |> Seq.toArray
for i = 29 to spans.Length - 1 do
let n = (spans |> Seq.skip i).GetEnumerator()
n.MoveNext() |> ignore
speaker n
Input
<doc>
<block bbox="63.2999 550.846 246.865 561.875">
<line bbox="63.2999 550.846 246.865 561.875">
<span bbox="63.2999 550.846 189.001 561.875" font="TimesNewRoman,Bold" size="9.96" s="Dr. Frank-Walter Steinmeier " />
<span bbox="189 550.846 246.865 561.875" font="TimesNewRoman" size="9.96" s="(SPD) . . . . . ." />
</line>
</block>
<block bbox="63.2999 567.766 246.875 578.796">
<line bbox="63.2999 567.766 246.875 578.796">
<span bbox="63.2999 567.766 136.004 578.796" font="TimesNewRoman,Bold" size="9.96" s="Rainer Brüderle " />
<span bbox="136.02 567.766 246.875 578.796" font="TimesNewRoman" size="9.96" s="(FDP) . . . . . . . . . . . . . . . . ." />
</line>
</block>
<block bbox="63.2999 584.626 250.351 651.456">
<line bbox="63.2999 584.626 246.826 595.656">
<span bbox="63.2999 584.626 152.105 595.656" font="TimesNewRoman,Bold" size="9.96" s="Sahra Wagenknecht " />
<span bbox="152.16 584.626 246.826 595.656" font="TimesNewRoman" size="9.96" s="(DIE LINKE) . . . . . . ." />
</line>
<line bbox="63.2999 600.362 250.351 613.34">
<span bbox="63.2999 601.546 139.327 612.576" font="TimesNewRoman,Bold" size="9.96" s="Siegfried Kauder " />
<span bbox="139.38 601.546 247.762 612.576" font="TimesNewRoman" size="9.96" s="(Villingen-Schwenningen) " />
<span bbox="247.861 600.362 250.351 613.34" font="Symbol" size="9.96" s=" " />
</line>
<line bbox="74.6404 612.526 246.911 623.556">
<span bbox="74.6404 612.526 246.911 623.556" font="TimesNewRoman" size="9.96" s="(CDU/CSU) . . . . . . . . . . . . . . . . . . . . . . . ." />
</line>
<line bbox="63.2999 628.202 191.909 641.18">
<span bbox="63.2999 629.386 126.374 640.416" font="TimesNewRoman,Bold" size="9.96" s="Jürgen Trittin " />
<span bbox="126.419 629.386 189.433 640.416" font="TimesNewRoman" size="9.96" s="(BÜNDNIS 90/" />
<span bbox="189.419 628.202 191.909 641.18" font="Symbol" size="9.96" s=" " />
</line>
<line bbox="74.6394 640.426 246.813 651.456">
<span bbox="74.6394 640.426 246.813 651.456" font="TimesNewRoman" size="9.96" s="DIE GRÜNEN) . . . . . . . . . . . . . . . . . . . . ." />
</line>
</block>
</doc>
Output
<doc>
<block>
<line>
<span font="TimesNewRoman,Bold" size="9.96" s="Dr. Frank-Walter Steinmeier " class-speaker="1f2e4dca-80d5-4c5e-91b6-6bd2e4a8acaf" />
<span font="TimesNewRoman" size="9.96" s="(SPD) . . . . . ." class-speaker="1f2e4dca-80d5-4c5e-91b6-6bd2e4a8acaf" />
</line>
</block>
<block>
<line>
<span font="TimesNewRoman,Bold" size="9.96" s="Rainer Brüderle " class-speaker="eaa75d02-0ac6-4480-bcbe-f17bddfe6e81" />
<span font="TimesNewRoman" size="9.96" s="(FDP) . . . . . . . . . . . . . . . . ." class-speaker="eaa75d02-0ac6-4480-bcbe-f17bddfe6e81" />
</line>
</block>
<block>
<line>
<span font="TimesNewRoman,Bold" size="9.96" s="Sahra Wagenknecht " class-speaker="6b193f23-9b8b-4b37-9118-d8488fba25a2" />
<span font="TimesNewRoman" size="9.96" s="(DIE LINKE) . . . . . . ." class-speaker="6b193f23-9b8b-4b37-9118-d8488fba25a2" />
</line>
<line>
<span font="TimesNewRoman,Bold" size="9.96" s="Siegfried Kauder " class-speaker="a0162e4e-1167-412a-ac11-ac13ef1aa46e" />
<span font="TimesNewRoman" size="9.96" s="(Villingen-Schwenningen) " class-speaker="a0162e4e-1167-412a-ac11-ac13ef1aa46e" />
<span font="Symbol" size="9.96" s=" " />
</line>
<line>
<span font="TimesNewRoman" size="9.96" s="(CDU/CSU) . . . . . . . . . . . . . . . . . . . . . . . ." class-speaker="a0162e4e-1167-412a-ac11-ac13ef1aa46e" />
</line>
<line>
<span font="TimesNewRoman,Bold" size="9.96" s="Jürgen Trittin " class-speaker="81fd6735-c57f-464b-a08f-7e7cb3bccfa8" />
<span font="TimesNewRoman" size="9.96" s="(BÜNDNIS 90/" class-speaker="81fd6735-c57f-464b-a08f-7e7cb3bccfa8" />
<span font="Symbol" size="9.96" s=" " />
</line>
<line>
<span font="TimesNewRoman" size="9.96" s="DIE GRÜNEN) . . . . . . . . . . . . . . . . . . . . ." class-speaker="81fd6735-c57f-464b-a08f-7e7cb3bccfa8" />
</line>
</block>
</doc>
Question
To automatically go from concise pattern declarations to running code, I am thinking of doing the following:
- parse pattern declarations using FParsec to an AST
- evaluate the AST
but before I do anything, I would like to know:
- can anyone write an (applicative) EDSL (/ parts thereof) for declaring code using F# functions and composition directly without resorting to ASTs?
- is there a library capable of similar pattern matches on XML?
- does anyone have any comments regarding my approach?
해결책
I'm not entirely sure what the question is. Do you want to work with some existing pattern matching language for XML that already exists (as in your example), or are you asking, in general, about XML processing?
LINQ to XML As said by Jack P., probably the best option for XML processing in F# is to just use standard Seq
functions. This gets a bit longer, but it is very readable.
XPath Another alternative is to use XPath, which is very concise (and completely standard) way of selecting nodes in XML. For example, to select a node with a specified value of an attribute, you can write:
#r "System.Xml.Linq"
open System.Xml.Linq
open System.Xml.XPath
let doc = XDocument.Parse("<r><a n=\"f\">foo</a><a n=\"b\">bar</a></r>")
doc.XPathSelectElement("//a[@*='f']").Value
Domain-specific languages If you want to write your own concise language, then you do not necessarily go with a custom parser etc. You can get quite far with just using F# operators in a clever way. This F# snippet is a great example. You can write things like:
printfn "Matches: %A" (!/foo/bar/(baz@=true)/quux/= x)
printfn "Doesn't Match: %A" (!/foo/bar/(baz@=false)/quux/= x)
printfn "Values: %A" (!/foo/bar/quux/(baz@=42)/! x)
Note that the patterns are type checked by the F# compiler.
다른 팁
IMO, the best way you could implement something like this would be to layer it on top of XLinq (LINQ-to-XML). It'd likely be simpler to code, easier to maintain, and provide equivalent (if not better) performance compared to implementing all of the pattern-matching logic "manually".
You could still use your DSL, mind you, to define the patterns to be matched. Basically, you'd use FParsec to parse your DSL into an AST, then traverse the AST and transform it into an equivalent LINQ query (see the System.Linq.Expressions namespace
). Once you had the LINQ expression representing your query, you'd apply it to any number of XDocument
s to execute the pattern-match.
You might also be interested in reading Erik Meijer's paper, XLinq: XML Programming Refactored (The Return Of The Monoids), which talks about XML programming from a functional design perspective.
To question 2:
I don't think I understand the F# code, but I wrote a xml pattern matching library in Pascal (online example, cli example), which you might want to look at. (although I call the patterns "templates", and it only selects the xml nodes, not modifies them).
With my templates, this
<doc>
<block>
<line>
<span>{.}</span>*
</line>*
</block>*
</doc>
would match all your spans in the input. (as would <span>{.}</span>*
)
Or another example:
<doc>
<block>
<line>
<span font="TimesNewRoman,Bold">{@s}</span>*
</line>*
</block>*
</doc>
would match the attributes containing the speaker's name.