문제

New Library: XParsec

This question has lead to a stream-and-type-independent, non-linear, extensible parsec implementation in F# 3.0 - inspired by FParsec, freed from Chars and linear Streams and simplified: http://corsis.github.com/XParsec/


Pattern

1 = < font=?'Bold' bbox=F'l ..' s      >                       ;       < ~s >*
2 = < font=!'Bold' bbox=F'l ..' s=?'(' > | [ 1.l < 2.l ]       ;       < ~s >*
3 = < font=!'Bold' bbox=F'l ..' s=?')' > | [ 1.l < 3.l ]

where

element names are left unspecified
font, bbox and s are attributes

V = string, N = string

?  :: V -> bool                -- value          contains string
!  :: V -> bool = not . (?)    -- value does not contain  string
~  :: N -> bool                -- value of attribute N is empty or whitespace
F  :: V -> [(N, float)]        -- extracts a list of named floats from value
RM :: V -> bool                -- value matches regular expression
[] :: [bool]                   -- list of conditions

Code

open System.Xml.Linq
open System.Collections.Generic

let inline (-?-)  a b = (a : string).Contains b
let inline (~~) s = s |> String.IsNullOrWhiteSpace
let inline (!>) x = ( ^a : (static member op_Implicit : ^b -> ^a) x )

let inline (@)  (x : XElement) n   = let a = x.Attribute(!> n) in if a <> null then a.Value else String.Empty
let inline (@<) (x : XElement) n v = x.SetAttributeValue(!> n, v)

type XE = XElement IEnumerator

let inline bbox e    = (e @ "bbox") |> fun s -> s.Split [| ' ' |] |> Seq.map float |> Seq.toList
let inline left bbox = match bbox with l::_ -> l | _ -> nan

let mark n = let id = Guid.NewGuid() in Seq.iter <| fun e -> e @< "class-" + n <| id

let speaker  (n : XE) =
  let  c1 = n.Current
  if   c1 @? "font" <| "Bold"
  then let  l1 = c1 |> bbox |> left
       while n.MoveNext() && ~~(n.Current @ "s") do ()
       let  c2 = n.Current
       if  (c2 @ "font") -?- "Bold" |> not
       then let  l2 = c2 |> bbox |> left
            if   l1 < l2
            then let  s2 = c2 @ "s"
                 if        s2 -?- "("
                 then if   s2 -?- ")"
                      then [c1; c2] |> mark "speaker"
                      while n.MoveNext() && ~~(n.Current @ "s") do ()
                      let  c3 = n.Current
                      if  (c3 @ "font") -?- "Bold" |> not
                      then let  l3 = c3 |> bbox |> left
                           if   l1 < l3
                           then if   (c3 @ "s") -?- ")"
                                then [c1; c2; c3] |> mark "speaker"  


let test (x : XElement) =
  let spans = x.Descendants(!> "span") |> Seq.toArray
  for i = 29 to spans.Length - 1 do
    let n = (spans |> Seq.skip i).GetEnumerator()
    n.MoveNext() |> ignore
    speaker n

Input

<doc>
<block bbox="63.2999 550.846 246.865 561.875">
  <line bbox="63.2999 550.846 246.865 561.875">
    <span bbox="63.2999 550.846 189.001 561.875" font="TimesNewRoman,Bold" size="9.96" s="Dr. Frank-Walter Steinmeier " />
    <span bbox="189 550.846 246.865 561.875" font="TimesNewRoman" size="9.96" s="(SPD)  . . . . . ." />
  </line>
</block>
<block bbox="63.2999 567.766 246.875 578.796">
  <line bbox="63.2999 567.766 246.875 578.796">
    <span bbox="63.2999 567.766 136.004 578.796" font="TimesNewRoman,Bold" size="9.96" s="Rainer Brüderle " />
    <span bbox="136.02 567.766 246.875 578.796" font="TimesNewRoman" size="9.96" s="(FDP) . . . . . . . . . . . . . . . . ." />
  </line>
</block>
<block bbox="63.2999 584.626 250.351 651.456">
  <line bbox="63.2999 584.626 246.826 595.656">
    <span bbox="63.2999 584.626 152.105 595.656" font="TimesNewRoman,Bold" size="9.96" s="Sahra Wagenknecht " />
    <span bbox="152.16 584.626 246.826 595.656" font="TimesNewRoman" size="9.96" s="(DIE LINKE)  . . . . . . ." />
  </line>
  <line bbox="63.2999 600.362 250.351 613.34">
    <span bbox="63.2999 601.546 139.327 612.576" font="TimesNewRoman,Bold" size="9.96" s="Siegfried Kauder " />
    <span bbox="139.38 601.546 247.762 612.576" font="TimesNewRoman" size="9.96" s="(Villingen-Schwenningen) " />
    <span bbox="247.861 600.362 250.351 613.34" font="Symbol" size="9.96" s=" " />
  </line>
  <line bbox="74.6404 612.526 246.911 623.556">
    <span bbox="74.6404 612.526 246.911 623.556" font="TimesNewRoman" size="9.96" s="(CDU/CSU) . . . . . . . . . . . . . . . . . . . . . . . ." />
  </line>
  <line bbox="63.2999 628.202 191.909 641.18">
    <span bbox="63.2999 629.386 126.374 640.416" font="TimesNewRoman,Bold" size="9.96" s="Jürgen Trittin " />
    <span bbox="126.419 629.386 189.433 640.416" font="TimesNewRoman" size="9.96" s="(BÜNDNIS 90/" />
    <span bbox="189.419 628.202 191.909 641.18" font="Symbol" size="9.96" s=" " />
  </line>
  <line bbox="74.6394 640.426 246.813 651.456">
    <span bbox="74.6394 640.426 246.813 651.456" font="TimesNewRoman" size="9.96" s="DIE GRÜNEN)  . . . . . . . . . . . . . . . . . . . . ." />
  </line>
</block>
</doc>

Output

<doc>
<block>
  <line>
    <span font="TimesNewRoman,Bold" size="9.96" s="Dr. Frank-Walter Steinmeier " class-speaker="1f2e4dca-80d5-4c5e-91b6-6bd2e4a8acaf" />
    <span font="TimesNewRoman" size="9.96" s="(SPD)  . . . . . ." class-speaker="1f2e4dca-80d5-4c5e-91b6-6bd2e4a8acaf" />
  </line>
</block>
<block>
  <line>
    <span font="TimesNewRoman,Bold" size="9.96" s="Rainer Brüderle " class-speaker="eaa75d02-0ac6-4480-bcbe-f17bddfe6e81" />
    <span font="TimesNewRoman" size="9.96" s="(FDP) . . . . . . . . . . . . . . . . ." class-speaker="eaa75d02-0ac6-4480-bcbe-f17bddfe6e81" />
  </line>
</block>
<block>
  <line>
    <span font="TimesNewRoman,Bold" size="9.96" s="Sahra Wagenknecht " class-speaker="6b193f23-9b8b-4b37-9118-d8488fba25a2" />
    <span font="TimesNewRoman" size="9.96" s="(DIE LINKE)  . . . . . . ." class-speaker="6b193f23-9b8b-4b37-9118-d8488fba25a2" />
  </line>
  <line>
    <span font="TimesNewRoman,Bold" size="9.96" s="Siegfried Kauder " class-speaker="a0162e4e-1167-412a-ac11-ac13ef1aa46e" />
    <span font="TimesNewRoman" size="9.96" s="(Villingen-Schwenningen) " class-speaker="a0162e4e-1167-412a-ac11-ac13ef1aa46e" />
    <span font="Symbol" size="9.96" s=" " />
  </line>
  <line>
    <span font="TimesNewRoman" size="9.96" s="(CDU/CSU) . . . . . . . . . . . . . . . . . . . . . . . ." class-speaker="a0162e4e-1167-412a-ac11-ac13ef1aa46e" />
  </line>
  <line>
    <span font="TimesNewRoman,Bold" size="9.96" s="Jürgen Trittin " class-speaker="81fd6735-c57f-464b-a08f-7e7cb3bccfa8" />
    <span font="TimesNewRoman" size="9.96" s="(BÜNDNIS 90/" class-speaker="81fd6735-c57f-464b-a08f-7e7cb3bccfa8" />
    <span font="Symbol" size="9.96" s=" " />
  </line>
  <line>
    <span font="TimesNewRoman" size="9.96" s="DIE GRÜNEN)  . . . . . . . . . . . . . . . . . . . . ." class-speaker="81fd6735-c57f-464b-a08f-7e7cb3bccfa8" />
  </line>
</block>
</doc>

Question

To automatically go from concise pattern declarations to running code, I am thinking of doing the following:

  • parse pattern declarations using FParsec to an AST
  • evaluate the AST

but before I do anything, I would like to know:

  1. can anyone write an (applicative) EDSL (/ parts thereof) for declaring code using F# functions and composition directly without resorting to ASTs?
  2. is there a library capable of similar pattern matches on XML?
  3. does anyone have any comments regarding my approach?
도움이 되었습니까?

해결책

I'm not entirely sure what the question is. Do you want to work with some existing pattern matching language for XML that already exists (as in your example), or are you asking, in general, about XML processing?

LINQ to XML As said by Jack P., probably the best option for XML processing in F# is to just use standard Seq functions. This gets a bit longer, but it is very readable.

XPath Another alternative is to use XPath, which is very concise (and completely standard) way of selecting nodes in XML. For example, to select a node with a specified value of an attribute, you can write:

#r "System.Xml.Linq"
open System.Xml.Linq
open System.Xml.XPath

let doc = XDocument.Parse("<r><a n=\"f\">foo</a><a n=\"b\">bar</a></r>")
doc.XPathSelectElement("//a[@*='f']").Value

Domain-specific languages If you want to write your own concise language, then you do not necessarily go with a custom parser etc. You can get quite far with just using F# operators in a clever way. This F# snippet is a great example. You can write things like:

printfn "Matches: %A" (!/foo/bar/(baz@=true)/quux/= x)
printfn "Doesn't Match: %A" (!/foo/bar/(baz@=false)/quux/= x)
printfn "Values: %A" (!/foo/bar/quux/(baz@=42)/! x)

Note that the patterns are type checked by the F# compiler.

다른 팁

IMO, the best way you could implement something like this would be to layer it on top of XLinq (LINQ-to-XML). It'd likely be simpler to code, easier to maintain, and provide equivalent (if not better) performance compared to implementing all of the pattern-matching logic "manually".

You could still use your DSL, mind you, to define the patterns to be matched. Basically, you'd use FParsec to parse your DSL into an AST, then traverse the AST and transform it into an equivalent LINQ query (see the System.Linq.Expressions namespace). Once you had the LINQ expression representing your query, you'd apply it to any number of XDocuments to execute the pattern-match.

You might also be interested in reading Erik Meijer's paper, XLinq: XML Programming Refactored (The Return Of The Monoids), which talks about XML programming from a functional design perspective.

To question 2:

I don't think I understand the F# code, but I wrote a xml pattern matching library in Pascal (online example, cli example), which you might want to look at. (although I call the patterns "templates", and it only selects the xml nodes, not modifies them).

With my templates, this

<doc>
<block>
  <line>
    <span>{.}</span>*
  </line>*
</block>*
</doc>

would match all your spans in the input. (as would <span>{.}</span>*)

Or another example:

<doc>
<block>
  <line>
    <span font="TimesNewRoman,Bold">{@s}</span>*
  </line>*
</block>*
</doc>

would match the attributes containing the speaker's name.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top