Question

I have a value serialized by PHP that I need to decode in Clojure. I'm using this library to deserialize it; it uses Instaparse, which defines the grammar in EBNF/ABNF notation. For reference, here's the full definition:

<S> = expr
<expr> = (string | integer | double | boolean | null | array)+
<digit> = #'[0-9]'
<number> = negative* (decimal-num | integer-num)
<negative> = '-'
<integer-num> = digit+
<decimal-num> = integer-num '.' integer-num
<zero-or-one> = '0'|'1'
size = digit+
key = (string | integer)
<val> = expr
array = <'a:'> <size> <':{'> (key val)+ <'}'> <';'>?
boolean = <'b:'> zero-or-one <';'>
null = <'N;'>
integer = <'i:'> number <';'>
double = <'d:'> number <';'>
string = <'s:'> <size> <':\\\"'> #'([^\"]|\\.)*' <'\\\";'>

I've found a bug in this library - it can't handle serialized strings that contain the " character.

php > echo serialize('{"key":"value"}');
s:15:"{"key":"value"}";

Deserialized using the library, it blows up when it finds that second ":

> (deserialize-php "s:15:\"{\"key\":\"value\"}\";")
[:index 7]

The problem exists on this line of the grammar definition:

string = <'s:'> <size> <':\\\"'> #'([^\"]|\\.)*' <'\\\";'>

You'll notice that the string definition excludes the " character. That's not correct, though: the string can contain any character, and the size is what determines where it ends. I'm not a BNF expert, so I'm trying to figure out what my options are.
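
A quick REPL check makes that diagnosis concrete. This is only a sketch: it assumes the rule's body pattern behaves like the plain Clojure regex below, and the names body-pattern and payload are made up for illustration.

```clojure
;; The body pattern from the string rule, written as a plain Clojure regex.
(def body-pattern #"([^\"]|\.)*")

;; The 15-character payload that PHP serialized.
(def payload "{\"key\":\"value\"}")

;; The match stops just before the first double quote in the payload.
(def matched (first (re-find body-pattern payload)))

(println (count payload)) ; prints 15
(println matched)         ; prints {
```

The pattern consumes a single character, while the size prefix says the string is 15 characters long.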

Is it possible to use the size as the correct number of characters to grab? If that's not possible, does someone see a way I can tweak the grammar definition to enable correct parsing?

Solution

As Arthur Ulfeldt points out, this grammar is not context-free because of the length-prefixed (bencode-style) strings. It is still simple to parse, just not with ABNF/EBNF. For example, using Parse-EZ instead:

A convenience macro:

(defmacro tagged-sphp-expr [tag parser] 
  `(fn [] (between #(string ~(str tag ":")) #(~parser) #(string ";"))))

The rest:

(def sphp-integer (tagged-sphp-expr "i" integer))

(def sphp-decimal (tagged-sphp-expr "d" decimal))

(defn sphp-boolean [] 
  (= \1 ((tagged-sphp-expr "b" #(chr-in "01")))))

(defn sphp-null [] (string "N;") :null)

(defn sphp-string []
  (let [tag (string "s:")
        size (integer)
        open (no-trim #(string ":\""))
        contents (read-n size)
        close (string "\";")]
    contents))

(declare sphp-array)

(defn sphp-expr [] 
  (any #(sphp-integer) #(sphp-decimal) #(sphp-boolean) #(sphp-null) #(sphp-string) #(sphp-array)))

(defn sphp-key [] 
  (any #(sphp-string) #(sphp-integer)))

(defn sphp-kv-pair [] 
  (apply array-map (series #(sphp-key) #(sphp-expr))))

(defn sphp-array []
  (let [size (between #(string "a:") #(integer) #(string ":{"))
        contents (times size sphp-kv-pair)] 
    (chr \})
    (attempt #(chr \;))
    contents))

The test:

(def test-str "i:1;d:2;s:16:\"{\"key\": \"value\"}\";a:2:{s:3:\"php\";s:3:\"sux\";s:3:\"clj\";s:3:\"rox\";};b:1;")

(println test-str)
;=> i:1;d:2;s:16:"{"key": "value"}";a:2:{s:3:"php";s:3:"sux";s:3:"clj";s:3:"rox";};b:1;

(parse #(multi* sphp-expr) test-str)
;=> [1 2.0 "{\"key\": \"value\"}" [{"php" "sux"} {"clj" "rox"}] true]

OTHER TIPS

I'm reasonably sure you can't write that with just an EBNF parser because, as far as I understand it, this grammar is not context-free.

I think the closest you could come in a context-free grammar is to explicitly enumerate all of the expected length prefixes - something along the lines of the ABNF:

 string = 's:0:"";' /
          's:1:"' CHAR '";' /
          's:2:"' 2CHAR '";' /
          's:3:"' 3CHAR '";' / ...

This might work reasonably well if the length of your strings is bounded, but obviously won't work for arbitrarily sized strings.
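
To get a feel for how the enumerated rule grows with the bound, a small sketch can generate the alternatives. The function name bounded-string-rule is hypothetical, and the output mimics the ABNF fragment above rather than the syntax of any particular parser library.

```clojure
(require '[clojure.string :as str])

(defn bounded-string-rule
  "Builds an ABNF-style rule with one alternative per length from 0 to max-len."
  [max-len]
  (str "string = "
       (str/join " /\n         "
                 (for [n (range (inc max-len))]
                   (case n
                     0 "'s:0:\"\";'"
                     1 "'s:1:\"' CHAR '\";'"
                     (str "'s:" n ":\"' " n "CHAR '\";'"))))))

(println (bounded-string-rule 3))
;; string = 's:0:"";' /
;;          's:1:"' CHAR '";' /
;;          's:2:"' 2CHAR '";' /
;;          's:3:"' 3CHAR '";'
```

The rule needs one alternative per permitted length, so a bound of a few thousand characters already means a few thousand alternatives.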

Otherwise, to handle arbitrary-length strings correctly, your best option is probably to parse by hand. Fortunately, for a grammar of this size, that shouldn't be too difficult a task.
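
As a rough illustration of what parsing by hand could look like, here is a minimal sketch in plain Clojure. All function names (read-until, expect, parse-value, deserialize and so on) are hypothetical and not taken from any library. It assumes single-byte (ASCII) input, because PHP's size prefix counts bytes while subs counts characters, and it skips the object and reference types that PHP can also emit.

```clojure
(require '[clojure.string :as str])

(declare parse-value)

(defn- read-until
  "Returns [text next-index]: the text from index i up to delim, and the index just past delim."
  [s i delim]
  (if-let [end (str/index-of s delim i)]
    [(subs s i end) (+ end (count delim))]
    (throw (ex-info "Missing delimiter" {:index i :expected delim}))))

(defn- expect
  "Checks that the literal text occurs in s at index i; returns the index after it."
  [^String s i ^String text]
  (if (.startsWith s text (int i))
    (+ i (count text))
    (throw (ex-info "Unexpected input" {:index i :expected text}))))

(defn- parse-string
  "Parses the part after \"s:\". The size prefix, not the quotes, decides where the string ends."
  [s i]
  (let [[size i] (read-until s i ":")
        start    (expect s i "\"")
        end      (+ start (Long/parseLong size))
        after    (expect s end "\";")]
    [(subs s start end) after]))

(defn- parse-array
  "Parses the part after \"a:\" into a map (ordering is not guaranteed for larger arrays)."
  [^String s i]
  (let [[size i] (read-until s i ":")
        i        (expect s i "{")]
    (loop [m {} i i remaining (Long/parseLong size)]
      (if (zero? remaining)
        (let [j (expect s i "}")]
          ;; Mirror the original grammar, which tolerates an optional ';' after '}'.
          [m (if (.startsWith s ";" (int j)) (inc j) j)])
        (let [[k i] (parse-value s i)
              [v i] (parse-value s i)]
          (recur (assoc m k v) i (dec remaining)))))))

(defn parse-value
  "Parses one serialized value starting at index i; returns [value next-index]."
  [s i]
  (let [tag (subs s i (min (count s) (+ i 2)))]
    (case tag
      "N;" [nil (+ i 2)]
      "b:" (let [[v j] (read-until s (+ i 2) ";")] [(= v "1") j])
      "i:" (let [[v j] (read-until s (+ i 2) ";")] [(Long/parseLong v) j])
      "d:" (let [[v j] (read-until s (+ i 2) ";")] [(Double/parseDouble v) j])
      "s:" (parse-string s (+ i 2))
      "a:" (parse-array s (+ i 2))
      (throw (ex-info "Unknown type tag" {:index i :tag tag})))))

(defn deserialize
  "Parses every value in s and returns them as a vector."
  [s]
  (loop [i 0 out []]
    (if (>= i (count s))
      out
      (let [[v j] (parse-value s i)]
        (recur j (conj out v))))))

(deserialize "s:15:\"{\"key\":\"value\"}\";")
;=> ["{\"key\":\"value\"}"]
```

Every parsing function returns the value together with the index where parsing should resume, so the string case can jump ahead by exactly the declared size without inspecting the characters in between.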

Licensed under: CC-BY-SA with attribution