Question

I am trying to parse an FDF file using PHP and regex, but I just can't get my head around regex. I am stuck parsing the file to generate an array.

%FDF-1.2
%âãÏÓ
1 0 obj 
<<
/FDF 
<<
/Fields [
<<
/V (email@email.com)
/T (field_email)
>> 
<<
/V (John)
/T (field_name)
>> 
<<
/V ()
/T (field_reference)
>>]
>>
>>
endobj 
trailer

<<
/Root 1 0 R
>>
%%EOF

Current function (source: http://php.net/manual/en/ref.fdf.php)

function parse2($file) {
 if (!preg_match_all("/<<\s*\/V([^>]*)>>/x", $file,$out,PREG_SET_ORDER))
         return;
 for ($i=0;$i<count($out);$i++) {
         $pattern = "<<.*/V\s*(.*)\s*/T\s*(.*)\s*>>";
         $thing = $out[$i][1];
         if (eregi($pattern,$out[$i][0],$regs)) {
                 $key = $regs[2];
                 $val = $regs[1];
                 $key = preg_replace("/^\s*\(/","",$key);
                 $key = preg_replace("/\)$/","",$key);
                 $key = preg_replace("/\\\/","",$key);
                 $val = preg_replace("/^\s*\(/","",$val);
                 $val = preg_replace("/\)$/","",$val);
                 $matches[$key] = $val;
         }
 }
 return $matches;
}

Result:

Array
(
    [field_email)
    ] => email@email.com)

    [field_name)
    ] => John)

    [field_reference)
    ] => )

)

Why does it include the ) and the newline? I know this problem is trivial for someone who understands regular expressions, so any help would be appreciated.


Solution

Description

Your initial expression simply finds the entire block of text that represents each key and value pair. Then, in your clean-up section, you look for a closing paren that is immediately followed by the end of the string, \)$, but I'm sure there are additional characters (for example a carriage return or other whitespace) between the closing paren and the end of the string, so the replacement never matches and the paren stays in the result.
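To illustrate the point, here is a minimal sketch of the clean-up step in isolation. It assumes the captured value ends in a carriage return plus newline (CRLF line endings are a guess about the file, not something the question confirms), and the variable names are made up for the example.

```php
<?php
// Assumed captured value: the closing paren is followed by "\r\n".
$val = "John)\r\n";

// \)$ only matches when the paren sits at the very end of the string
// (or just before one final "\n"), so the "\r" in between blocks the match.
$unchanged = preg_replace('/\)$/', '', $val);

// Allowing trailing whitespace after the paren lets the clean-up succeed.
$cleaned = preg_replace('/\)\s*$/', '', $val);

var_dump($unchanged === $val); // the first replacement changed nothing
var_dump($cleaned);
```

Adding \s* before the $ would patch the original function, but it still leaves several separate clean-up passes.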

Instead, I'd handle all of this in one operation. This expression will:

  • find the field value
    • trim the surrounding parens off
    • and place it into capture group 1
  • find the name of the field
    • trim the field_ substring off
    • trim the surrounding parens off
    • and place it into capture group 2
  • require the options: case insensitive and multi-line

^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(field_([^)]*)\)
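As a rough sketch of how this expression could be used from PHP, the snippet below runs it with preg_match_all and builds the array the question asks for. The inline $fdf string is a shortened stand-in for the file contents, and the variable names are assumptions made for the example.

```php
<?php
// Shortened stand-in for the FDF file contents (in practice this would
// probably come from file_get_contents()).
$fdf = "/Fields [\n<<\n/V (email@email.com)\n/T (field_email)\n>> \n"
     . "<<\n/V (John)\n/T (field_name)\n>> \n"
     . "<<\n/V ()\n/T (field_reference)\n>>]\n";

// i = case insensitive, m = multi-line, so ^ anchors at the start of each line.
$pattern = '/^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(field_([^)]*)\)/im';

$fields = [];
if (preg_match_all($pattern, $fdf, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the value, $m[2] is the name without the field_ prefix.
        $fields[$m[2]] = $m[1];
    }
}

print_r($fields);
```

Because [^)]* stops at the first closing paren, this sketch assumes field values never contain a ) character themselves.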


Example


Sample Text

%FDF-1.2
%âãÏÓ
1 0 obj 
<<
/FDF 
<<
/Fields [
<<
/V (email@email.com)
/T (field_email)
>> 
<<
/V (John)
/T (field_name)
>> 
<<
/V ()
/T (field_reference)
>>]
>>
>>
endobj 
trailer

<<
/Root 1 0 R
>>
%%EOF

Matches

[0][0] = /V (email@email.com)
/T (field_email)
[0][1] = email@email.com
[0][2] = email

[1][0] = /V (John)
/T (field_name)
[1][1] = John
[1][2] = name

[2][0] = /V ()
/T (field_reference)
[2][1] = 
[2][2] = reference



Or

If you want to retain the field_ substring, you can simply remove it from the expression, like so:

^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(([^)]*)\)
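Wrapped in a function, this variant could replace the original parse2. The following is only a sketch: the function name parse_fdf_fields is hypothetical, and the sample string uses CRLF line endings purely to show that the expression tolerates them.

```php
<?php
// Hypothetical helper; the name is an assumption made for this example.
function parse_fdf_fields(string $fdf): array
{
    // Same expression as above, keeping the full field name in group 2.
    $pattern = '/^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(([^)]*)\)/im';

    $fields = [];
    if (preg_match_all($pattern, $fdf, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $fields[$m[2]] = $m[1]; // name => value
        }
    }
    return $fields;
}

// Sample input with CRLF line endings (an assumption about the real file).
$sample = "<<\r\n/V (John)\r\n/T (field_name)\r\n>> \r\n"
        . "<<\r\n/V ()\r\n/T (field_reference)\r\n>>]\r\n";

var_dump(parse_fdf_fields($sample));
```

Returning an empty array when nothing matches, instead of null as the original function does, keeps the result safe to loop over.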


Licensed under: CC-BY-SA with attribution