Question

I'm looking around for a RegEx that can help me parse an nquad file. An nquad file is a straight text file where each line represents a quad (s, p, o, c):

<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .

The objects can also be literals (instead of uris), in which case they are enclosed with double quotes:

<http://mysubject> <http://mypredicate> "My object" <http://mycontext> .

I'm looking for a regex that given one line of this file, which will give me back a php array in the following format:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "http://myobject"
[3] => "http://mycontext"

...or in the case where the double quotes are used for the object:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "My Object"
[3] => "http://mycontext"

One final thing - in an ideal world, the regex will cater for the scenario there may be 1 or more spaces between the various components, e.g.

<http://mysubject>     <http://mypredicate>  "My object"       <http://mycontext> .
Était-ce utile?

La solution

I'm going to add another answer as an additional solution using only a regex and explode:

$line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
$line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';

$delimeter = '---'; // Can't use space
$result = preg_replace('/<([^>]*)>\s+<([^>]*)>\s+(?:["<]){1}([^">]*)(?:[">]){1}\s+<([^>]*)>/i', '$1' . $delimeter . '$2' . $delimeter . '$3' . $delimeter . '$4', $line);
$array = explode( $delimeter, $result);

Autres conseils

It seems this can be accomplished as follows (I do not know your character restrictions so it may not work specifically for your needs, but worked for your test cases):

$line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
$line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';

// Remove unnecessary whitespace between entries (change $line to $line2 for testing)
$delimeter = '---';
$result = preg_replace('/([">]){1}\s+(["<]){1}/i', '$1' . $delimeter . '$2', $line);

// Explode on our delimeter
$array = explode( $delimeter, $result);
foreach( $array as &$a)
{
    // Replace the characters we don't want with nothing
    $a = str_replace( array( '<', '.', '>', '"'), '', $a);
}

var_dump( $array);

This regular expression would help:

/(\S+?)\s+(\S+?)\s+(\S+?)\s+(\S+?)\s+\./

(s, p, o, c) values will be in $1, $2, $3, $4 variables.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top