Extract keywords/tags from a string using preg_match_all
10-07-2019
Question
I have the following code:
$str = "keyword keyword 'keyword 1 and keyword 2' another 'one more'".'"another keyword" yes,one,two';
preg_match_all('/"[^"]+"|[^"\' ,]+|\'[^\']+\'/', $str, $matches);
echo "<pre>"; print_r($matches); echo "</pre>";
I want it to extract keywords from a string and keep those wrapped in single or double quotes together. The code above works, but it returns the values with the quotes still in them. I know I can remove these via str_replace or similar, but I'm really looking for a way to solve this within the preg_match_all call itself.
Output:
Array
(
[0] => Array
(
[0] => keyword
[1] => keyword
[2] => 'keyword 1 and keyword 2'
[3] => another
[4] => 'one more'
[5] => "another keyword"
[6] => yes
[7] => one
[8] => two
)
)
Also, I think my regex is a little bit sloppy, so any suggestions for a better one would be good :)
Any suggestions / help would be greatly appreciated.
Solution
You've almost got it; you just need to use lookarounds, so that the quotes are asserted but not included in the match:
'/(?<=\')[^\'\s][^\']*+(?=\')|(?<=")[^"\s][^"]*+(?=")|[^\'",\s]+/'
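As a quick check, the pattern can be run against the string from the question. This is only a sketch: the variable name $pattern is my own, and the result was traced for this particular input. Quoted tokens separated by nothing but a comma (for example 'a','b') can produce a stray "," token with this pattern, so try it against your own data before relying on it.

```php
<?php
$str = "keyword keyword 'keyword 1 and keyword 2' another 'one more'" . '"another keyword" yes,one,two';

// The lookbehind and lookahead assert the surrounding quotes without consuming them,
// so $matches[0] holds the bare token text.
$pattern = '/(?<=\')[^\'\s][^\']*+(?=\')|(?<=")[^"\s][^"]*+(?=")|[^\'",\s]+/';

preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
```

For the question's input, $matches[0] contains the nine tokens with no quote characters left in them.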
OTHER TIPS
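preg_match_all('/"([^"]+)"|[^"\' ,]+|\'([^\']+)\'/',$str,$matches);
and use $matches[1] (the contents of double-quoted tokens) and $matches[2] (the contents of single-quoted tokens). Unquoted tokens have no capture group in this pattern, so they appear only in $matches[0].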
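If you would rather have every token land in the same capture group, PCRE's branch reset group (?|...) is one option: each alternative inside it starts numbering its groups from 1 again. The following is a sketch under that assumption, with the unquoted alternative wrapped in a group of its own; it is not part of the original answer.

```php
<?php
$str = "keyword keyword 'keyword 1 and keyword 2' another 'one more'" . '"another keyword" yes,one,two';

// (?| ... ) is a branch reset group: the capture group in each alternative is numbered 1.
preg_match_all('/(?|"([^"]+)"|\'([^\']+)\'|([^"\' ,]+))/', $str, $matches);

// $matches[0] still contains the quotes; $matches[1] holds the tokens without them.
print_r($matches[1]);
```

With this layout there is no need to merge $matches[1] and $matches[2] afterwards, and the original token order is preserved.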
This requires a simple helper function to get what you want, but it works:
preg_match_all('/"([^"]+)"|([^"\' ,]+)|\'([^\']+)\'/',$str,$matches);
// Strip every single and double quote character from a token.
function r($str) {
    return str_replace(array('\'', '"'), '', $str);
}
$a = array_map('r', $matches[0]);
print_r($a);
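One caveat with the str_replace approach is that it removes every quote character, including an apostrophe inside a double-quoted token. A possible variation, sketched below, strips only one matching pair of enclosing quotes. The helper name stripEnclosingQuotes and the sample string are hypothetical and chosen purely for illustration.

```php
<?php
// Hypothetical helper: removes one matching pair of enclosing quotes, nothing else.
function stripEnclosingQuotes(string $token): string
{
    $len = strlen($token);
    if ($len >= 2) {
        $first = $token[0];
        if (($first === '"' || $first === "'") && $token[$len - 1] === $first) {
            return substr($token, 1, -1);
        }
    }
    return $token;
}

// Sample input (an assumption, not from the question) with an apostrophe inside double quotes.
$str = "tag \"it's here\" 'plain one'";
preg_match_all('/"[^"]+"|\'[^\']+\'|[^"\' ,]+/', $str, $matches);

$tokens = array_map('stripEnclosingQuotes', $matches[0]);
print_r($tokens);
```

Here the second token keeps its apostrophe ("it's here"), whereas removing all quote characters would turn it into "its here".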
Take a look at the tokenizeQuoted function in the comments to the strtok function.
Edit: You need to modify the function, because the original only works with double quotes:
function tokenizeQuoted($string)
{
    for ($tokens = array(), $nextToken = strtok($string, ' '); $nextToken !== false; $nextToken = strtok(' ')) {
        // Square brackets replace the {} string-offset syntax, which was removed in PHP 8.0.
        $firstChar = $nextToken[0];
        if ($firstChar === '"' || $firstChar === "'") {
            // The token opens with a quote: either it also closes here, or read on to the matching quote.
            $nextToken = $nextToken[strlen($nextToken) - 1] === $firstChar
                ? substr($nextToken, 1, -1)
                : substr($nextToken, 1) . ' ' . strtok($firstChar);
        }
        $tokens[] = $nextToken;
    }
    return $tokens;
}
Edit: Maybe you should just write your own parser:
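$tokens = array();
$buffer = '';
$quote = null; // the quote character currently open, or null when outside quotes
$len = strlen($str);
for ($i = 0; $i < $len; $i++) {
    $char = $str[$i]; // [] replaces the {} offset syntax removed in PHP 8.0
    if ($char === '"' || $char === "'") {
        if ($quote === null) {
            // Opening quote: flush any pending unquoted token first.
            if ($buffer !== '') {
                $tokens[] = $buffer;
                $buffer = '';
            }
            $quote = $char;
            continue;
        }
        if ($quote === $char) {
            // Matching closing quote: the buffer is a complete quoted token.
            $tokens[] = $buffer;
            $buffer = '';
            $quote = null;
            continue;
        }
        // A different quote character inside a quoted token is kept as literal text.
    } else if ($char === ',' || $char === ' ') {
        if ($quote === null) {
            // A separator outside quotes ends the current token.
            if ($buffer !== '') {
                $tokens[] = $buffer;
                $buffer = '';
            }
            continue;
        }
    }
    $buffer .= $char;
}
// Flush whatever is left after the last character.
if ($buffer !== '') {
    $tokens[] = $buffer;
}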