Question

I'm attempting to remove noise words from a string, and I have what I believe is a good algorithm for it, but I'm running into a snag. Before I do my preg_replace I remove all punctuation except the apostrophe ('). Then I put it through this preg_replace:

$content = preg_replace('/\b('.implode('|', self::$noiseWords).')\b/','',$content);

Which works great, except for words that do indeed have that ' character. preg_replace seems to be treating that as a boundary character. This is a problem for me.

Is there a way I can get around this? A different solution perhaps?

Thanks!

Here is the example I'm using:

$content = strtolower(strip_tags($content));
    $content = preg_replace("/(?!['])\p{P}/u", "", $content);// remove punctuation
    echo $content;// i've added striptags for editing as well should still workyep it doesnbsp

    $content = preg_replace("/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/",'',$content);

    $contentArray = explode(" ", $content);

    print_r($contentArray);

On the 3rd line you'll see the comment of what $content is right before the preg_replace

And though I'm assuming you can guess what my noiseWords array looks like, here's just a small fraction of it:

$noiseWords = array("a", "able","about","above","abroad","according","accordingly","across",
        "actually","adj","after","afterwards","again",......)

Solution

You can use a negative lookbehind and a negative lookahead to make sure the match isn't adjacent to a quote character:

$regex = "/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/";

Now your regex will not match any noise word that is immediately preceded or followed by a single quote.
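
For illustration, here is a minimal, self-contained sketch of that pattern in use. The function name removeNoiseWords, the shortened word list, and the sample sentence are assumptions made up for this example, not part of the original code. The preg_quote call is an extra precaution in case a noise word ever contains a regex metacharacter, and the preg_split step is one possible way to avoid the empty entries that explode(" ") would produce where words were removed.

```php
<?php
// Hypothetical, shortened noise-word list (illustration only).
$noiseWords = array("a", "i", "about", "above", "after", "again");

// Hypothetical helper: strips noise words that are not touching an apostrophe.
function removeNoiseWords($content, array $noiseWords)
{
    // Escape each word so any regex metacharacters are matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $noiseWords);

    // (?<!') : the word must not be preceded by an apostrophe
    // (?!')  : the word must not be followed by an apostrophe
    $regex = "/\\b(?<!')(" . implode('|', $escaped) . ")(?!')\\b/";

    return preg_replace($regex, '', $content);
}

$result = removeNoiseWords("i've read a book about regex again", $noiseWords);

// Split on runs of whitespace and drop the empty pieces left by removed words.
$words = preg_split('/\s+/', $result, -1, PREG_SPLIT_NO_EMPTY);

print_r($words); // "i've", "read", "book", "regex"
```

Here the "i" in "i've" is kept because it is followed by an apostrophe, while the standalone "a", "about" and "again" are removed.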

Licensed under: CC-BY-SA with attribution