PHP: trim word OR part of it from beginning/end of string

Question 1

Matching a word and all of its possible abbreviations from the nth letter onward isn't an easy task in regex.

Here is how I would do it for the word insurance from the 4th letter:

insu(?>r(?>a(?>n(?>c(?>(?<last>e))?)?)?)?)?(?(last)|\.)

http://regex101.com/r/aL2gV4

It works by using atomic groups to force the regex engine as far forward as possible through the remaining letters 'rance', using the nested pattern (?>a(?>b)?)?. If the last letter is matched, we're not dealing with an abbreviation, so no dot is required; otherwise the dot is required. This is coded by (?(last)|\.).
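
To see the dot rule in action, here is a small check of that pattern against a few sample strings. The ^ and $ anchors and the sample list are my additions for the demo, not part of the pattern above; treat it as a sketch rather than a test suite.

```php
<?php
// Pattern from the answer, wrapped in delimiters. The ^ and $ anchors are
// added for this demo so that the whole sample has to match.
$pattern = '/^insu(?>r(?>a(?>n(?>c(?>(?<last>e))?)?)?)?)?(?(last)|\.)$/';

// Hypothetical samples: abbreviations with and without a dot, and the full word.
$samples = ['insu.', 'insur.', 'insuranc.', 'insurance', 'ins.', 'insu', 'insuranc'];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match, 0 otherwise.
    $results[$sample] = preg_match($pattern, $sample) === 1;
    echo str_pad($sample, 10), $results[$sample] ? 'match' : 'no match', "\n";
}
```

The abbreviations that end in a dot and the full word should match; 'ins.' is too short, and 'insu' and 'insuranc' lack the required dot.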

To trim, I would create a function that builds the above regex for a given word and minimum abbreviation length. Then you can write a while loop that replaces each match of the abbreviation regex with an empty string until there are no more matches.
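
A minimal sketch of such a builder, assuming $minLen is smaller than the length of $word. The names buildAbbrevPattern and trimAbbreviations are hypothetical, and the \b word boundary is my own addition so a match cannot start in the middle of a longer word.

```php
<?php
// Hypothetical helper: builds the nested atomic-group pattern for $word,
// accepting abbreviations that keep at least $minLen leading letters.
function buildAbbrevPattern(string $word, int $minLen, bool $caseInsensitive = false): string
{
    if ($minLen < 1 || $minLen >= strlen($word)) {
        throw new InvalidArgumentException('minLen must be between 1 and strlen(word) - 1');
    }

    $prefix = preg_quote(substr($word, 0, $minLen), '/');
    $rest   = str_split(substr($word, $minLen));

    // The innermost group captures the final letter as "last".
    $tail = '(?>(?<last>' . preg_quote(array_pop($rest), '/') . '))?';
    // Wrap the remaining letters around it, from right to left.
    foreach (array_reverse($rest) as $letter) {
        $tail = '(?>' . preg_quote($letter, '/') . $tail . ')?';
    }

    // A dot is required only if "last" did not take part in the match.
    return '/\b' . $prefix . $tail . '(?(last)|\.)/' . ($caseInsensitive ? 'i' : '');
}

// Hypothetical wrapper: removes every match from $str.
function trimAbbreviations(string $str, string $word, int $minLen, bool $caseInsensitive = false): string
{
    return preg_replace(buildAbbrevPattern($word, $minLen, $caseInsensitive), '', $str);
}
```

Since preg_replace() replaces every match in a single call, an explicit loop should only be needed if removing one match can create a new one.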

Non regex version

Here is my non-regex version that removes every occurrence of a word and its abbreviations from a string:

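function trimWords($str, $word, $min_abbrv, $case_insensitive = false) {
  $len      = 0;  // number of leading characters of $word matched so far
  $word_len = strlen($word);
  $strlen   = strlen($str);
  // Function names must be strings; bare names are an error in PHP 8
  $cmp      = $case_insensitive ? 'strncasecmp' : 'strncmp';

  // Run one step past the end so a match that ends the string is handled too
  for ($i = 0; $i <= $strlen; $i++) {
    $char = $i < $strlen ? $str[$i] : '';
    if ($len < $word_len && $cmp($char, $word[$len], 1) === 0) {
      $len++;
    } else if ($len > 0) {
      // An abbreviation needs a trailing dot; the full word does not
      $dot = ($len < $word_len && $char === '.') ? 1 : 0;
      if ($len == $word_len || ($len >= $min_abbrv && $dot)) {
        $i     -= $len;
        $len   += $dot;
        $str    = substr($str, 0, $i) . substr($str, $i + $len);
        $strlen = strlen($str);
        $i--;  // re-examine the character that now sits at the cut position
      }
      $len = 0;
    }
  }

  return $str;
}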

Example:

$string = 'ins. <- "ins." / insu. insuranc. insurance / insurance. <- "."';
echo trimWords($string, 'insurance', 4);

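Output (the spaces around the removed words are left in place, so four of them remain between the two slashes):

ins. <- "ins." /    / . <- "."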

Question 2

I wrote a function that constructs the regular expression pattern as described by mrhobo, plus a simple test, and benchmarked it against my own function, which uses pure PHP string comparison.

Here is the code:

$string = 'Insur. companies are nasty rich';
$dirt = 'insurance';
$cycles = 500000;


$start = microtime(true);

$i = $cycles;
while ($i) {
    $i--;
    regexpStyle($string, $dirt, true);
}

$stop = microtime(true);

$i = $cycles;
while ($i) {
    $i--;
    trimWords($string, $dirt, true);
}

$end = microtime(true);

$res1 = $stop - $start;
$res2 = $end - $stop;


$winner = $res1 < $res2 ? '<<<' : '>>>';

echo 'regexp: '.$res1.' '.$winner.' string operations: '.$res2;

// $reverse is not used in this benchmark version
function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
    $pos = 0;
    $func = $case_insensitive ? 'strncasecmp' : 'strncmp';
    // Bound the scan; without it the loop never ends when $s equals $dirt
    $max = min(strlen($s), strlen($dirt));

    // Get number of initial characters that match in both strings
    while ($pos < $max && $func($s, $dirt, $pos + 1) === 0)
        $pos++;

    // If more than 2 initial characters match, then remove the match
    if ($pos > 2)
        $s = substr($s, $pos);

    // Strip leftover spaces, dots and hyphens from both ends
    return trim($s, ' .-');
}

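function regexpStyle($s, $dirt, $case_insensitive, $min_abbrev = 3)
{
    // Letters after the mandatory prefix become nested optional atomic groups
    $ss = substr($dirt, $min_abbrev);
    $arr = str_split($ss);
    // Innermost group: the final letter, captured as "last"
    $patt = '(?>(?<last>'.array_pop($arr).'))?';
    $i = count($arr);
    while ($i)
        $patt = '(?>'.$arr[--$i].$patt.')?';
    // A dot is required only when "last" did not take part in the match
    $patt = '#^'.substr($dirt, 0, $min_abbrev).$patt.'(?(last)|\.)#';
    $patt .= $case_insensitive ? 'i' : '';
    return trim(preg_replace($patt, '', $s));
}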

and the winner is... moment of silence... it is...

a draw

regexp: 8.5169589519501 >>> string operations: 8.0951890945435

but I have a strong feeling that the regexp approach could be put to better use.
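
One thing that might be worth trying: regexpStyle() rebuilds the same pattern string on every call, so the benchmark also times the string concatenation. Below is a sketch that builds the pattern once and reuses it. The names buildPattern and regexpTrim are hypothetical, and I have not measured whether this changes the result.

```php
<?php
// Hypothetical split of regexpStyle(): build the pattern once...
function buildPattern(string $dirt, bool $caseInsensitive, int $minAbbrev = 3): string
{
    // Assumes $minAbbrev is smaller than strlen($dirt).
    $letters = str_split(substr($dirt, $minAbbrev));
    $patt = '(?>(?<last>' . array_pop($letters) . '))?';
    // Wrap the remaining letters around the innermost group, right to left.
    while ($letters) {
        $patt = '(?>' . array_pop($letters) . $patt . ')?';
    }
    return '#^' . substr($dirt, 0, $minAbbrev) . $patt . '(?(last)|\.)#'
        . ($caseInsensitive ? 'i' : '');
}

// ...and apply it as often as needed.
function regexpTrim(string $s, string $pattern): string
{
    return trim(preg_replace($pattern, '', $s));
}

$pattern = buildPattern('insurance', true);
$result  = regexpTrim('Insur. companies are nasty rich', $pattern);
echo $result, "\n";
```

In the benchmark, the call to buildPattern() would go before the while loop and only regexpTrim() inside it.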

Final solution

Non regex version