Question

I need to trim words from the beginning and end of a string. The problem is that the words are sometimes abbreviated, i.e. cut down to as few as their first three letters (followed by a dot).

I tried hard to find a suitable regular expression. Basically, I need to catch three or more initial characters, up to the length of the word being removed, but I cannot find a regular expression that matches a variable length and keeps the order of characters.

For example, if I need to trim 'insurance' from the sentence 'insur. companies are rich', the pattern \^[insurance]{3,9}\ comes to mind, but it will also catch words like 'sensace', because the order of characters (and how often they occur) inside [] does not matter to the regexp.
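
A quick check illustrates the problem. This is only a sketch; the `/` delimiters are an assumption, since the pattern above is quoted without usable ones:

```php
<?php
// A character class matches any of its characters in any order,
// so the quantified class accepts unrelated words as well.
$pattern = '/^[insurance]{3,9}/';

var_dump(preg_match($pattern, 'insur. companies are rich')); // int(1), the intended match
var_dump(preg_match($pattern, 'sensace'));                   // int(1), an unwanted match
```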

Also, at the end of the string, I need to remove serial numbers that are abbreviated from the beginning; say, 'XK-25F14' is sometimes presented as '25F14'. So I decided to go purely with character-by-character comparison.

So I ended up with the following PHP function:

function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
    $pos = 0;
    $func = $case_insensitive ? 'strncasecmp' : 'strncmp';

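    // Get number of initial characters that match in both strings
    // (bounded by the shorter string, otherwise two identical strings loop forever)
    $max = min(strlen($s), strlen($dirt));
    while ($pos < $max && $func($s, $dirt, $pos + 1) === 0)
        $pos++;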

    // If more than 2 initial characters match, then remove the match   
    if ($pos > 2)
        $s = substr($s, $pos);

    // Reverse $s and $dirt so it will trim from the end of string
    $s = strrev($s);        
    if ($reverse)
        return trimWords($s, strrev($dirt), $case_insensitive, false);

    // After second run return back-reversed string 
    return trim($s, ' .-');
}

I'm happy with this function, but it has one drawback: it trims only one occurrence of the word. How can I make it trim more occurrences, i.e. remove both forms of 'insurance' from 'Insurance insur. companies'?

I'm also curious: is there really no regular expression that matches a variable length and respects the order of characters in the pattern?

Final solution

Thanks to mrhobo, I ended up with a function based on a regular expression. This function can easily be improved and should also be the most efficient for this task.

I modified my previous function, and it is twice as fast as the regexp, but it can remove only one word per run. To remove a word from both the beginning and the end, it has to run twice, which makes its performance the same as the regexp. To remove more than one occurrence of a word, it has to run multiple times, which gets slower and slower.

The final function goes like this.

function trimWords($string, $word, $case_insensitive = false, $min_abbrv = 3)
{
    $exc = substr($word, $min_abbrv);
    $pat = null;

    $i = strlen($exc);
    while ($i--)
        $pat = '(?>'.preg_quote($exc[$i], '#').$pat.')?';

    $pat = substr($word, 0, $min_abbrv).$pat;
    $pat = '#(?<begin>^)?(?:\W*\b'.$pat.'\b\W*)+(?(begin)|$)#';
    if ($case_insensitive)
        $pat .= 'i';

    return preg_replace($pat, '', $string);
}

NOTE: With this function it does not matter whether the abbreviation ends with a dot; it wipes out any shorter form of the word and also removes all non-word characters around it.

EDIT: I just tried building the replacement pattern as insu(r|ra|ran|ranc|rance). The function with atomic groups is faster by ~30%, and with longer words it could be even more efficient.
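
The EDIT does not show how the alternation pattern was built. One possible way is sketched below; the helper name `buildAlternationPattern()` is hypothetical and the code is not taken from the benchmark itself:

```php
<?php
// Hypothetical helper: lists every allowed continuation of the first
// $min letters of $word as an explicit alternative.
function buildAlternationPattern($word, $min)
{
    $alts = [];
    for ($len = 1; $len <= strlen($word) - $min; $len++) {
        $alts[] = preg_quote(substr($word, $min, $len), '#');
    }

    return preg_quote(substr($word, 0, $min), '#') . '(' . implode('|', $alts) . ')';
}

echo buildAlternationPattern('insurance', 4); // insu(r|ra|ran|ranc|rance)
```

The number of alternatives grows with the length of the word, which may be why the nested atomic groups come out ahead for longer words.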

Solution

Matching a word and all possible abbreviations from the nth letter isn't quite an easy task in regex.

Here is how I would do it for the word insurance from the 4th letter:

insu(?>r(?>a(?>n(?>c(?>(?<last>e))?)?)?)?)?(?(last)|\.)

http://regex101.com/r/aL2gV4

It works by using atomic groups to force the regex engine as far forward as possible through the trailing letters 'rance', using the nested pattern (?>a(?>b)?)?. If the last letter is matched, we are not dealing with an abbreviation, so no dot is required; otherwise the dot is required. This is coded by (?(last)|\.).

To trim, I would create a function that builds the above regex for a given word. Then you can write a while loop that replaces the matches of each abbreviation regex with an empty string until there are no more matches.
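
A minimal sketch of that idea follows. The helper name `buildAbbrevPattern()`, the `/` delimiter and the sample string are assumptions made for illustration, and the helper assumes the word is longer than `$min` letters:

```php
<?php
// Hypothetical helper: builds the nested atomic-group pattern for $word,
// allowing abbreviations of at least $min letters followed by a dot.
// Assumes strlen($word) > $min.
function buildAbbrevPattern($word, $min)
{
    $tail = substr($word, $min);
    $last = strlen($tail) - 1;

    // The innermost group captures the final letter as "last"
    $pat = '(?>(?<last>' . preg_quote($tail[$last], '/') . '))?';
    for ($i = $last - 1; $i >= 0; $i--) {
        $pat = '(?>' . preg_quote($tail[$i], '/') . $pat . ')?';
    }

    // The full word needs no dot; an abbreviation must end with one
    return '/' . preg_quote(substr($word, 0, $min), '/') . $pat . '(?(last)|\.)/';
}

$pattern = buildAbbrevPattern('insurance', 4);
$string  = 'insu. insuranc. insurance companies';

// preg_replace() already replaces every match in one call; the loop only
// repeats in case a replacement exposes a new match.
do {
    $string = preg_replace($pattern, '', $string, -1, $count);
} while ($count > 0);

echo trim($string); // companies
```

For 'insurance' and a minimum of 4 letters the helper produces the same pattern as the hand-written one above.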

Non regex version

Here is my non regex version that removes multiple words and abbreviated words from a string:

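function trimWords($str, $word, $min_abbrv, $case_insensitive = false) {
  $len      = 0;
  $word_len = strlen($word);
  $strlen   = strlen($str);
  // Function names must be quoted strings to be called through a variable
  $cmp      = $case_insensitive ? 'strncasecmp' : 'strncmp';

  // <= so that a match running up to the end of the string is handled too
  for ($i = 0; $i <= $strlen; $i++) {
    $ch = $i < $strlen ? $str[$i] : '';
    // Compare one character at a time and stop once the whole word has matched
    if ($ch !== '' && $len < $word_len && $cmp($ch, $word[$len], 1) == 0) {
      $len++;
    } else if ($len > 0) {
      $start = $i - $len;
      // An abbreviation must be followed by a dot, the full word need not be
      $dot = ($len < $word_len && $ch == '.') ? 1 : 0;
      if ($len == $word_len || ($len >= $min_abbrv && $dot)) {
        $str    = substr($str, 0, $start) . substr($str, $i + $dot);
        $strlen = strlen($str);
        $i      = $start - 1; // continue at the cut position
      } else {
        $i      = $start;     // no removal: retry from the next character
      }
      $len = 0;
    }
  }

  return $str;
}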

Example:

$string = 'ins. <- "ins." / insu. insuranc. insurance / insurance. <- "."';
echo trimWords($string, 'insurance', 4);

Output is:

ins. <- "ins." / / . <- "."

Other tips

I wrote a function that constructs the regular expression pattern suggested by mrhobo, plus a simple test, and benchmarked it against my function based on pure PHP string comparison.

Here is the code:

$string = 'Insur. companies are nasty rich';
$dirt = 'insurance';
$cycles = 500000;


$start = microtime(true);

$i = $cycles;
while ($i) {
    $i--;
    regexpStyle($string, $dirt, true);
}

$stop = microtime(true);

$i = $cycles;
while ($i) {
    $i--;
    trimWords($string, $dirt, true);
}

$end = microtime(true);

$res1 = $stop - $start;
$res2 = $end - $stop;


$winner = $res1 < $res2 ? '<<<' : '>>>';

echo 'regexp: '.$res1.' '.$winner.' string operations: '.$res2;

function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
    $pos = 0;
    $func = $case_insensitive ? 'strncasecmp' : 'strncmp';

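    // Get number of initial characters that match in both strings
    // (bounded by the shorter string, otherwise two identical strings loop forever)
    $max = min(strlen($s), strlen($dirt));
    while ($pos < $max && $func($s, $dirt, $pos + 1) === 0)
        $pos++;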

    // If more than 2 initial characters match, then remove the match   
    if ($pos > 2)
        $s = substr($s, $pos);

    // Single pass only (no reversed second run here), so just trim leftover separators
    return trim($s, ' .-');
}

function regexpStyle($s, $dirt, $case_insensitive, $min_abbrev = 3)
{
    $ss = substr($dirt, $min_abbrev);
    $arr = str_split($ss);
    $patt = '(?>(?<last>'.array_pop($arr).'))?';
    $i = count($arr);
    while ($i)
        $patt = '(?>'.$arr[--$i].$patt.')?';
    $patt = '#^'.substr($dirt, 0, $min_abbrev).$patt.'(?(last)|\.)#';
    $patt .= $case_insensitive ? 'i' : null;
    return trim(preg_replace($patt, '', $s));
}

and the winner is... moment of silence... it is...

a draw

regexp: 8.5169589519501 >>> string operations: 8.0951890945435

but I have a strong feeling that the regexp approach could be put to better use.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow