Question

I am trying to split the text into words:

$delimiterList = array(" ", ".", "-", ",", ";", "_", ":",
           "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
           '"');
$words = mb_split($delimiterList, $string);

This works fine for plain words, but I am stuck in some cases that involve numbers.

For example, given the text "Look at this.My score is 3.14, and I am happy about it.", the resulting array is:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3,
[7]=>14,
[8]=>and, ....

So 3.14 is also split into 3 and 14, which should not happen in my case. A dot should separate two words, but not two numbers. The result should look like this:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3.14,
[7]=>and, ....

But I have no idea how to avoid these cases!

Does anybody have an idea how to solve this problem?

Thanks, Granit


Solution

Or use regex :)

<?php
$str = "Look at this.My score is 3.14, and I am happy about it.";

// alternative to handle Marko's example (updated)
// /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/

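// Split on the listed delimiters; a dot splits only when no digit is next to it.
// The limit is -1 (no limit): passing null is deprecated as of PHP 8.1.
var_dump(preg_split('/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/',
                    $str, -1, PREG_SPLIT_NO_EMPTY));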

array(13) {
  [0]=>
  string(4) "Look"
  [1]=>
  string(2) "at"
  [2]=>
  string(4) "this"
  [3]=>
  string(2) "My"
  [4]=>
  string(5) "score"
  [5]=>
  string(2) "is"
  [6]=>
  string(4) "3.14"
  [7]=>
  string(3) "and"
  [8]=>
  string(1) "I"
  [9]=>
  string(2) "am"
  [10]=>
  string(5) "happy"
  [11]=>
  string(5) "about"
  [12]=>
  string(2) "it"
}
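
The two patterns above differ in how they treat a dot that follows a digit but ends a sentence. The following is a minimal sketch comparing them; the helper name split_words and the sample sentence are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: split $text on $pattern and drop empty pieces.
function split_words(string $pattern, string $text): array
{
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Pattern used in the answer: a dot splits only when no digit touches it.
$strict = '/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/';

// Commented-out alternative from the answer: a dot (or ":", ",", "-") splits
// unless it sits between two digits; a dot at the very end always splits.
$loose = '/([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/';

// Sample input (an assumption): one sentence ends with a number.
$text = "My score is 3. Pi is 3.14.";

print_r(split_words($strict, $text)); // keeps "3." and "3.14." with the dot
print_r(split_words($loose, $text));  // yields "3" and "3.14"
```

With the first pattern a sentence-ending dot stays attached to a preceding number, which is the case the alternative appears to address.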

OTHER TIPS

Take a look at strtok. It lets you change the parsing tokens dynamically, so you can break the string apart manually in a while loop, pushing each split off word into an array.
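
A rough sketch of the strtok idea follows. It is only one possible reading of the suggestion: for simplicity it keeps the delimiter set fixed rather than changing it between calls, and the helper name strtok_words is made up for this example.

```php
<?php
// Hypothetical helper: tokenize with strtok(), treating "." as a separator
// only when the token is not a plain decimal number.
function strtok_words(string $text): array
{
    // Every delimiter from the question except the dot.
    $delims = " -,;_:!?/()[]{}<>\r\n\"";
    $words = [];

    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        if (preg_match('/^\d+\.\d+$/', $tok)) {
            $words[] = $tok; // keep decimals such as 3.14 intact
            continue;
        }
        // Otherwise the dot is punctuation: split on it and drop empty pieces.
        foreach (explode('.', $tok) as $part) {
            if ($part !== '') {
                $words[] = $part;
            }
        }
    }
    return $words;
}

print_r(strtok_words("Look at this.My score is 3.14, and I am happy about it."));
```

A number directly followed by a sentence-ending dot (for example "3.14.") would still be split by this sketch, so it would need an extra rule for that case.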

My first idea was preg_match_all('/\w+/', $string, $matches); but that gives a result similar to the one you've got. The problem is that a dot between numbers is ambiguous: it can be a decimal point or the end of a sentence. So we need to rewrite the string in a way that removes the double meaning.

For example in this sentence we have several parts that we'd like to keep as one word: "Look at this.My score is 3.14, and I am happy about it. It's not 334,3 and today's not 2009-12-12 11:12:13.".

We start by building a search->replace dictionary to encode the exceptions into something that's not going to get split:

$encode = array(
    '/(\d+?)\.(\d+?)/' => '\\1DOT\\2',
    '/(\d+?),(\d+?)/' => '\\1COMMA\\2',
    '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/' => '\\1DASH\\2DASH\\3SPACE\\4COLON\\5COLON\\6'
);

Next, we encode the exceptions:

foreach ($encode as $regex => $repl) {
    $string = preg_replace($regex, $repl, $string);
}

Split the string:

preg_match_all('/\w+/', $string, $matches);

And convert the encoded words back:

$decode = array(
    'search' =>  array('DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'),
    'replace' => array('.',   ',',     '-',    ' ',     ':'    )
);
foreach ($matches as $k => $v) {
    $matches[$k] = str_replace($decode['search'], $decode['replace'], $v);
}

$matches[0] now contains the original sentence split into words with the right exceptions (preg_match_all stores the full matches in the first element of $matches).
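
Put together, the steps above could look roughly like this. The wrapper function split_with_exceptions and the sample sentence are assumptions added for illustration; the patterns and placeholders are the ones from the answer.

```php
<?php
// Hypothetical wrapper (the name is an assumption) around the three steps above.
function split_with_exceptions(string $string): array
{
    // 1. Encode separators that sit between digits so \w+ will not split on them.
    $encode = [
        '/(\d+?)\.(\d+?)/' => '\\1DOT\\2',
        '/(\d+?),(\d+?)/'  => '\\1COMMA\\2',
        '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/'
            => '\\1DASH\\2DASH\\3SPACE\\4COLON\\5COLON\\6',
    ];
    foreach ($encode as $regex => $repl) {
        $string = preg_replace($regex, $repl, $string);
    }

    // 2. Split into words; the full matches end up in $matches[0].
    preg_match_all('/\w+/', $string, $matches);

    // 3. Decode the placeholders back to the original characters.
    return str_replace(
        ['DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'],
        ['.',   ',',     '-',    ' ',     ':'],
        $matches[0]
    );
}

print_r(split_with_exceptions("My score is 3.14, not 334,3, on 2009-12-12 11:12:13."));
```

One caveat of this approach: input that already contains the literal text DOT, COMMA, DASH, SPACE or COLON would be altered by the decode step, so less common placeholder strings may be a safer choice.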

You can make the regexes used for the exceptions as simple or as complex as you like, but some ambiguity is always going to get through, for example two sentences where the first one ends and the next one begins with a number: "Number of the counting shall be 3.3 only and nothing but the 3.5 is right out."

Use ". " instead of "." in $delimiterList.
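
To see what this suggestion does in practice, here is a small sketch. Since the delimiters are now multi-character strings, it builds a regular expression from the list; the helper name split_on_delimiters is an assumption for this example.

```php
<?php
// Hypothetical helper: split $text on any of the literal strings in $delims.
function split_on_delimiters(string $text, array $delims): array
{
    // Escape each delimiter so it is matched literally, then join with "|".
    $alts = array_map(fn($d) => preg_quote($d, '/'), $delims);
    $pattern = '/(?:' . implode('|', $alts) . ')/';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// The list from the question with "." replaced by ". ".
$delimiterList = array(" ", ". ", "-", ",", ";", "_", ":",
           "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
           '"');

print_r(split_on_delimiters(
    "Look at this.My score is 3.14, and I am happy about it.",
    $delimiterList
));
```

This keeps 3.14 intact, but a dot with no space after it is no longer a delimiter at all, so "this.My" stays one word and the final "it." keeps its dot.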

Licensed under: CC-BY-SA with attribution