Question

For example, I want to split this sentence:

I am a sentence.

into an array with 5 parts: "I", "am", "a", "sentence", and ".".

I'm currently using preg_split after trying explode, but I can't seem to find something suitable.

This is what I've tried:

$sentence = explode(" ", $sentence);
/*
returns array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(9) "sentence."
}
*/

And also this:

$sentence = preg_split("/[.?!\s]/", $sentence);
/*
returns array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(0) ""
}
*/

How can this be done?


Solution

You can split on word boundaries:

$sentence = preg_split("/(?<=\w)\b\s*/", 'I am a sentence.');

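The lookbehind (?<=\w) requires that the previous character is a word character, \b then requires a word boundary at that position, and \s* consumes any whitespace that follows. The string is therefore split at the end of every word, including the zero-width position between "sentence" and the period.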

Output:

array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(1) "."
}
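
A different way to get the same five parts is to match the tokens themselves rather than the gaps between them. The sketch below is not the approach from the answer above; it swaps in preg_match_all, and the helper name tokenize_words_and_punctuation is made up for illustration. It treats each run of word characters as one token and every other non-space character as a token of its own.

```php
<?php
// Hypothetical helper (not from the original answer): collects tokens by
// matching them instead of splitting on the gaps between them.
function tokenize_words_and_punctuation(string $sentence): array
{
    // \w+      one or more word characters (letters, digits, underscore)
    // [^\w\s]  a single character that is neither a word character nor whitespace
    preg_match_all('/\w+|[^\w\s]/', $sentence, $matches);

    // $matches[0] holds the full matches in the order they were found.
    return $matches[0];
}

var_dump(tokenize_words_and_punctuation('I am a sentence.'));
// array of 5 strings: "I", "am", "a", "sentence", "."
```

As written, the pattern works on bytes, so it is only suitable for ASCII input; UTF-8 text would need the /u modifier.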

OTHER TIPS

I was looking for the same solution and landed here. The accepted solution does not handle words that contain non-word characters such as apostrophes or accented letters: it splits the word at them. Below is the solution that worked for me.

Here is my test sentence:

Claire’s favorite sonata for piano is Mozart’s Sonata no. 15 in C Major.

The accepted answer gave me the following results:

Array
(
    [0] => Claire
    [1] => ’s
    [2] => favorite
    [3] => sonata
    [4] => for
    [5] => piano
    [6] => is
    [7] => Mozart
    [8] => ’s
    [9] => Sonata
    [10] => no
    [11] => . 15
    [12] => in
    [13] => C
    [14] => Major
    [15] => .
)

The solution I came up with follows:

$parts = preg_split("/\s+|\b(?=[!\?\.])(?!\.\s+)/", $sentence);

It gives the following results:

Array
(
    [0] => Claire’s
    [1] => favorite
    [2] => sonata
    [3] => for
    [4] => piano
    [5] => is
    [6] => Mozart’s
    [7] => Sonata
    [8] => no.
    [9] => 15
    [10] => in
    [11] => C
    [12] => Major
    [13] => .
)
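
To see what the negative lookahead in that pattern does, it may help to run it on text with more than one sentence. The sketch below wraps the same regex in a helper whose name (split_keep_final_punctuation) is made up for illustration. A period followed by whitespace stays attached to its word, which is what keeps "no." intact, while "!" and "?" and a period at the very end of the string become separate elements.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function split_keep_final_punctuation(string $sentence): array
{
    // \s+                 split on runs of whitespace, or
    // \b(?=[!\?\.])       at a word boundary directly before ! ? or .
    // (?!\.\s+)           unless that character is a period followed by whitespace
    return preg_split('/\s+|\b(?=[!\?\.])(?!\.\s+)/', $sentence);
}

var_dump(split_keep_final_punctuation('Hi there. Bye!'));
// array of 4 strings: "Hi", "there.", "Bye", "!"

var_dump(split_keep_final_punctuation('Really? Yes.'));
// array of 4 strings: "Really", "?", "Yes", "."
```

So this pattern cannot tell an abbreviation from the end of a sentence in the middle of a text; only the last period of the string is split off.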

If anyone is interested in a simple solution that ignores punctuation:

preg_split( '/[^a-zA-Z0-9]+/', 'I am a sentence' );

would split into

array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
}
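
Note that if the input ends in punctuation, as in the original question, this pattern leaves an empty string at the end of the array, the same thing the asker saw with /[.?!\s]/. A minimal sketch of one way around that, assuming empty pieces are never wanted, is to pass the PREG_SPLIT_NO_EMPTY flag:

```php
<?php
// The character class is the one from the answer above; only the flag is new.
// The third argument (-1) means "no limit" and is needed to reach the flags.
$words = preg_split('/[^a-zA-Z0-9]+/', 'I am a sentence.', -1, PREG_SPLIT_NO_EMPTY);

var_dump($words);
// array of 4 strings: "I", "am", "a", "sentence"
```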

Or, as an alternative, a solution where the punctuation stays attached to the adjacent word:

preg_split( '/\b[^a-zA-Z0-9]+\b/', 'I am a sentence.' );

would split into

array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(9) "sentence."
}
Licensed under: CC-BY-SA with attribution