Mindboggling Regular Expression to convert Whitespace-Comma-Whitespace input string to an array. Quoting must be supported

StackOverflow https://stackoverflow.com/questions/18679255

Question

Here is my best attempt (so far) to solve this issue. I'm new to regular expressions and this problem is pretty substantial, but I'll give it a try. Regular expressions clearly take some time to master.

This seems to satisfy the delimiter/comma requirements. To me it seems redundant, though, because of the repeated \s*. There is likely a better way.

/\s*[,|\s*]\s*/
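One thing to note about the attempt above: inside square brackets, | and * are literal characters, so [,|\s*] matches a single comma, pipe, whitespace character, or asterisk. A minimal sketch of the delimiter rule written with alternation instead, assuming preg_split() is acceptable for this step (the helper name split_on_delimiters is made up for illustration, and quoting is not handled here):

```php
<?php
// Hypothetical helper, not from the original post: splits on
// "optional whitespace, comma, optional whitespace" or on a run of whitespace.
function split_on_delimiters($input)
{
    // The comma form is tried first so that " , " is consumed as one delimiter.
    // PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading/trailing delimiters.
    return preg_split('/\s*,\s*|\s+/', $input, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_on_delimiters("  one two , three,four  "));
// Elements: one, two, three, four
```

This only covers the unquoted case; a quoted phrase such as "two three" would still be split at its inner space.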

I found this on Stack Overflow and am trying to take it apart and apply it to my problem (not easy). This seems to satisfy most of the "quoting" requirements, but I'm still working on how to solve the delimiter issues in the requirements below.

/"(?:\\\\.|[^\\\\"])*"|\S+/
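To see what this pattern does and does not handle, here is a small sketch that runs it through preg_match_all(). The sample input is an assumption chosen for illustration; it includes backslash-escaped quotes and a free-standing comma.

```php
<?php
// In a single-quoted PHP string, each "\\\\" reaches the regex engine as "\\",
// which matches one literal backslash.
$pattern = '/"(?:\\\\.|[^\\\\"])*"|\S+/';
$input   = 'one "two \"x\" three" four , five';   // sample input (assumed)

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
// Tokens: one | "two \"x\" three" | four | , | five
```

The quoted phrase survives as one token, escaped quotes included, but the lone comma comes back as its own token because \S+ treats it like any other non-whitespace character. That is the delimiter problem still left to solve.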

The requirements I'm trying to meet:

  • Will be used by the PHP preg_match_all() (or similar) function to break a string into an array of strings. Source language is PHP.
  • Words in the input string are delimited by (0 or more whitespace)(optional comma)(0 or more whitespace) or just (1 or more whitespace).
  • The input string can also have quoted substrings which become a single element in the output array.
  • Quoted substrings in the input string must retain their double quotes when placed in the output array (because we must be able to identify them later as being originally quoted in the input string).
  • Leading and trailing whitespace (that is, whitespace between the double-quote character and the string itself) in quoted substrings must be removed when placed into the output array. Example: "<space>hello<space>world<space><tab>" becomes "hello<space>world"
  • Whitespace within quoted phrases in the input string must be reduced to a single space when placed into its output array element. Example: "hello<space><tab><space><space>world" becomes "hello<space>world"
  • Quoted substrings in the input string that are zero-length or contain only whitespace are not placed into the output array (The output array must not contain any zero-length elements).
  • Each element of the output array must be trimmed (left and right) for whitespace.

This example demonstrates all requirements above:

Input String:

"" one " two     three " four  ,  five "   six seven " " "

Returns this array (double quotes actually exist in the strings shown below):

{one,"two three",four,five,"six seven"}

EDIT 9/13/2013

I have been studying regular expressions hard for a couple days and finally settled on this proposed solution. It may not be the best, but it's what I have at this time.

I will use this regex to split the search string into an array using PHP's preg_match_all() function:

/(?:"([^"]*)"|([^\s",]+))/

The leading and trailing "/" are the pattern delimiters that PHP's preg_match_all() function requires.

Now that the array is created, we retrieve it through the function's third argument like this:

preg_match_all(REGEX, $SearchString, $Matches);
$Array = $Matches[0];

We have to do this because the function itself returns only the number of matches; it fills $Matches with a compound array whose element 0 contains the full matches of the regex. The other elements contain the values captured by the regex's groups, which we don't need.
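A small sketch of what preg_match_all() leaves in its third argument for this pattern, using the default PREG_PATTERN_ORDER layout. The input is a shortened sample assumed for illustration.

```php
<?php
$input = '"" one " two   three " four  ,  five';  // shortened sample (assumed)
$count = preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $input, $Matches);

// The return value is the number of full matches; the matches are in $Matches.
echo $count, "\n";      // 5
print_r($Matches[0]);   // full matches: "", one, " two   three ", four, five
print_r($Matches[1]);   // group 1: text inside the quotes, '' for unquoted matches
print_r($Matches[2]);   // group 2: the unquoted word, '' for quoted matches
```

Element 0 keeps the double quotes, which is why it is the one to use when quoted terms must remain identifiable later.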

Now, I will iterate the resulting array and process each element to meet the requirements (above), which will be much easier than meeting all the requirements in a single step using single regex.
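The per-element pass described above could look roughly like this. It is a sketch under the stated requirements, not the final code; the helper name normalize_terms is hypothetical.

```php
<?php
// Hypothetical helper name; the original post does not define it.
function normalize_terms(array $fullMatches)
{
    $out = array();
    foreach ($fullMatches as $term) {
        if ($term !== '' && $term[0] === '"') {
            // Quoted phrase: strip the quotes, collapse whitespace runs, trim.
            $inner = trim(preg_replace('/\s+/', ' ', substr($term, 1, -1)));
            if ($inner === '') {
                continue;                 // empty or whitespace-only phrase: drop it
            }
            $out[] = '"' . $inner . '"';  // restore the double quotes
        } else {
            $out[] = trim($term);         // unquoted word
        }
    }
    return $out;
}

$input = '"" one " two     three " four  ,  five "   six seven " " "';
preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $input, $Matches);
print_r(normalize_terms($Matches[0]));
// Elements: one, "two three", four, five, "six seven"
```

Run on the example input from the requirements, this produces the array shown there, with the empty and whitespace-only quoted phrases left out.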


Solution

I have finally developed a solution for this problem, which uses a few PHP statements with regular expressions. Below is the final function.

This function is part of a class which is why it begins with "public".

public function SearchString_ToArr($SearchString) {
    /*
    Purpose
        Used to parse the specified search string into an array of search terms.
        Search terms are delimited by <0 or more whitespace><optional comma><0 or more whitespace>
    Parameters
        SearchString (string) = The search string we're working with.
    Return (array)
        Returns an array using the following rules to parse the specified search string:
            - Each search term from the search string is converted to a single element in the returned array.
            - Search terms are delimited by whitespace and/or commas, or they may be double quoted.
            - Double-quoted search terms may contain multiple words.
        Unquoted Search Terms:
            - These are delimited by any number of whitespace characters or commas in the search string.
            - These have all leading and trailing whitespace trimmed.
        Quoted Search Terms:
            - These are surrounded by double-quotes in the search string.
            - These retain leading and trailing double-quotes in the returned array.
            - These have all leading and trailing whitespace trimmed.
            - These may contain whitespace.
            - These have all containing whitespace converted into a single space.
            - If these are zero-length or contain only whitespace, they are not included in the returned array.
        Example 1:
            SearchString =  ' "" one " two   three " four "five six" " " '
            Returns {"one", ""two three"", "four", ""five six""}
            Notes   The leading whitespace before the first "" is not returned.
                    The first quoted phrase ("") is empty so it is not returned.
                    The term "one" is returned with leading and trailing whitespace removed.
                    The phrase "two three" is returned with leading and trailing whitespace removed.
                    The phrase "two three" has containing whitespace converted to a single space.
                    The phrase "two three" has leading and trailing double-quotes retained.
                    ...
    Version History
        1.0 2013.09.18 Tested by Russ Tanner on PHP 5.3.10.
    */

    $r = array();
    $Matches = array();

    // Split the search string into an array based on whitespace, commas, and double-quoted phrases.
    preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $SearchString, $Matches);
    // At this point, $Matches[0] holds the full matches:
    //  1. all quoted strings have their own element and begin/end with the quote character.
    //  2. all non-quoted strings have their own element and are trimmed.
    //  3. empty quoted strings ("" or " ") are still present and are removed below.

    // Normalize quoted elements...
    // Convert every run of whitespace (including a single tab or newline) to a single space.
    $r = preg_replace('/\s+/', ' ', $Matches[0]);
    // Remove all whitespace between the double-quotes and the string.
    $r = preg_replace('/^"\s+/', '"', $r);
    $r = preg_replace('/\s+"$/', '"', $r);
    // Drop quoted phrases that were empty or contained only whitespace, then reindex.
    $r = array_values(array_filter($r, function ($Term) { return $Term !== '""'; }));

    return $r;
}
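
One detail worth checking when adapting this function: /\s\s+/ only matches runs of two or more whitespace characters, while /\s+/ also rewrites a lone tab or newline. A short comparison, using a sample phrase assumed for illustration:

```php
<?php
$phrase = "\"two\tthree  four\"";   // one tab, then two spaces (sample, assumed)

// Needs at least two whitespace characters, so the lone tab is left untouched.
$a = preg_replace('/\s\s+/', ' ', $phrase);
// Matches any run of one or more whitespace characters.
$b = preg_replace('/\s+/', ' ', $phrase);

var_dump($a === "\"two\tthree four\"");  // bool(true)
var_dump($b === '"two three four"');     // bool(true)
```

If the requirement is that all whitespace inside a quoted phrase becomes a single space, the second pattern is the one that satisfies it.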
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow