Question

I have a large CSV export where the columns do not align because some values were accidentally split across multiple cells instead of one. Fortunately, the values lie between two unique strings. I am hoping to use a regex to merge these values into one cell. Sample data is as follows:

"apple","NULL","0","0","0",",","1",",","fruit","red","sweet","D$","object"
"horse","NULL","0","0","0",",","1",",","animal","large","tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",",","city","California","smoggy","entertainment","D$","location"

The unmerged values begin after

"NULL","0","0","0",",","1",",","

And the unmerged values end before

","D$"

I'm trying to figure out a regex that would remove the "," between the values to merge them, so the output would look like:

"apple","NULL","0","0","0",",","1",",","fruit,red,sweet","D$","object"
"horse","NULL","0","0","0",",","1",",","animal,large,tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",",","city,California,smoggy,entertainment","D$","location"

Solution

You can do it like this:

$pattern = '~(?:"NULL","0","0","0",",","1",",","|(?!^)\G)[^"]+\K","(?!D\$)~';
$csv = preg_replace($pattern, ',', $csv);

pattern details:

~             # delimiter
(?:
    "NULL","0","0","0",",","1",",","
  |           
    (?!^)\G   # anchor for the end of the last match
)
[^"]+         # content between quotes
\K            # discards everything matched so far from the match result
","           # ","
(?!D\$)       # not followed by D$
~

The idea of the pattern is to use the \G anchor, which matches either at the start of the string or at the end of the previous match. I added (?!^) to rule out the first case.
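To see the effect of \G in isolation, here is a minimal sketch on a made-up input. The string and the variable names $input, $free and $chained are assumptions for illustration only and are not part of the original data.

```php
<?php
// Hypothetical input, chosen only to show how \G behaves.
$input = '1,2,3,x,4,';

// Without \G the engine may skip ahead, so the digit after "x" is found too.
preg_match_all('~\d,~', $input, $free);

// With \G every match must begin exactly where the previous one ended,
// so the chain stops for good at the first position that does not match.
preg_match_all('~\G\d,~', $input, $chained);

print_r($free[0]);    // 1,  2,  3,  4,
print_r($chained[0]); // 1,  2,  3,
```

In this sketch \G is allowed to match at the very start of the string. The pattern in the answer prevents exactly that with (?!^), so a chain can only begin after the fixed prefix.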

"NULL","0","0","0",",","1",","," is used as an entry point for the first match. Then the content between quotes is matched. Since the \K removes all on the left from the match result, only "," is replaced.

The next matches use \G as the entry point, and the contiguous matches continue until the negative lookahead (?!D\$) fails, that is, until the "," being tested is the one followed by D$.
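As a quick check, the pattern can be run against the sample rows. This is only a sketch: the variable names $csv and $merged are illustrative, and the third row is written with the same prefix as the other two, as it appears in the expected output.

```php
<?php
// Sample rows from the question. The nowdoc keeps quotes and $ literal.
$csv = <<<'CSV'
"apple","NULL","0","0","0",",","1",",","fruit","red","sweet","D$","object"
"horse","NULL","0","0","0",",","1",",","animal","large","tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",",","city","California","smoggy","entertainment","D$","location"
CSV;

// The fixed prefix starts a chain; \G continues it cell by cell.
// \K keeps only the "," separator in the match, so only that is replaced.
$pattern = '~(?:"NULL","0","0","0",",","1",",","|(?!^)\G)[^"]+\K","(?!D\$)~';
$merged = preg_replace($pattern, ',', $csv);

echo $merged, PHP_EOL;
```

Because the number of chained matches is not fixed, the row with four split values is handled by the same pattern as the rows with three.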

OTHER TIPS

The best I was able to do with a regex is match the entire run of values, without getting each one into its own capturing group. This means I wasn't able to do a plain match/replace without a callback function. How you do this depends on your language, but I'll show an example in PHP. Here is the regex:

(?<="NULL","0","0","0",",","1",",)(?:"[^"]+",?)+(?=,"D\$")

First, we start by looking behind ((?<=...)) for your "NULL","0","0","0",",","1",", string. Then we use a non-capturing repeated group ((?:...)+) that will catch 1+ CSV columns. The syntax inside matches ", followed by 1+ non-" characters, followed by " and an optional ,. Finally, we look ahead ((?=...)) for your ,"D\$" string, which ends the list of words.

Given this string:

"apple","NULL","0","0","0",",","1",","fruit","red","sweet","D$","object"

It will match:

"fruit","red","sweet"

In PHP, I used preg_replace_callback() to loop through each match and then I replace all instances of "," with ,. When $csv equals your sample data, this gives you your intended output.

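$csv = preg_replace_callback(
    // Match the whole run of split columns between the fixed prefix and ,"D$"
    '/(?<="NULL","0","0","0",",","1",",)(?:"[^"]+",?)+(?=,"D\$")/',
    function ($matches) {
        // $matches[0] is the full match, e.g. "fruit","red","sweet";
        // turning every "," into , merges the columns into one cell
        return str_replace('","', ',', $matches[0]);
    },
    $csv
);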

Output:

"apple","NULL","0","0","0",",","1",","fruit,red,sweet","D$","object"
"horse","NULL","0","0","0",",","1",","animal,large,tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",","city,California,smoggy,entertainment","D$","location"

Note: The reason I don't think I am able to do this in one simple regex replace is that (to my knowledge) a repeated capture group does not give you one group per repetition. If, for instance, we replaced the repeated non-capturing group with something like (?:"([^"]+)",?)+ (adding a capture group around the word, [^"]+), it would still count as only 1 capture group, and that group would hold only the last word matched. You could literally repeat that group and make each copy after the first optional with ?. However, you'd have to include at least as many copies as your longest row needs.
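The behaviour described in the note can be checked with a short sketch. The names $cells, $m and $all are illustrative, and the preg_match_all() call at the end is an alternative added here for comparison, not part of the answer above.

```php
<?php
$cells = '"fruit","red","sweet"';

// A repeated capture group keeps only its last repetition.
preg_match('/(?:"([^"]+)",?)+/', $cells, $m);
echo count($m), PHP_EOL; // 2: the full match plus one group
echo $m[1], PHP_EOL;     // sweet

// preg_match_all() with a non-repeated group collects every value instead.
preg_match_all('/"([^"]+)"/', $cells, $all);
echo implode(',', $all[1]), PHP_EOL; // fruit,red,sweet
```

This is why the answer matches the whole run first and then rewrites it in a callback, rather than trying to reference each word as its own group in the replacement string.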

Licensed under: CC-BY-SA with attribution