Question

Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:

Two words,Word,Word,Word,"Number, number"

What I need to do is format it like this...

"Two words","Word","Word","Word","Number, number"

I have had a RegEx pattern of

s/,/","/g

working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma, only the delimited number list.

I managed to write up

s/,[A-Za-z0-9]/","/g

which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that

s/(,)[A-Za-z0-9]\b

should work, but it doesn't.

Anyone have an idea?


Solution

s/,([^ ])/","$1/g will match a comma followed by a "not-a-space", capture the not-a-space, and replace the whole match with "," followed by the captured character.

Depending on which regex engine you're using, you might be writing \1 or other things instead of $1.
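
As an illustration of the backreference syntax in another engine, here is a minimal sketch in Python, whose re module writes the captured group as \1 in the replacement. The helper name requote is made up for this example and is not part of the original answer.

```python
import re

# Python's re module writes backreferences in the replacement as \1
# (or \g<1>), where Perl would use $1.
PATTERN = re.compile(r',([^ ])')

def requote(text: str) -> str:
    # Replace ",X" with '","X'; a comma followed by a space is left alone.
    return PATTERN.sub(r'","\1', text)

print(requote('a,b, c,d'))  # a","b, c","d
```

Because the character after the comma is consumed by the match, two commas in a row (an empty field) would only have the first one replaced.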

If you're using Perl or otherwise have access to a regex engine with negative lookahead, s/,(?! )/","/g (a comma not followed by a space) works.
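
A rough sketch of the lookahead version in Python, which also supports (?! ). It assumes that fields never contain literal double quotes, so the quotes already around the number list can be stripped first and re-added uniformly. The function name quote_fields is hypothetical.

```python
import re

def quote_fields(line: str) -> str:
    """Quote every field, splitting only on commas not followed by a space."""
    # Assumption: no field contains a literal double quote, so existing
    # quotes can be dropped and re-added around every field.
    bare = line.replace('"', '')
    # ,(?! ) matches a comma only when the next character is not a space.
    # The lookahead consumes nothing, so nothing needs to be put back.
    quoted = re.sub(r',(?! )', '","', bare)
    # The substitution only rewrites the separators; add the outer quotes.
    return f'"{quoted}"'

print(quote_fields('Two words,Word,Word,Word,"Number, number"'))
# "Two words","Word","Word","Word","Number, number"
```

Since the lookahead does not consume the following character, consecutive commas are handled too: quote_fields('a,,b') gives "a","","b".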

Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There are a lot of other odd corner cases to worry about.
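
For comparison, a minimal sketch of the parser route using Python's standard csv module, assuming the file really is ordinary comma-separated CSV with double-quote quoting. The function name requote_csv is made up for the example.

```python
import csv
import io

def requote_csv(text: str) -> str:
    """Parse CSV text and write it back with every field quoted."""
    rows = csv.reader(io.StringIO(text, newline=''))
    out = io.StringIO(newline='')
    # QUOTE_ALL tells the writer to wrap every field in double quotes.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerows(rows)
    return out.getvalue()

print(requote_csv('Two words,Word,Word,Word,"Number, number"\n'), end='')
# "Two words","Word","Word","Word","Number, number"
```

The parser decides where fields start and end, so commas and escaped quotes inside quoted fields need no special-casing in your own code.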

OTHER TIPS

My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in Perl, I use Text::CSV_XS or DBD::CSV (which lets me query a CSV file with SQL as if it were a table and, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.

This question is similar to: Replace patterns that are inside delimiters using a regular expression call.

This could work:

s/"([^"]*)"|([^",]+)/"$1$2"/g
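
The same alternation can be sketched in Python. A replacement function is used here so that whichever group matched is picked explicitly; the names FIELD and quote_each_field are invented for this example.

```python
import re

# Either an already-quoted field (group 1) or a run of characters
# containing no quote and no comma (group 2).
FIELD = re.compile(r'"([^"]*)"|([^",]+)')

def quote_each_field(line: str) -> str:
    def wrap(match: re.Match) -> str:
        # Exactly one of the two groups took part in the match.
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        return f'"{inner}"'
    return FIELD.sub(wrap, line)

print(quote_each_field('Two words,Word,Word,Word,"Number, number"'))
# "Two words","Word","Word","Word","Number, number"
```

The commas themselves are never matched, so they pass through untouched. One limitation of this sketch: an empty unquoted field (two commas in a row) matches neither alternative and stays unquoted.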

Looks like you're using sed.

While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotation marks around it. Otherwise, you're looking at areas of computational complexity regular expressions are not meant to handle.

Through sed, your command would be:

  sed 's/[ \"]*,[ \"]*/\", \"/g'

Note that you'll still have to put double quotes at the beginning and end of the string.
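
To see what this approach produces end to end, here is a rough Python equivalent of the substitution plus the outer quotes. The function name quote_around_commas is hypothetical, and the character class is written without the backslash since Python does not need it.

```python
import re

def quote_around_commas(line: str) -> str:
    # Swallow any spaces or quotes touching each comma, then emit '", "'.
    body = re.sub(r'[ "]*,[ "]*', '", "', line)
    # Drop any quote left over at either end, then add the outer quotes.
    return '"' + body.strip('"') + '"'

print(quote_around_commas('Two words,Word,Word,Word,"Number, number"'))
# "Two words", "Word", "Word", "Word", "Number", "number"
```

Under the stated assumption that every comma is a separator, the number list is split into two fields, so this only fits if that is the intended result.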

Licensed under: CC-BY-SA with attribution