Question

I have a file with space-separated values, and I need to change it to comma-separated values. Some string columns are enclosed in "" (double quotes), and those columns may contain spaces. I need to strip the double quotes and produce a file with , as the delimiter.

Could you please help me convert this using Unix (shell) scripting?

Sample Data:

abcd "Bala Chuppala" 1 200 "" "Norway" "" ? ? 9 88
ab "Joh Tanni S V S" 200 2 ? "Swiss" 1 100 200 ? 

Expected Output:

abcd,Bala Chuppala,1,200,,Norway,,?,?,9,88
ab,Joh Tanni S V S,200,2,?,Swiss,1,100,200,? 

Solution 2

This is somewhat ugly, as my C skills are rusty, but it works pretty well...

Save as redelimit.c and compile it like this:

gcc -o redelimit redelimit.c

or

cc -o redelimit redelimit.c

and then run it like this:

./redelimit

If you want to save the output, do this:

./redelimit > newfile.csv

It expects the input file to be called input.csv, in the current directory.

#include <stdio.h>
#include <string.h>

FILE *fp;

int main()
{
   int i,n;
   int inquotes;
   char line[1024];

   fp = fopen ("input.csv", "r");
   if(fp == NULL){
      /* Bail out instead of passing NULL to fgets() */
      perror("input.csv");
      return 1;
   }

   /* Loop through all lines in file */
   while(fgets(line, sizeof(line), fp) != NULL)
   {
      /* Remember if we are inside double quotes so we know what to do with spaces */
      inquotes=0;

      /* Parse each character in line */
      int len = strlen(line);
      for(i=0;i<len;i++){

         /* If these are double quotes, toggle value of "inquotes" variable */
         if(line[i]=='"'){
           inquotes=1-inquotes;
           continue;
         }

         if(line[i]==' '){
            if(inquotes){putchar(' ');} else {putchar(',');}
            continue;
         }

         putchar(line[i]);
      }
   }
   fclose(fp);
   return 0;
}

Output

abcd,Bala Chuppala,1,200,,Norway,,?,?,9,88
ab,Joh Tanni S V S,200,2,?,Swiss,1,100,200,?,

OTHER TIPS

In many ways, the custom C program shown by Mark Setchell is a good solution; it is concise, to the point, and relatively easy to use. (It would be easier still if it took its input from standard input, or if it took file name arguments and read from standard input when there were none. Hard-coded file names are very seldom a good idea for a general purpose tool.)
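As a rough sketch of that suggestion, here is the same quote-toggling loop ported to Python 3, reading from named files or from standard input when none are given. The names redelimit_line and main are made up for this illustration, and this is a port under those assumptions, not Mark's code.

```python
import fileinput
import sys


def redelimit_line(line):
    """Turn unquoted spaces into commas and drop the double quotes."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            # Toggle the state; the quote itself is not copied to the output.
            in_quotes = not in_quotes
            continue
        if ch == ' ' and not in_quotes:
            out.append(',')
        else:
            out.append(ch)
    return ''.join(out)


def main(paths):
    # fileinput reads the named files, or standard input if paths is empty.
    for line in fileinput.input(files=paths):
        sys.stdout.write(redelimit_line(line))
```

Calling main(sys.argv[1:]) at the bottom of the script would let it be run either with file names or at the end of a pipeline. Like the C version, it turns a trailing space into a trailing comma.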

If you're going to try to do it with standard tools, you're going to be using regular expressions. At first glance, your tools of choice include sed, awk, Perl and Python. If you use sed, it needs to be a version with extended regular expression support (at least allowing alternatives, |); I don't think you can do the mapping safely without it. It also turns out that the standard sub and gsub functions in awk are not really powerful enough; they do not support 'remembered strings' in the replacement other than the whole matching string (GNU awk's gensub does, but it is not portable).

What do you need to do?

  • Any sequence of zero or more 'not a double quote or space' followed by a space maps to a comma-terminated field (noting that if there is a comma in the input data, there'll be an extra field in the output because of the interloper).
  • Any sequence of a double quote followed by zero or more 'not a double quote' and then a double quote and a space maps to the 'zero or more' part followed by a comma.
  • Arguably, if the second double quote is not followed by a space, there is a format error — unless there's an escaping rule such as "Johnny ""The Singer"" Cholmondeley" maps to Johnny "The Singer" Cholmondeley. Even then, only another double quote would be valid, strictly.
  • A double quoted string at the end of the line maps to the unquoted string.

Ignoring pre-existing commas and embedded double quotes, it turns out to be easiest to do the replacement in two steps:

  1. Replace appropriate blanks by commas.
  2. Remove surrounding double quotes.

For example, in Perl or in sed with ERE support:

perl -p -e 's/([^ "]*|"[^"]*") /\1,/g; s/"([^"]*)"/\1/g' "$@"
sed  -E -e 's/([^ "]*|"[^"]*") /\1,/g; s/"([^"]*)"/\1/g' "$@"   # Mac OS X, BSD
sed  -r -e 's/([^ "]*|"[^"]*") /\1,/g; s/"([^"]*)"/\1/g' "$@"   # GNU

A Python solution is more verbose (or, at least, with my level of knowledge of Python, it is):

#!/usr/bin/python
from __future__ import print_function
import re
import fileinput

ssv = re.compile(r'([^ "]*|"[^"]*") ')
qqv = re.compile(r'"([^"]*)"')

for line in fileinput.input():
    line = ssv.sub(r'\1,', line)
    line = qqv.sub(r'\1', line)
    print(line, end='')

Note that the script works with Python 2 (2.7 tested, and nominally 2.6, but not earlier as the future import was not available earlier) as well as Python 3 (3.4 tested).
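If pre-existing commas or doubled double quotes cannot be ignored, one alternative (a different technique from the regular expressions above, and Python 3 only as written) is to let the standard csv module do the parsing: read with a space as the delimiter and write with the default comma. The function name ssv_to_csv is an assumption made for this sketch.

```python
import csv
import io


def ssv_to_csv(lines):
    """Convert space-separated, double-quoted lines to CSV text."""
    # The reader splits on spaces, honours "..." quoting and treats a
    # doubled quote inside a quoted field as one literal quote.
    reader = csv.reader(lines, delimiter=' ', quotechar='"')
    buf = io.StringIO()
    # The writer only quotes fields that need it, such as a field that
    # contains a comma or a double quote.
    writer = csv.writer(buf, lineterminator='\n')
    for row in reader:
        writer.writerow(row)
    return buf.getvalue()


sample = ['abcd "Bala Chuppala" 1 200 "" "Norway" "" ? ? 9 88\n']
print(ssv_to_csv(sample), end='')
# abcd,Bala Chuppala,1,200,,Norway,,?,?,9,88
```

The trade-off is that a field containing a comma or a double quote comes out quoted in the result, so the output is valid CSV but not completely quote-free.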

I did the same thing in awk in case you have a phobia against compilers :-)

#!/usr/bin/awk -f
{
    inq=0
    len=length($0)
    for(i=1;i<=len;i++){
      c=substr($0,i,1)
      if(c=="\""){inq=1-inq; continue}   # toggle state and drop the quote itself
      if(c==" "){
         if(inq==1)
         {
            printf " "
         } else {
            printf ","
         }
      } else {
         printf "%s",c
      }
    }
    printf "\n"
}

Save as marksscript and run it like this

chmod +x marksscript
./marksscript < input.csv
Licensed under: CC-BY-SA with attribution