Question

Here's an interesting problem: I have a generic price file with ID#, Description and Price to import that comes in as a comma delimited file (CSV or TSV) from a variety of vendors. One of the vendors uses a comma in their Description field. The problem is that the import treats every comma as the start of a new column, which throws off the record. (It would be easy to deal with if the import file were fixed length, but alas it is not.)

Question: Can anyone think of how to deal with a comma in Description? I'd like to replace the comma with a period or hyphen, which would be acceptable.

Here's what the file looks like.

ID,Description,Price
1234,Good Part,1.23
2345,This is.ok,2.34
3456,Bad Part,with a comma,4.56

In the first and second records, there are 3 columns, as there should be. The third record produces 4 columns and throws off the import, since it's looking for a currency value in the 3rd column but finds a string instead. I'm using Perl and JavaScript for the most part.

Solution

The most common solution is quoting fields that can contain "bad characters".

In this case:

3456,"Bad Part,with a comma",4.56

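And in turn, if the field itself contains a " character, escape it with \ (and escape a literal \ the same way). Be aware that many CSV parsers follow the RFC 4180 convention instead, where a quote inside a quoted field is doubled ("").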
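
Since the question mentions JavaScript as well as Perl, here is a minimal sketch of how a reader of such a quoted file might split one line. It assumes the backslash escaping described above (not the doubled-quote convention some CSV dialects use), and the function name splitQuotedLine is hypothetical. For real files a tested CSV library is the safer choice.

```javascript
// Split one line of quoted CSV into fields.
// Assumption: inside a quoted field, \" is a literal quote and \\ is a
// literal backslash. Doubled quotes ("") are NOT handled here.
function splitQuotedLine(line) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes && ch === "\\" && i + 1 < line.length) {
      current += line[i + 1]; // keep the escaped character, drop the backslash
      i++;
    } else if (ch === '"') {
      inQuotes = !inQuotes;   // opening or closing quote, not part of the value
    } else if (ch === "," && !inQuotes) {
      fields.push(current);   // an unquoted comma ends the field
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);       // the last field has no trailing comma
  return fields;
}

console.log(splitQuotedLine('3456,"Bad Part,with a comma",4.56'));
// → [ '3456', 'Bad Part,with a comma', '4.56' ]
```

The comma inside the quotes stays part of the Description, so the record comes back with three fields.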

OTHER TIPS

So, you have something that vaguely resembles a CSV file, but isn't. One thing you can do is close the gap and then process it normally -- everyone else has suggested ways of doing this. Another thing you can do is shrug and process it as it is, as something other than CSV.

Here, we have an ID at the beginning of the line, followed by a comma.

/^(\d+),/;

And then anything at all, followed by a comma:

/^(\d+),(.+),/

And then a price, followed by the end of the line:

/^(\d+),(.+),(\d+(?:\.\d+)?)$/

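And yes, that (.+), in the middle works as you want with embedded commas. The + is greedy, so .+ first grabs the whole rest of the line and then backtracks from right to left until it reaches the first point that allows the rest of the pattern to match. The description therefore ends at the last comma, the one just before the price.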
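
The question also mentions JavaScript, and the same pattern behaves the same way there. The following is only a sketch: the names LINE and parseLine are made up for illustration.

```javascript
// The pattern from above as a JavaScript regex: ID, anything, price.
const LINE = /^(\d+),(.+),(\d+(?:\.\d+)?)$/;

// Hypothetical helper: returns the three fields, or null when the line
// does not look like a record (for example the header line).
function parseLine(line) {
  const m = LINE.exec(line);
  if (m === null) return null;
  return { id: m[1], description: m[2], price: m[3] };
}

// The greedy (.+) gives commas back from the right until the price matches,
// so every embedded comma ends up in the description.
console.log(parseLine("3456,Bad Part,with a comma,4.56"));
console.log(parseLine("ID,Description,Price")); // null: no numeric ID
```

One consequence worth knowing: a description that itself ends in something like ",12" is still kept whole, because only the text after the last comma is tried as the price.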

Altogether:

#! /usr/bin/env perl
use common::sense;

while (<DATA>) {
  next unless /^(\d+),(.+),(\d+(?:\.\d+)?)$/;
  say "ID: $1";
  say "Description: $2";
  say "Price: $3";
  say "----"
}

__DATA__
ID,Description,Price
1234,Good Part,1.23
2345,This is.ok,2.34
3456,Bad Part,with a comma,4.56

And, a bit neater (although the names are longer than what they name...):

#! /usr/bin/env perl
use common::sense;

while (<DATA>) {
  chomp;  # chomp's return value is 0 for a final line without a newline
  next if /
    ^ID,Description,Price\z  # allow only this header
    | ^\s*\z                 # and blank lines
    | ^\s*\#                 # and lines containing only a comment
  /xi;

  /^(?<ID> \d+),
    (?<Description> .+),
    (?<Price> \d+(?:\.\d+)?)
  \z/x or die "Invalid line: $_";

  say "$_: $+{$_}" for qw(ID Description Price);
  say "----";
}

__DATA__
ID,Description,Price
1234,Good Part,1.23
2345,This is.ok,2.34

# why do we allow this again?
id,description,price
3456,Bad Part,with a comma,4.56

Both output:

ID: 1234
Description: Good Part
Price: 1.23
----
ID: 2345
Description: This is.ok
Price: 2.34
----
ID: 3456
Description: Bad Part,with a comma
Price: 4.56
----

Yes, you would need to change this regex to suit a slightly different notCSV, but you would need to change your gap-closer just the same. This is why notCSV is bad.

Based on your comment in depesz's answer, here is my effort to try to surround that field between double quotes. Then just use Text::CSV_XS or similar to parse it.

Content of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;

my ($f, $num_fields_h);

while ( <> ) { 
    chomp;

    ## Header:
    ## Get the position of the "Description" field and the total number
    ## of fields. I assume that header doesn't have the problem of commas
    ## in the middle.
    if ( $. == 1 ) { 
        my %h = do { my $i = 0; map { $_ => $i++ } split /,/ };
        $f = $h{ Description };
        $num_fields_h = (tr/,/,/) + 1;
        printf qq|%s\n|, $_; 
        next;
    }   

    ## Data lines:
    ## Split the line and join fields in three parts, the first one until the
    ## "Description" calculated in header. The second one from that position until
    ## the difference of fields between the header and this line. That number will
    ## be the number of commas in the description. The third one from that calculated
    ## position until the end.
    my @f = split /,/; 
    my $num_fields_d = (tr/,/,/) + 1;
    my $limit_description_field = $f + $num_fields_d - $num_fields_h;
    printf qq|%s\n|, 
        join q|,|, 
            @f[ 0 .. $f - 1 ],  
            q|"| . join( q|,|, @f[ $f .. $limit_description_field ] ) . q|"|, 
            @f[ ($limit_description_field + 1) .. $#f ];  
}

Run it like:

perl script.pl infile

That yields:

ID,Description,Price
1234,"Good Part",1.23
2345,"This is.ok",2.34
3456,"Bad Part,with a comma",4.56

How about this:

 my $x = '3456,Bad Part,with a comma,4.56';
 my @y = split /,/, $x;
 my $desc = $y[1];
 if ( $#y == 3 ) {    # four fields: the description held one comma
    $desc = "$y[1],$y[2]";
 }
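
The snippet above only covers a description with exactly one extra comma. A possible generalization, sketched here in JavaScript because the question mentions it, is to take the first piece as the ID, the last piece as the price, and rejoin everything in between. The function name splitRecord is hypothetical, and the sketch assumes the ID and Price never contain a comma.

```javascript
// Split on every comma, then treat the first piece as the ID, the last
// piece as the price, and glue the middle pieces back together.
// Assumption: only the Description can contain commas.
function splitRecord(line) {
  const parts = line.split(",");
  if (parts.length < 3) return null; // not enough fields for a record
  const id = parts.shift();          // first piece
  const price = parts.pop();         // last piece
  return { id, description: parts.join(","), price };
}

console.log(splitRecord("3456,Bad Part,with a comma,4.56"));
console.log(splitRecord("7,a,b,c,d,9.99").description); // → a,b,c,d
```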

If you know how many fields there are, and trust all but one of them, then you could parse the good parts from both ends, and whatever is left would be the bad field; i.e.

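while (<>) {
    chomp;
    # The first and last fields contain no comma; the middle one may.
    next unless m/^([^,]+),(.+),([^,]+)$/;
    my @fields = ($1, $2, $3);
    $fields[1] =~ s/,/-/g;    # replace embedded commas with hyphens
    print join(',', @fields), "\n";
}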

So the anchored parts at the beginning and at the end won't contain a comma, but the middle field between them can.
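
The same both-ends idea can be written without a regex. Below is a rough JavaScript sketch (the question mentions JavaScript) that finds the first and last comma and replaces whatever commas lie between them, which is the substitution the question asks for. The name repairLine and the default hyphen are illustrative choices, not part of the answer above.

```javascript
// Locate the first and last comma; everything between them is the
// Description, and any commas inside it are replaced.
// Assumption: the first and last fields never contain a comma.
function repairLine(line, replacement = "-") {
  const first = line.indexOf(",");
  const last = line.lastIndexOf(",");
  if (first === -1 || first === last) return null; // fewer than three fields
  const id = line.slice(0, first);
  const description = line.slice(first + 1, last).replace(/,/g, replacement);
  const price = line.slice(last + 1);
  return [id, description, price].join(",");
}

console.log(repairLine("3456,Bad Part,with a comma,4.56"));
// → 3456,Bad Part-with a comma,4.56
console.log(repairLine("1234,Good Part,1.23"));
// → 1234,Good Part,1.23
```

The repaired line has exactly three comma-separated fields again, so it can be handed to the existing import unchanged.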

Licensed under: CC-BY-SA with attribution