Question

I have a file with a header row whose columns are separated by a mix of characters, and I am trying to read it using Text::CSV, which I have used successfully on comma-separated files to pull data into an array of hashes in other scripts. I have read that Text::CSV does not support multiple separators (spaces, tabs, commas), so I was trying to clean up each row with a regex before handing it to Text::CSV. The data file also has comment lines in the middle. Unfortunately, I do not have admin rights to install libraries that might accommodate multiple sep_chars, so I was hoping I could use Text::CSV or some other standard method to clean up the header and rows before adding them to the AoH. Or should I abandon Text::CSV?

I'm obviously still learning. Thanks in advance.

Example file:

#
#
#
# name scale     address      type
test.data.one   32768       0x1234fde0      float
test.data.two   32768               0x1234fde4      float
test.data.the   32768       0x1234fde8      float
# comment lines in middle of data
test.data.for   32768                 0x1234fdec      float
test.data.fiv   32768       0x1234fdf0      float

Code excerpt:

my $fh;
my $input;
my $header;
my $pkey;
my $row;
my %arrayofhashes;   

my $csv=Text::CSV({sep_char = ","})
    or die "Text::CSV error: " Text::CSV=error_diag;

open($fh, '<:encoding(UTF-8)', $input)
    or die "Can't open $input: $!";

while (<$fh>) {
    $line = $_;
    # skip to header row
    next if($line !~ /^# name/);
    # strip off leading chars on first column name
    $header =~ s/# //g;
    # replace multiple spaces and tabs with comma
    $header =~ s/ +/,/g;
    $header =~ s/t+/,/g;
    # results in $header = "name,scale,address,type"
    last;
}

my @header = split(",", $header);
$csv->parse($header);
$csv->column_names([$csv->fields]);
# above seems to work!

$pkey = 0;
while (<$fh>) {
    $line = $_;
    # skip comment lines
    next if ($line =~ /^#/);
    # replace spaces and tabs with commas
    $line =~ s/( +|\t+)/,/g;
    # replace multiple commas from previous regex with single comma    
    $line =~ s/,+/,/g;
    # results in $line = "test.data.one,32768,0x1234fdec,float"

    # need help trying to create a what I think needs to be a hash from the header and row.
    $row = ?????;
    # the following line works in my other perl scripts for CSV files when using:
    # while ($row = $csv->getline_hr($fh)) instead of the above.  
    $arrayofhashes{$pkey} = $row;
    $pkey++;
}

Solution

If your columns are separated by runs of spaces, Text::CSV is of no use here. Much of your code is repeated cleanup that exists only to work around Text::CSV's limitations.

Also, your code has bad style, contains multiple syntax errors and typos, and uses confusing variable names.

So You Want To Parse A Header.

We need a definition of the header line for our code. Let's take “the first comment line that contains non-space characters”. It may not be preceded by non-comment lines.

use strict; use warnings; use autodie;
use feature 'say';  # enables the `say` used in the usage examples further down

open my $fh, '<:encoding(UTF-8)', "filename.tsv";  # error handling by autodie

my @headers;
while (<$fh>) {
  # no need to copy to a $line variable, the $_ is just fine.
  chomp;                                     # remove line ending
  s/\A#\s*// or die "No header line found";  # remove comment char, or die
  /\S/ or next;                              # skip if there is nothing here
  @headers = split;                          # split the header names.
                                             # A bare `split` means `split ' ', $_`:
                                             # split on whitespace, ignoring leading whitespace
  last;                                      # break out of the loop: the header was found
}

The \s character class matches whitespace characters (spaces, tabs, newlines, etc.). The \S is the inverse and matches all non-whitespace characters.
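One subtlety is worth seeing in action: a bare split behaves like split ' ', $_, which discards leading whitespace, whereas split /\s+/ keeps an empty leading field. The sketch below uses a made-up sample line (not from the question's file) to show the difference.

```perl
use strict;
use warnings;

# Hypothetical sample line with leading whitespace and mixed separators.
my $line = "  name scale \t address   type";

my @awk_style = split ' ', $line;    # what a bare `split` does (applied to $_)
my @regex     = split /\s+/, $line;  # keeps an empty field before "name"

print scalar(@awk_style), " fields: @awk_style\n";
print scalar(@regex), " fields, first is '$regex[0]'\n";
```

For whitespace-separated data the awk-style form is usually the one you want, since stray indentation then does not shift every column by one.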

The Rest

Now we have our header names, and can proceed to normal parsing:

my @records;
while (<$fh>) {
  chomp;
  next if /\A#/;              # skip comments
  my @fields = split;
  my %hash;
  @hash{@headers} = @fields;  # use hash slice to assign fields to headers
  push @records, \%hash;      # add this hashref to our records
}

Voilà.
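To try the two loops without touching a real file, here is a minimal self-contained sketch that reads a shortened copy of the sample data through an in-memory filehandle. The extra check that skips blank lines is an assumption on my part; the original data does not show any.

```perl
use strict;
use warnings;

# Shortened sample data from the question, held in a string.
my $data = <<'END';
#
#
# name scale     address      type
test.data.one   32768       0x1234fde0      float
# comment lines in middle of data
test.data.two   32768               0x1234fde4      float
END

# Opening a reference to a scalar gives an in-memory filehandle.
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

# Header: the first comment line that contains non-space characters.
my @headers;
while (<$fh>) {
    chomp;
    s/\A#\s*// or die "No header line found";
    /\S/ or next;
    @headers = split;
    last;
}

# Records: one hash per data line, keyed by the header names.
my @records;
while (<$fh>) {
    chomp;
    next if /\A#/;         # skip comments
    next unless /\S/;      # assumption: blank lines are skipped too
    my %record;
    @record{@headers} = split;   # hash slice: headers => fields
    push @records, \%record;
}

printf "%s at %s\n", $_->{name}, $_->{address} for @records;
```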

The Result

This code produces the following data structure from your example data:

@records = (
  {
    address => "0x1234fde0",
    name    => "test.data.one",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fde4",
    name    => "test.data.two",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fde8",
    name    => "test.data.the",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fdec",
    name    => "test.data.for",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fdf0",
    name    => "test.data.fiv",
    scale   => 32768,
    type    => "float",
  },
);

This data structure could be used like

for my $record (@records) {
  say $record->{name};
}

or

for my $i (0 .. $#records) {
  say "$i: $records[$i]{name}";
}

Criticism Of Your Code

  • You declare all your variables at the top of your script, effectively making them global variables. Don't. Create your variables in the smallest scope possible. My code uses just three variables in the outer scope: $fh, @headers and @records.

  • This line my $csv=Text::CSV({sep_char = ","}) doesn't work as expected.

    • Text::CSV is not a function; it is the name of a module. You meant Text::CSV->new(...).
    • The options should be a hashref, but sep_char = "," tries to assign something to sep_char. Sadly, this could be valid syntax. But you actually meant to specify a key-value relationship. Use the => operator instead (called fat comma or hash rocket).
  • Neither does this work: or die "Text::CSV error: " Text::CSV=error_diag.

    • To concatenate strings, use the . concatenation operator. What you wrote is a syntax error: a string literal cannot be followed directly by another term; an operator has to come between them.
    • You really like assignments? The Text::CSV=error_diag does not work. You intended to call the error_diag method on the Text::CSV class. Therefore, use the correct operator ->: Text::CSV->error_diag.
  • The substitution s/t+/,/g replaces every run of the letter t with a comma. To match tabs, use the \t escape sequence.

  • %arrayofhashes is not an array of hashes: It is a hash (as evidenced by the % sigil), but you use integer numbers as keys. Arrays have the @ sigil.

  • To add something to the end of an array, I'd rather not keep the index of the last item in an extra variable. Rather, use the push function to add an item to the end. This reduces the amount of bookkeeping code.

  • If you find yourself writing a loop like my $i = 0; while (condition) { do stuff; $i++ }, then you usually want to have a C-style for loop:

    for (my $i = 0; condition; $i++) {
      do stuff;
    }
    

    This also helps with proper scoping of variables.
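Several of the points above can be checked in a few lines. The following is a rough sketch, not a replacement for the solution: the sample line is made up, and the Text::CSV part runs only if the module happens to be installed.

```perl
use strict;
use warnings;

# Hypothetical data line mixing spaces and real tab characters.
my $line = "test.data.one \t 32768\t\t0x1234fde0   float";

# s/t+/,/g targets the letter "t"; the tabs survive untouched.
(my $wrong = $line) =~ s/t+/,/g;

# [ \t]+ matches runs of spaces and tabs and collapses each to one comma.
(my $right = $line) =~ s/[ \t]+/,/g;
print "$right\n";

# push appends to a real array; no index variable to maintain.
my @records;
for my $name (qw(test.data.one test.data.two)) {
    push @records, { name => $name };
}
print scalar(@records), " records\n";

# Corrected constructor call, attempted only when Text::CSV is available.
if (eval { require Text::CSV; 1 }) {
    my $csv = Text::CSV->new({ sep_char => "," })
        or die "Text::CSV error: " . Text::CSV->error_diag;
    $csv->parse($right) or die "Could not parse line";
    my @fields = $csv->fields;
    print "Text::CSV saw ", scalar(@fields), " fields\n";
}
```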

Licensed under: CC-BY-SA with attribution