Trying to parse values from a TSV file into 2 matching Bash arrays

https://stackoverflow.com/questions/22800122

25-06-2023

Question

Unit Title      Class Title         File Name

Unit Title1     Title1              Filename1

Unit Title2     Title2              Filename2
                Title3              Filename3
                Title4              Filename4
                Title5              Filename5

Unit Title3     Title6              Filename6
                Title7              Filename7
                Title8              Filename8
                Title9              Filename9

Unit Title4     Title10             Filename10
                Title11             Filename11
                Title12             Filename12

I have a large number of TSV (tab-separated values) files structured like this. I'm trying to write a Bash script that parses these files into matching arrays. It's the empty lines that are throwing me for a loop. I need to be able to list each class title along with the "Unit Title" it falls under.

I can get each of the groups into its own array, but I can't duplicate the entries in "Unit Titles" to line up with the Class Titles. Can someone point me in the right direction? Thanks!

Solution

It's unclear to me exactly what you want the arrays to look like, but perhaps pre-processing the input files to have all columns filled in helps:

awk -F'\t' -v OFS='\t' '
  $0 != "" {  # process only non-empty lines
      # If field 1 is empty, set it to the most recent unit title.
    if ($1 != "") ut=$1; else $1=ut;
      # Print the (rebuilt) line.
    print
  }' tsvfile

This produces output like the following (\t represents a literal tab), which should be easier to parse:

Unit Title1\tTitle1\tFilename1
Unit Title2\tTitle2\tFilename2
Unit Title2\tTitle3\tFilename3
Unit Title2\tTitle4\tFilename4
Unit Title2\tTitle5\tFilename5
Unit Title3\tTitle6\tFilename6
Unit Title3\tTitle7\tFilename7
...
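
With every row filled in, reading the result into two parallel Bash arrays could look roughly like the sketch below. It is only one possible approach: the function names fill_units and load_tsv, the array names units and classes, and the sample data are made up for illustration, and it assumes Bash (process substitution and arrays are not available in plain sh).

```bash
#!/usr/bin/env bash

# fill_units FILE: print FILE with blank lines dropped and empty first
# fields replaced by the most recent unit title (same awk logic as above).
fill_units() {
  awk -F'\t' -v OFS='\t' '
    $0 != "" {
      if ($1 != "") ut = $1; else $1 = ut
      print
    }' "$1"
}

# load_tsv FILE: fill the global arrays "units" and "classes" (hypothetical
# names) so that units[i] is the unit title belonging to classes[i].
load_tsv() {
  units=() classes=()
  local unit class _rest
  # Tab counts as IFS whitespace, so read would collapse empty fields.
  # That is acceptable here only because fill_units has already filled
  # in field 1 and the other columns are assumed to be non-empty.
  while IFS=$'\t' read -r unit class _rest; do
    units+=("$unit")
    classes+=("$class")
  done < <(fill_units "$1")
}

# Demo with a small sample file (made-up data mirroring the question).
sample=$(mktemp)
printf 'Unit Title1\tTitle1\tFilename1\n\nUnit Title2\tTitle2\tFilename2\n\tTitle3\tFilename3\n' > "$sample"
load_tsv "$sample"
for i in "${!classes[@]}"; do
  printf '%s -> %s\n' "${classes[i]}" "${units[i]}"
done
rm -f "$sample"
```

If the real files start with a header row, that row ends up as the first element of both arrays; changing the awk condition to NR > 1 && $0 != "" would skip it, assuming the header is always on line 1.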

Licensed under: CC-BY-SA with attribution
