Question

I'm doing a test with these files:

comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R2_001.fastq

I want to concatenate the R1 files that share the same code (the part of the name before the first underscore) into separate output files. Each output file should be named after that code.

This is my code, but I'm having trouble creating the output files.

#!/bin/bash

for i in {900..995}; do
    if [[ ${i} -eq ${i} ]]; then
        cat comp${i}_*_R1_001.fastq
    fi
done

I want to have two outputs:

One output will have all lines from:

comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq

and its name should be comp900_R1.out

The other output will have lines from:

comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq

and its name should be comp995_R1.out

Finally, as I said, this is a small test. I want my script to work with a lot of files that have the same characteristics.


Solution

Using awk:

ls -1 *.fastq | awk -F_ '$8 == "R1" {system("cat " $0 ">>" $1 "_R1.out")}'

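This lists all *.fastq files and pipes the names into awk, which splits each name on _. If the eighth field ($8) is R1, awk runs cat to append (>>) that file to an output named after the first field ($1) plus _R1.out, which will be comp900_R1.out or comp995_R1.out. It assumes that no filename contains spaces or other special characters, and that R1 is always the eighth underscore-separated field.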
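To see why $1 and $8 are the fields that matter, it can help to run a single filename through the same split. This is only a small illustration using one of the sample names from the question; the variable names name and fields are made up for the example.

```shell
#!/bin/bash
# One sample filename from the question (variable names are illustrative only).
name="comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq"

# With "_" as the field separator, $1 is the code before the first
# underscore and $8 is the read tag (R1 or R2).
fields=$(printf '%s\n' "$name" | awk -F_ '{print $1, $8}')
echo "$fields"   # prints: comp900 R1
```

If the real filenames have a different number of underscore-separated parts, the field number 8 would need to change accordingly.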

Result:

File comp900_R1.out containing all lines from

comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq

and file comp995_R1.out containing all lines from

comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq

OTHER TIPS

My stab at a general solution:

#!/bin/bash

for f in *_R1_*; do
   code=$(echo "$f" | cut -d _ -f 1)
   cat "$f" >> "${code}_R1.out"
done

This iterates over the files with _R1_ in their names and appends each file's contents to an output file named after its code.

cut pulls out the code by splitting the filename (-d _) and returning the first field (-f 1).
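As a possible variant, the code can also be extracted with bash parameter expansion instead of cut, which avoids starting an extra process per file. The sketch below is not the answerer's script: it assumes the comp900_R1.out naming asked for in the question, and it runs in a scratch directory with made-up one-line files so that no real data is touched.

```shell
#!/bin/bash
# Scratch directory with dummy files (contents are invented for the demo).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\n' > comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
printf 'b\n' > comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
printf 'x\n' > comp900_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
printf 'c\n' > comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq

shopt -s nullglob                 # an unmatched glob expands to nothing
for f in *_R1_*.fastq; do
    code=${f%%_*}                 # drop everything from the first "_" onward
    cat -- "$f" >> "${code}_R1.out"
done
```

Because >> appends, running the loop twice would duplicate the contents, so any old *_R1.out files should be removed before a re-run.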

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow