Excluding 'duplicates' in ls/find

https://stackoverflow.com/questions/21377082

03-10-2022

Question

I am using a program which writes a lot of output files with names in this format:

run_1_0001.blah
run_1_0002.blah
run_2_0001.blah
run_3_param_2_0001.blah
run_3_param_2_0002.blah

Each run drops several thousand of these files into the same directory. The head of the filename is arbitrary and may contain numbers. The only consistently predictable part is that the filename ends with a 4-digit number and an extension. What I'd like is to write an alias that excludes these pseudo-duplicates and produces a single line of output for each collection of files. In the rubbish example I've given, the output would be:

run_1_.blah
run_2_.blah
run_3_param_2_.blah

Apologies if this is easy. I did have a look around but couldn't find anything.

Solution

Assuming that only the numbers differ between the duplicates, you could delete all digits and pass the result to uniq. For example:

Create test files:

touch some_filename_0001.blah some_filename_0002.blah some_otherfilename_0001.blah

Delete numbers and pass to uniq:

ls | tr -d '0-9' | uniq

Output:

some_filename_.blah
some_otherfilename_.blah
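
Since the question says the head of the filename may itself contain numbers, deleting every digit would merge run_1_ and run_2_ into one entry. A possible variant, as a sketch only, removes just the four digits directly before the extension. The function name collapse_runs is made up for illustration, and the sketch assumes a sed that accepts -E (GNU and BSD sed both do) and filenames without newlines:

```shell
# Hypothetical helper: reads filenames on stdin, strips the 4-digit counter
# that sits directly before the final extension, and prints each result once.
collapse_runs() {
    sed -E 's/[0-9]{4}(\.[^.]+)$/\1/' | sort -u
}

# Demo with the names from the question (in practice: ls | collapse_runs)
printf '%s\n' \
    run_1_0001.blah run_1_0002.blah run_2_0001.blah \
    run_3_param_2_0001.blah run_3_param_2_0002.blah |
    collapse_runs
# Output:
# run_1_.blah
# run_2_.blah
# run_3_param_2_.blah
```

sort -u is used here instead of uniq so that the result does not depend on identical names ending up on adjacent lines.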

Edit

Based on your updated test data and the fact that you want to use ls -la, I suggest using awk to parse the data. In my version of ls the filename is the 9th element in ls -la output, so something like this should work:

ls -la | awk '{ sub("[0-9]{4}", "", $9) } !h[$9]++'

This removes the first sequence of four digits from the filename column and prints the line only if the resulting name has not been seen before.

Caveats: This assumes that file names do not contain spaces. Also, the "runs" and "parameters" parts should not contain four or more consecutive digits; if they do, you need a more specific regular expression that anchors the substitution to the end of the filename.
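
One way the anchored substitution might look is sketched below. It is not the only option, and it still assumes the filename is the 9th field and contains no spaces. The function name dedupe_listing is hypothetical. The four digits are spelled out as [0-9][0-9][0-9][0-9] because some awk implementations (older mawk, for example) do not understand {4}:

```shell
# Hypothetical helper: reads `ls -la`-style lines on stdin.
dedupe_listing() {
    awk '
    {
        name = $9
        # Look for a 4-digit counter directly before the final extension.
        if (match(name, /[0-9][0-9][0-9][0-9]\.[^.]+$/)) {
            # Drop the four digits, keep the prefix and the extension.
            name = substr(name, 1, RSTART - 1) substr(name, RSTART + 4)
            $9 = name
        }
    }
    # Print only the first line seen for each collapsed name.
    !seen[name]++
    '
}

# Demo with made-up listing lines (in practice: ls -la | dedupe_listing)
printf '%s\n' \
    '-rw-r--r-- 1 u g 0 Jan 1 00:00 run_1_0001.blah' \
    '-rw-r--r-- 1 u g 0 Jan 1 00:00 run_1_0002.blah' \
    '-rw-r--r-- 1 u g 0 Jan 1 00:00 run_2_0001.blah' \
    '-rw-r--r-- 1 u g 0 Jan 1 00:00 run_3_param_2_0001.blah' |
    dedupe_listing
```

Because the pattern ends in $, digits elsewhere in the name (such as a run called 1234) are left alone.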

Licensed under: CC-BY-SA with attribution
