grep based on first column

https://stackoverflow.com/questions/23591927

linux
grep

20-07-2023

Question

I have a big data file called fileA with the following format:

col1    0.1111,0.2222,0.33333,0.4444
col5    0.1111,0.2222,0.33333,0.4444
col3    0.1111,0.2222,0.33333,0.4444
col4    0.1111,0.2222,0.33333,0.4444

The separator between the 1st and 2nd columns is \t; the other separators are commas. I have another file, called fileB, containing the names of the rows I am interested in. It looks like this:

col3
col1
...

Neither file is sorted. I want to retrieve all the rows from fileA whose names appear in fileB. The command grep -f fileB fileA does the job, but I think it searches every field in fileA, which takes a long time. How can I tell it to search only the 1st column of fileA?

Solution

join -t $'\t' <(sort -t $'\t' -k1,1 fileA) <(sort fileB)

The files are sorted in O(n log n + p log p) and then merged in O(n + p), where n and p are the line counts of fileA and fileB. I don't think a sort-based approach can do better than that.

EDIT: OK, we can do better with a hash table, which runs in O(n + p).
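
To make the sort-and-join approach concrete, here is a small sketch on made-up sample data. The values, the scratch directory and the intermediate file names are assumptions for illustration only. It writes the sorted files to disk instead of using process substitution, so it should also run in shells that lack <(...).

```shell
# Work in a scratch directory so the sample files do not clobber real ones.
tmpdir=$(mktemp -d)
tab=$(printf '\t')   # a literal tab, without relying on bash's $'\t'

# Hypothetical sample data in the question's format: name, tab, values.
printf 'col1\t0.1,0.2\ncol5\t0.3,0.4\ncol3\t0.5,0.6\ncol4\t0.7,0.8\n' > "$tmpdir/fileA"
printf 'col3\ncol1\n' > "$tmpdir/fileB"

# join needs both inputs sorted on the join field, in the same collation.
LC_ALL=C sort -t "$tab" -k1,1 "$tmpdir/fileA" > "$tmpdir/fileA.sorted"
LC_ALL=C sort "$tmpdir/fileB" > "$tmpdir/fileB.sorted"

# Join on the first field, using tab as the field separator.
join_result=$(LC_ALL=C join -t "$tab" "$tmpdir/fileA.sorted" "$tmpdir/fileB.sorted")
printf '%s\n' "$join_result"

rm -rf "$tmpdir"
```

The output comes back ordered by name rather than in fileA's original order, which is a side effect of sorting.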

Other answers

A linear-time, O(n), solution without sorting (I didn't test it, so I hope there are no typos):

awk -F'\t' 'NR==FNR{a[$0]=7;next}a[$1]' fileB fileA

Note that a lookup in a hash table is considered O(1).
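
As a quick check of the hash-table idea, the sketch below runs a slightly modified version of the awk one-liner on made-up sample data. The values, the scratch directory and the array name wanted are assumptions for illustration. It tests membership with $1 in wanted rather than reading the array element directly, because merely referencing an awk array element creates it, so the array would otherwise grow with every non-matching row of fileA.

```shell
# Work in a scratch directory so the sample files do not clobber real ones.
tmpdir=$(mktemp -d)

# Hypothetical sample data in the question's format: name, tab, values.
printf 'col1\t0.1,0.2\ncol5\t0.3,0.4\ncol3\t0.5,0.6\ncol4\t0.7,0.8\n' > "$tmpdir/fileA"
printf 'col3\ncol1\n' > "$tmpdir/fileB"

# First file (NR==FNR holds only while reading fileB): store each name as a key.
# Second file (fileA): print the row when its first tab-separated field is a key.
awk_result=$(awk -F'\t' 'NR==FNR { wanted[$0]; next } $1 in wanted' \
    "$tmpdir/fileB" "$tmpdir/fileA")
printf '%s\n' "$awk_result"

rm -rf "$tmpdir"
```

Unlike the sort-based approach, this keeps the matching rows in fileA's original order. One caveat: if fileB is empty, NR==FNR also holds for fileA, so nothing is printed and every fileA line is stored as a key.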

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow