Union of two columns of a TSV file

https://stackoverflow.com/questions/19020255

29-06-2022

Question

I have a file that stores a directed graph, one edge per line, in the form

node1 TAB node2 TAB weight

I want to find the set of nodes, i.e. the union of the first two columns. Is there a better way to get this union? My current solution involves creating temporary files:

cut -f1 input_graph | sort | uniq > nodes1
cut -f2 input_graph | sort | uniq > nodes2
cat nodes1 nodes2 | sort | uniq > nodes

Solution

{ cut -f1 input_graph; cut -f2 input_graph; } | sort | uniq

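There is no need to sort twice: concatenate both columns first, then sort and deduplicate once. (sort -u can replace sort | uniq.)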

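The { cmd1; cmd2; } syntax groups commands much like (cmd1; cmd2), but may avoid a subshell because the braces run the commands in the current shell. The space after { and the semicolon (or newline) before } are required.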
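
As a small illustration of the difference (not part of the original answer; the variable names are made up), an assignment inside braces survives, while one inside parentheses is lost when the subshell exits:

```shell
# Braces run in the current shell, so the assignment persists.
count=1
{ count=2; }
after_braces=$count

# Parentheses run in a subshell, so the assignment is lost on exit.
( count=3 )
after_parens=$count

echo "after braces: $after_braces"
echo "after parens: $after_parens"
```

Both lines report 2: the brace group changed count, the subshell did not.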

In another language (e.g. Perl), you could slurp the first column into a hash and then process the second column sequentially, skipping anything already in the hash.
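
The same hash idea can be sketched without leaving the shell by using awk instead of Perl (a swapped-in tool, not what the answer names). The function name graph_nodes and the sample data are hypothetical; this variant checks both columns in one pass and prints nodes in first-seen order rather than sorted:

```shell
# Hypothetical helper: print each distinct node once, in first-seen order.
graph_nodes() {
    awk -F'\t' '
        !seen[$1]++ { print $1 }   # first column, if not seen before
        !seen[$2]++ { print $2 }   # second column, if not seen before
    ' "$1"
}

# Small made-up sample graph to try it on.
sample=$(mktemp)
printf 'a\tb\t1\nb\tc\t2\na\tc\t5\n' > "$sample"
nodes=$(graph_nodes "$sample")
rm -f "$sample"
echo "$nodes"
```

This avoids sorting altogether, at the cost of holding every distinct node in memory.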

If you can rely on Bash, you can also avoid temporary files with process substitution: cat <(cmd1) <(cmd2). Bash takes care of creating the temporary file descriptors and setting up the pipelines.
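
Applied to the graph file, the cat <(cmd1) <(cmd2) form might look like the sketch below. It assumes the script is run by bash (plain sh may reject the <( ) syntax), and the sample data is made up:

```shell
# Requires bash: <( ... ) is process substitution, not POSIX sh.
# Made-up sample graph standing in for input_graph.
input_graph=$(mktemp)
printf 'a\tb\t1\nb\tc\t2\na\tc\t5\n' > "$input_graph"

# Each <( ... ) appears to cat as a readable file name.
nodes=$(cat <(cut -f1 "$input_graph") <(cut -f2 "$input_graph") | sort -u)

rm -f "$input_graph"
echo "$nodes"
```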

In a script (where you may want to avoid requiring bash), if you end up needing temporary files, use mktemp to create them.
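
A minimal sketch of the temporary-file route with mktemp, assuming a system where mktemp is available. The variable names follow the question's file names, the sample data is made up, and the trap is an addition for cleanup:

```shell
# Made-up sample graph standing in for input_graph.
input_graph=$(mktemp)
printf 'a\tb\t1\nb\tc\t2\na\tc\t5\n' > "$input_graph"

# mktemp creates a uniquely named file and prints its path.
nodes1=$(mktemp)
nodes2=$(mktemp)

# Remove the temporary files when the script exits.
trap 'rm -f "$input_graph" "$nodes1" "$nodes2"' EXIT

cut -f1 "$input_graph" > "$nodes1"
cut -f2 "$input_graph" > "$nodes2"

# sort accepts several files and -u drops duplicates, so one sort suffices.
nodes=$(sort -u "$nodes1" "$nodes2")
echo "$nodes"
```

Unlike fixed names such as nodes1 and nodes2, the generated paths do not collide when two copies of the script run at once.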

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow