EDIT:
Hey user84771,
So I reworked my answer completely based on what you said. It has a couple more lines in it, but hopefully this is what you're looking for.
In order to find the longest row for each ID, similar to a GROUP BY in MySQL, I would do the following.
Given the following text file:
[root@dev7 ~]# cat stackoverflow2.log
ID1, fdsgfdsggfdsgsdfg
ID1, fdsgsdfg
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID1, fdsgsdfg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID2, fsfgsdgf
ID3, fdgfdgdgfdggfdg
[root@dev7 ~]#
I'd do the following:
_DATAFILE=stackoverflow2.log
_KEYS=$(awk '{ $1=$1; print $1}' ${_DATAFILE} | uniq | sed "s,\,,,g" | xargs )
_LARGEST_PER_KEY=""
echo $_KEYS
for i in ${_KEYS}; do
_LARGEST_PER_KEY="${_LARGEST_PER_KEY}\n$(grep "^${i}," ${_DATAFILE} | uniq | awk '{ print length ":", $0 }' | sort -n -u | tail -1 | cut -d ":" -f2 | awk '{ $1=$1; print}')"
done;
echo -e ${_LARGEST_PER_KEY}
To explain what's happening:
- _DATAFILE - This variable is your input file.
- _KEYS - This variable holds all of the distinct keys from the first column (adjacent duplicates removed with uniq, the trailing comma stripped with sed). I used xargs to make
sure all of the keys end up on a single line for the next step.
[root@dev7 ~]# _KEYS=$(awk '{ $1=$1; print $1}' ${_DATAFILE} | uniq |
sed "s,\,,,g" | xargs )
[root@dev7 ~]# echo $_KEYS
ID1 ID2 ID3
- _LARGEST_PER_KEY - This variable holds your result when we're done. We define it (empty) before the for loop.
The for loop performs a grep for the key in question (e.g. ID1), then runs my earlier line of code to figure out which matching line contains the longest data value, using a numeric/unique sort so the largest ends up last. We grab that line with tail and append it to our _LARGEST_PER_KEY string. (Note: we add \n characters as delimiters.)
Once the for loop finishes, we echo out the results using echo -e to ensure that the newline characters get evaluated correctly on the screen:
[root@dev7 ~]# echo -e ${_LARGEST_PER_KEY}
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID3, fdgfdgdgfdggfdg
Note: since each key's lines are already sorted inside the loop, there should be no reason to sort the final output again.
Clarification notes:
awk '{ $1=$1; print}' - Rebuilds each line, stripping leading/trailing whitespace
uniq - Gets rid of the duplicates
awk '{ print length ":", $0 }' - Gets the length of each line and prints it out as "length of line": "line text"
sort -n -u - Numeric sort (the largest number is the last item). Also ensures that everything is sorted uniquely even if the data file
arrives unsorted. Thanks for the tip,
Glenn.
tail -1 - Grabs the last line, since it's the longest
cut -d ":" -f2 - Strips the length prefix if you only want the exact line, so just the line itself is returned
awk '{ $1=$1; print}' - Strips the leading space left over after cut
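If it helps to see the loop body broken apart, here's each stage run for a single key against the sample file. In this sketch I've anchored the grep pattern (^ID1,) so a key like ID1 can't also match a hypothetical ID10; the rest is the same pipeline as above:

```shell
# Recreate the sample data file from above
cat > /tmp/stackoverflow2.log <<'EOF'
ID1, fdsgfdsggfdsgsdfg
ID1, fdsgsdfg
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID1, fdsgsdfg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID2, fsfgsdgf
ID3, fdgfdgdgfdggfdg
EOF

# 1. All lines for the key (grep anchored so ID1 cannot match ID10)
grep "^ID1," /tmp/stackoverflow2.log | uniq

# 2. Each line prefixed with its length
grep "^ID1," /tmp/stackoverflow2.log | uniq \
  | awk '{ print length ":", $0 }'

# 3. Numeric sort puts the longest last; tail -1 grabs it
grep "^ID1," /tmp/stackoverflow2.log | uniq \
  | awk '{ print length ":", $0 }' | sort -n -u | tail -1

# 4. cut drops the length prefix; awk trims the leftover leading space
grep "^ID1," /tmp/stackoverflow2.log | uniq \
  | awk '{ print length ":", $0 }' | sort -n -u | tail -1 \
  | cut -d ":" -f2 | awk '{ $1=$1; print }'
```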
Again, I'm sure there's a way to do this that is a bit more efficient, but this is what I was able to come up with. Hope this helps!
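For what it's worth, here's one more-efficient single-pass sketch (not part of my original approach, just an alternative): awk tracks the longest line seen for each key in an array and prints one winner per key at the end.

```shell
# Recreate the sample data file from above
cat > /tmp/stackoverflow2.log <<'EOF'
ID1, fdsgfdsggfdsgsdfg
ID1, fdsgsdfg
ID1, fdsgfdgdsfgdsgsdfgdgffdsgfsdg
ID1, fdsgsdfg
ID2, fdgsfdsgfdshshdsfhdfghdsfhdfhdshsdfhsfdh
ID2, fsfgsdgf
ID3, fdgfdgdgfdggfdg
EOF

# Single pass: remember the longest line per key ($1 is the ID when
# splitting on commas), then print each key's longest line at the end.
# awk's "for (k in line)" order is unspecified, hence the trailing sort.
awk -F',' '
    length($0) > max[$1] { max[$1] = length($0); line[$1] = $0 }
    END { for (k in line) print line[k] }
' /tmp/stackoverflow2.log | sort
```

This avoids re-reading the data file once per key, which matters if you have many distinct IDs.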