Exclude e-mail addresses whose domain name matches a global domain
Question
The global domains are given as "*@" entries. When an e-mail address matches one of these global domains, I need to exclude it from the list.
Example:
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
Output:
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
Solution
$ awk -F, 'NR==FNR && /\*@/{a[substr($2,3)]=1;print;next}NR!=FNR && $2 !~ /^\*/{x=$2;sub(/.*@/,"",x); if (!(x in a))print;}' OFS=, file file
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
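For readers who find the one-liner dense, here is a spelled-out sketch of the same two-pass idea. The file name emails.txt, the function name filter_globals, and the array name seen are placeholders chosen for this illustration, not part of the original answer.

```shell
# Sample input, copied from the question (emails.txt is a hypothetical name).
cat > emails.txt <<'EOF'
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
EOF

# filter_globals FILE: read FILE twice with awk.
filter_globals() {
    awk -F, '
        # Pass 1: print every global entry and remember its domain.
        NR == FNR {
            if ($2 ~ /^\*@/) { seen[substr($2, 3)] = 1; print }
            next
        }
        # Pass 2: print ordinary addresses whose domain was not remembered.
        $2 !~ /^\*/ {
            domain = $2
            sub(/.*@/, "", domain)
            if (!(domain in seen)) print
        }
    ' "$1" "$1"
}

filter_globals emails.txt
```

Because the global entries are printed during the first pass, they always come before the surviving addresses in the output.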
Other tips
You have two types of data in the same file, so the easiest way to process it is to split it first:
<infile tee >(grep '\*@' > global) >(grep -v '\*@' > addr) > /dev/null
Then use global to remove information from addr:
grep -vf <(cut -d@ -f2 global) addr
Putting it together:
<infile tee >(grep '\*@' > global) >(grep -v '\*@' > addr) > /dev/null
cat global <(grep -vf <(cut -d@ -f2 global) addr) > outfile
Contents of outfile:
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
Clean up temporary files with rm global addr.
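One caveat with the tee approach: bash does not wait for the commands inside >( ... ) to finish, so global and addr may still be incomplete when the next command reads them. A sequential sketch that avoids process substitution could look like this. The sample infile is the data from the question, and domains is an extra temporary file introduced only for this illustration.

```shell
# Sample input from the question (assumed to live in ./infile).
cat > infile <<'EOF'
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
EOF

# Split the two record types with plain greps; each finishes before the next starts.
grep '\*@' infile > global
grep -v '\*@' infile > addr

# Domain part of every global entry, one pattern per line.
cut -d@ -f2 global > domains

# Global entries first, then the addresses that match none of the domains.
{ cat global; grep -vf domains addr; } > outfile

cat outfile
```

The patterns in domains are unanchored regular expressions, so this is adequate for data like the sample but would also drop an address whose text merely contains a global domain. Clean up with rm global addr domains.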
You could do:
grep -o "\*@.*" file.txt | sed -e 's/^\*/[^*]/' > global.txt
grep -vf global.txt file.txt
This starts by extracting the global entries and replacing the leading * of each with [^*], saving the results in global.txt. That file is then used as the pattern list for grep, where each line is treated as a regex of the form [^*]@global.domain.com, which matches an ordinary address at that domain but not the *@ entry itself. The -v option tells grep to print only the lines that match none of those patterns.
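The exact shape of the generated pattern matters here, because a * directly after a bracket expression is a quantifier, not a literal star. The following check is a small sketch using superuser.com from the question's data; the variable names are made up for the demonstration.

```shell
global_line='WF,*@superuser.com'
address_line='WF,test@superuser.com'

# [^*]* may match zero characters, so this pattern also hits the global entry.
loose=$(printf '%s\n' "$global_line" | grep -c '[^*]*@superuser.com' || true)

# [^*] must consume one character that is not "*", so the global entry is spared...
strict=$(printf '%s\n' "$global_line" | grep -c '[^*]@superuser.com' || true)

# ...while an ordinary address at that domain still matches.
address=$(printf '%s\n' "$address_line" | grep -c '[^*]@superuser.com' || true)

echo "loose=$loose strict=$strict address=$address"
```

A pattern list fed to grep -v therefore needs the [^*]@domain form if the *@ lines are meant to survive.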
Another analogous option, using sed for in-place editing, would be:
grep -o "\*@.*" file.txt | sed -e 's/^\*\(.*\)$/\/[^*]\1\/d/' > global.sed
sed -i -f global.sed file.txt
Here's one way using GNU awk. Run like:
awk -f script.awk file.txt{,}
Contents of script.awk:
BEGIN {
    FS=","
}
FNR==NR {
    if (substr($NF,1,1) == "*") {
        array[substr($NF,2)]++
    }
    next
}
substr($NF,1,1) == "*" || !(substr($NF,index($NF,"@")) in array)
Results:
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
Alternatively, here's the one-liner:
awk -F, 'FNR==NR { if (substr($NF,1,1) == "*") array[substr($NF,2)]++; next } substr($NF,1,1) == "*" || !(substr($NF,index($NF,"@")) in array)' file.txt{,}
With one pass of the file and allowing for the global domains to be intermixed with the addresses:
$ cat file
WF,*@stackoverflow.com
WF,test@superuser.com
WF,*@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
WF,*@stackexchange.com
WF,foo@stackapps.com
$
$ awk -F'[,@]' '
    $2=="*" { glbl[$3]; print; next }
    { addrs[$3] = addrs[$3] $0 ORS }
    END {
        for (dom in addrs)
            if (!(dom in glbl))
                printf "%s",addrs[dom]
    }
' file
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
WF,foo@stackapps.com
Or, if you don't mind a two-pass approach:
$ awk -F'[,@]' '(NR==FNR && $2=="*" && !glbl[$3]++) || (NR!=FNR && !($3 in glbl))' file file
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
WF,foo@stackapps.com
I know that second one's a bit cryptic, but it's pretty easily translated to not use the default action and a good exercise in awk idioms :-).
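As a sketch of that translation, the two-pass one-liner can be written with explicit actions instead of relying on awk's default print. The wrapper function two_pass is a name invented for this example; the sample file is the intermixed data shown above.

```shell
# Intermixed sample data from this answer.
cat > file <<'EOF'
WF,*@stackoverflow.com
WF,test@superuser.com
WF,*@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
WF,*@stackexchange.com
WF,foo@stackapps.com
EOF

# two_pass FILE: same logic as the one-liner, with the actions written out.
two_pass() {
    awk -F'[,@]' '
        # First pass: print each global entry once and record its domain.
        NR == FNR {
            if ($2 == "*" && !glbl[$3]++)
                print
            next
        }
        # Second pass: print lines whose domain was never recorded.
        !($3 in glbl) { print }
    ' "$1" "$1"
}

two_pass file
```

The short-circuit on $2 == "*" matters: glbl[$3] is only touched for global entries, so ordinary addresses never create array elements by accident.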
This might work for you (GNU sed):
sed '/.*\*\(@.*\)/!d;s||/[^*]\1/d|' file | sed -f - file
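To see what this pipeline does, it can help to look at the intermediate script that the first sed produces. The sketch below saves it to a file instead of piping it; global.sed is a name chosen for this illustration, the sample file is the data from the question, and GNU sed is assumed, as in the answer.

```shell
# Sample input from the question.
cat > file <<'EOF'
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
EOF

# Step 1: keep only the global lines and turn each into a delete command.
# The empty pattern in s||...| reuses the regex from the address before it.
sed '/.*\*\(@.*\)/!d;s||/[^*]\1/d|' file > global.sed
cat global.sed

# Step 2: run the generated commands against the original file.
sed -f global.sed file
```

Each generated line has the form /[^*]@domain/d, so it deletes ordinary addresses at a global domain while leaving the *@ entries alone.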