Exclude e-mails whose domain name matches a global one

https://stackoverflow.com/questions/13119804

15-07-2021

Question

Global domains are listed with the "*@" prefix. When an e-mail address matches one of these global domains, I need to exclude it from the list.

Example:

WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com

Output:

WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com

Solution

$ awk -F, 'NR==FNR && /\*@/{a[substr($2,3)]=1;print;next}NR!=FNR && $2 !~ /^\*/{x=$2;sub(/.*@/,"",x); if (!(x in a))print;}' OFS=, file file
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
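
The one-liner above is dense, so here is a commented sketch of what appears to be the same two-pass logic, wrapped in a shell function. The function name filter_global and the awk variable names are made up for illustration; the file is read twice, as in the original command.

```shell
# Hypothetical helper: prints the global entries plus every address
# whose domain is not covered by a global entry.
# Usage: filter_global file
filter_global() {
    awk -F, '
        # First pass: remember each global domain and print its entry.
        NR == FNR {
            if ($2 ~ /^\*@/) {
                global[substr($2, 3)] = 1   # drop the leading "*@"
                print
            }
            next
        }
        # Second pass: look only at ordinary addresses.
        $2 !~ /^\*/ {
            domain = $2
            sub(/.*@/, "", domain)          # keep what follows the "@"
            if (!(domain in global))
                print
        }
    ' "$1" "$1"
}
```

Because the global entries are printed during the first pass, they always come out before the remaining addresses.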

Other tips

You have two types of data in the same file, so the easiest way to process it is to split it first:

<infile tee >(grep '\*@' > global) >(grep -v '\*@' > addr) > /dev/null

Then use global to filter the matching addresses out of addr:

grep -vf <(cut -d@ -f2 global) addr

Putting it together:

<infile tee >(grep '\*@' > global) >(grep -v '\*@' > addr) > /dev/null
cat global <(grep -vf <(cut -d@ -f2 global) addr) > outfile

Contents of outfile:

WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com

Clean up temporary files with rm global addr.
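
If the temporary files are unwanted, the same idea can be sketched as a pipeline. This is only a variant under some assumptions: the function name filter_with_grep is invented here, and it relies on grep accepting "-f -" to read patterns from standard input (GNU grep does). It uses fixed-string matching (-F) on "@domain", so dots are not treated as regex wildcards.

```shell
# Hypothetical helper: same result as the tee/grep steps above,
# without the intermediate files "global" and "addr".
# Usage: filter_with_grep infile > outfile
filter_with_grep() {
    # 1. Print the global entries unchanged.
    grep -F '*@' "$1"
    # 2. Turn each global entry into the fixed string "@domain" and
    #    drop every line containing one of them. This also drops the
    #    global entries themselves, which step 1 has already printed.
    grep -F '*@' "$1" | cut -d'*' -f2 | grep -vFf - "$1"
}
```

The match is still a substring match, so a global entry for @example.com would also exclude addresses at @example.com.au.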

You could do:

grep -o "\*@.*" file.txt | sed -e 's/^\*/[^*]/' > global.txt
grep -vf global.txt file.txt

This starts by extracting the global entries and replacing their leading * with [^*], saving the result in global.txt. That file is then given to grep, which treats each line as a regex of the form [^*]@global.domain.com, that is, any character other than * immediately before the @. The -v option tells grep to print only the lines that match none of these patterns, so the *@ entries themselves are kept.

Another analogous option, using sed for in-place editing would be:

grep -o "\*@.*" file.txt | sed -e 's/^\*\(.*\)$/\/[^*]\1\/d/' > global.sed
sed -i -f global.sed file.txt

Here's one way using GNU awk. Run like:

awk -f script.awk file.txt{,}

Contents of script.awk:

BEGIN {
    FS=","
}

FNR==NR {
    if (substr($NF,1,1) == "*") {
        array[substr($NF,2)]++
    }
    next
}

substr($NF,1,1) == "*" || !(substr($NF,index($NF,"@")) in array)

Results:

WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com

Alternatively, here's the one-liner:

awk -F, 'FNR==NR { if (substr($NF,1,1) == "*") array[substr($NF,2)]++; next } substr($NF,1,1) == "*" || !(substr($NF,index($NF,"@")) in array)' file.txt{,}

With one pass of the file and allowing for the global domains to be intermixed with the addresses:

$ cat file
WF,*@stackoverflow.com
WF,test@superuser.com
WF,*@superuser.com
WF,test@stackapps.com
WF,test@stackexchange.com
WF,*@stackexchange.com
WF,foo@stackapps.com
$
$ awk -F'[,@]' '
   $2=="*" { glbl[$3]; print; next }
   { addrs[$3] = addrs[$3] $0 ORS }
   END {
      for (dom in addrs)
         if (!(dom in glbl))
            printf "%s",addrs[dom]
   }
' file
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
WF,foo@stackapps.com
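
One caveat with the one-pass script above: awk does not define the order of "for (dom in addrs)", so when several non-global domains survive, their groups may come out in any order. A possible sketch that keeps the addresses in input order follows; the function name filter_one_pass and the arrays line and dom are illustrative, not part of the original answer.

```shell
# Hypothetical helper: one pass over the file, global entries may be
# intermixed with addresses, surviving addresses keep their input order.
# Usage: filter_one_pass file
filter_one_pass() {
    awk -F'[,@]' '
        # Global entry: remember the domain and print it right away.
        $2 == "*" { glbl[$3]; print; next }
        # Ordinary address: buffer the line together with its domain.
        { line[++n] = $0; dom[n] = $3 }
        END {
            # All global domains are known now, so filter the buffer.
            for (i = 1; i <= n; i++)
                if (!(dom[i] in glbl))
                    print line[i]
        }
    ' "$1"
}
```

As in the original, the global entries are printed first and the remaining addresses follow at the end.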

Or, if you don't mind a 2-pass approach:

$ awk -F'[,@]' '(NR==FNR && $2=="*" && !glbl[$3]++) || (NR!=FNR && !($3 in glbl))' file file
WF,*@stackoverflow.com
WF,*@superuser.com
WF,*@stackexchange.com
WF,test@stackapps.com
WF,foo@stackapps.com

I know that second one's a bit cryptic, but it's pretty easily translated to not use the default action and a good exercise in awk idioms :-).
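
For reference, one way the 2-pass condition might be spelled out with explicit actions instead of the default one is sketched below. The function name filter_two_pass is made up for this example; the logic is intended to match the one-liner above.

```shell
# Hypothetical helper: explicit-action version of the 2-pass one-liner.
# Usage: filter_two_pass file
filter_two_pass() {
    awk -F'[,@]' '
        # First pass: print each global entry once and record its domain.
        NR == FNR {
            if ($2 == "*" && !($3 in glbl)) {
                glbl[$3]
                print
            }
            next
        }
        # Second pass: print lines whose domain has no global entry.
        # Global entries are skipped here because their own domain is
        # always in glbl.
        !($3 in glbl) { print }
    ' "$1" "$1"
}
```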

This might work for you (GNU sed):

sed '/.*\*\(@.*\)/!d;s||/[^*]\1/d|' file | sed -f - file
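
To see how this works, it can help to look at what the first sed emits before it is piped into the second one. The sketch below wraps that first stage in a function (the name gen_sed_script is invented here): it keeps only the global entries and rewrites each into a delete command for addresses at that domain.

```shell
# Hypothetical helper: print the sed script generated by the first stage.
# Usage: gen_sed_script file
gen_sed_script() {
    # /.*\*\(@.*\)/!d   drop every line that is not a "*@domain" entry
    # s||/[^*]\1/d|     reuse that regex (empty pattern) and replace the
    #                   whole line with the command /[^*]@domain/d
    sed '/.*\*\(@.*\)/!d;s||/[^*]\1/d|' "$1"
}

# The full command then feeds this script to a second sed:
#   gen_sed_script file | sed -f - file
```

For the example input this produces one line per global domain, such as /[^*]@stackoverflow.com/d. The [^*] requires a character other than * directly before the @, which is why the global entries themselves are not deleted.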

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow