Removing similar rows in SAS
-
25-05-2021 - |
Domanda
I've got two tables with similar structure:
- First table: id and col1,col2,col3 - all numerics.
- Second table: id and col4,col5,col6 - all numerics.
I want to remove from the first one all rows which are similar to any of the rows from the second tagble. I consider a row to be similiar to other row when any column from the group col1-col3 is equal to any of the columns from the group col4-col6. Now I'm doing it in 9 consecutive data steps (first checks whether col1=col4, second col1=col5 , ..., ninth col3=col6), which probably is not the optimal solution.
Any ideas how to improve this?
Soluzione
This is my solution:
data vec1;
set ds2;
array cvar{*} col4 col5 col6;
do ijk=1 to dim(cvar);
compvar=cvar(ijk);
output;
end;
run;
proc sql noprint;
select distinct compvar into :cvars separated by ' '
from vec1;
quit;
%let numcvar=&sqlobs;
data ds1(drop=i);
set ds1;
array myvar(i) col:;
do over myvar;
if myvar in (&cvars.) then delete;
end;
run;
If you run into trouble with the length of the CVARS macro variable you could use this instead:
data vec1;
set ds2;
array cvar{*} col:;
do ijk=1 to dim(cvar);
compvar=cvar(ijk);
output;
end;
run;
proc sort data=vec1 out=vec2(keep=compvar) nodupkey;
by compvar;
run;
proc transpose data=vec2 out=flat prefix=x;
run;
data ds1(keep=id col:);
set ds1b;
if _n_=1 then set flat;
array myvar(i) col:;
array xvar(j) x:;
do over myvar;
do over xvar;
if myvar=xvar then delete;
end;
end;
run;
The PROC SORT could be eliminated but it makes it more efficient for big data sets.
Or you could generate a format on the fly:
data vec1;
set ds2;
array cvar{*} col4 col5 col6;
do ijk=1 to dim(cvar);
compvar=cvar(ijk);
output;
end;
run;
proc sort data=vec1 out=vec2 nodupkey;
by compvar;
run;
data fmt1;
set vec2;
length start $20;
fmtname="remobs";
start=compress(put(compvar,best.));
label="remove";
run;
proc format lib=work cntlin=fmt1;
run;
data ds1(drop=i);
set ds1;
array myvar(i) col:;
do over myvar;
if put(myvar,remobs.)="remove" then delete;
end;
run;
I suspect this last method would be faster than the two preceding solutions.
UPDATE
Using hash objects
data vec1;
set ds2;
array cvar{*} col4 col5 col6;
do ijk=1 to dim(cvar);
compvar=cvar(ijk);
output;
end;
run;
proc sort data=vec1 out=vec2 nodupkey;
by compvar;
run;
data ds1_new(keep=id col1 col2 col3);
if _n_ = 0 then set work.vec2;
declare hash myhash(DATASET:'work.vec2') ;
rc=myhash.defineKey('compvar');
myhash.defineDone();
set ds1;
array rcarr{*} rc1-rc3;
array lookup{*} col1 col2 col3;
do i=1 to dim(lookup);
rcarr(i)=myhash.find(key: lookup(i));
if rcarr(i)=0 then delete;
end;
run;
Altri suggerimenti
ok, 2nd attempt to answer this. I've created a cartesian join of the 2 datasets in order to match every row in table 1 with every row in table 2. You can then use the arrays to find out which rows have repeat values.
data ds1;
input id col1 col2 col3;
cards;
1 10 20 30
2 40 50 60
3 70 80 90
4 15 25 35
5 45 55 65
;
run;
data ds2;
input id col4 col5 col6;
cards;
10 100 200 300
12 60 50 600
13 700 800 70
16 15 20 300
;
run;
proc sql;
create view all_cols as select
ds1.id as id1, ds2.id as id2,* from ds1,ds2;
quit;
data match;
set all_cols (keep=id1 id2 col:);
array vars1{*} col1-col3;
array vars2{*} col4-col6;
do i=1 to dim(vars1);
do j=1 to dim(vars2);
if vars1{i}=vars2{j} then do;
output;
return;
end;
end;
end;
drop i j;
run;
proc sort data=match;
by id1;
run;
data ds1;
modify ds1 match (in=b keep=id1 rename=(id1=id));
by id;
if b then remove;
run;