How to remove duplicates in a SAS data step

https://stackoverflow.com/questions/21775082

11-10-2022

Question

How can I remove duplicates in a SAS data step?

data uscpi;
      input year month cpi;
   datalines;
   1990  6 129.9
   1990  7 130.4
   1990  8 131.6
   1990  9 132.7
   1991  4 135.2
   1991  5 135.6
   1991  6 136.0
   1991  7 136.2
   ;
   run;

PROC SORT DATA = uscpi OUT = uscpi_dist NODUPKEY; 
 BY year ; 
 RUN;

I can do this with a PROC step, but how do I remove duplicates in a data step? Thanks in advance.

Solution

You can use the first. and last. automatic variables that SAS creates during BY-group processing. They give you more control over which row is treated as the duplicate. See the SAS documentation on BY-group processing in the DATA step for details.

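 /* BY-group processing needs input sorted by the BY variable, */
 /* so create uscpi_sorted from the question's uscpi dataset first. */
 proc sort data=uscpi out=uscpi_sorted;
 by year;
 run;

 data uscpi_dedupedByYear;
 set uscpi_sorted;
 by year;
 if first.year; /* keep only the first occurrence of each distinct year */
 /*if last.year; */ /* alternatively, keep only the last occurrence of each distinct year */
 run;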
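
To see what these flags actually contain, here is a minimal sketch that writes them to the log for every row. It assumes the uscpi dataset from the question, which is already in year order, and creates no output dataset.

```sas
data _null_;
 set uscpi;
 by year;
 /* first.year and last.year are temporary 0/1 flags set by BY-group   */
 /* processing; they exist only during the step and are never written  */
 /* to an output dataset.                                              */
 put year= month= first.year= last.year=;
run;
```

Each year's first row should show first.year=1 and its last row last.year=1, which is what the subsetting IF in the step above tests.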

A lot depends on how your input dataset is sorted. For example, if it is sorted by year and month and you use if first.year; then only the earliest month in each year is kept. However, if it is sorted by year and descending month, then if first.year; keeps the last month in each year.
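
The second case can be sketched as follows, assuming the uscpi dataset from the question. The names uscpi_desc and uscpi_latestMonth are made up for illustration.

```sas
/* Sort so that, within each year, the latest month comes first. */
proc sort data=uscpi out=uscpi_desc;
 by year descending month;
run;

data uscpi_latestMonth;
 set uscpi_desc;
 by year;        /* the data is still in ascending year order, so BY year is valid */
 if first.year;  /* first row of each year is now the latest month */
run;
```

With the sample data this should keep month 9 for 1990 and month 7 for 1991, whereas sorting by year and ascending month would keep months 6 and 4.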

This behaviour differs from NODUPKEY, which simply keeps the first observation it encounters for each BY value and offers no equivalent of last.

Licensed under: CC-BY-SA with attribution
