Question

I am trying to create a dummy variable A to indicate the number of ID appearances. If ID appears exactly twice, then A equals 0; if ID appears more than 2 times, then A equals 1.

My desired output is as follows:

  ID     YEAR  A
 1078    1989  0
 1078    1999  0
 1161    1969  0
 1161    2002  0
 1230    1995  0
 1230    2002  0
 1279    1996  0
 1279    2003  0
 1447    1993  0
 1447    2001  0
 1487    1967  1
 1487    2008  1
 1487    2009  1
 1678    1979  0
 1678    2002  0

My code is:

data new data;
  set data;
  by ID YEAR;
  if First.ID then count=0;
  count+1;
run;


data newdata;
  set newdata;
  by ID YEAR count;
  if sum(count)=3 then A=0;
  if sum(count)>3 then A=1;
run;

But the output is incorrect.

Was it helpful?

Solution

You can't use sum(count) to sum the count across multiple observations. Instead you need to first count the observations, and then merge the counts into the original data.

Data step solution

data newdata;
set data;
by ID YEAR;
if First.ID then count=0;
count+1;
if last.id then output;
drop year;
run;

if last.id then output means that newdata will only have the last observation for each id. We want that because the last observation has the county of the number of observations per id.

This step merges the original data "data" with the counts from "newdata" and computes a.

data newdata2;
merge data newdata;
by ID;
if count=2 then A=0;
if count>2 then A=1;
drop count;
run;

proc sql solution You can do it even more quickly with proc sql:

proc sql;
    create table newdata as
    select id, year, (case when count(*) = 2 then 0 else 1 end) as a
    from data
    group by id;
quit;

group by id means that summary statistics such as count() are calculated by id, so the count(*) means count observations in each id. (case when count(*) = 2 then 0 else 1 end) as a creates a variable a. If the count for a given id is 2 then a takes the value of 0, else a takes the value of 1.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top