Question

I have a dataset unique across 5 variables. The 5th is an identifier variable. Finally, I have a 6th variable, which is dependent on the identifier variable.

The identifier variable can appear in multiple places. The dependent variable will never change for a given value of the identifier variable.

I have code such as the following:

proc sql;
select
...
, count(distinct identifier) as n_ids
from
group by

Which selects the number of unique identifiers per group of 4 independent variables. I'm hoping to add on to this the sum of the 6th variable, which would be something like the following:

sum(case when distinct identifier then dependent_var else 0 end)

Which obviously does not work (and for good reason). Any clean way of finding this sum within the sql step?

Was it helpful?

Solution

The easiest solution is probably to summarize the dataset first by the identifier:

proc sql;
select biggerstuff, identifier, max(depvar) 
from yourdataset
group by biggerstuff,identifier;
quit;

Then insert that in your larger query in the place of the 'from' statement (select blah, count(identifier), sum(depvar)). Once you've pre-summarized it in the inner query you know that you only get 1 row per identifier so distinct isn't needed any longer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top