Question

I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to determine which states make up the top third of all enrollments. So, the top third should be the largest states whose cumulative share of enrollment first reaches at least .33333...

I have tried specifying cumulative percentages between .33333 and .40000 in various ways, but I have not found a rule that works in the general case. PROC RANK also can't be used because the data is organized as a frequency table...

I have included some dummy (but representative) data below.


data state_counts;
input state $20.  enrollment;
cards;
CALIFORNIA                   440233
TEXAS                        318921
NEW YORK                     224867
FLORIDA                      181517
ILLINOIS                     162664
PENNSYLVANIA                 155958
OHIO                         141083
MICHIGAN                     124051
NEW JERSEY                   117131
GEORGIA                      104351
NORTH CAROLINA               102466
VIRGINIA                      93154
MASSACHUSETTS                 80688
INDIANA                       75784
WASHINGTON                    73764
MISSOURI                      73083
MARYLAND                      73029
WISCONSIN                     72443
TENNESSEE                     71702
ARIZONA                       69662
MINNESOTA                     66470
COLORADO                      58274
ALABAMA                       54453
LOUISIANA                     50344
KENTUCKY                      49595
CONNECTICUT                   47113
SOUTH CAROLINA                46155
OKLAHOMA                      43428
OREGON                        42039
IOWA                          38229
UTAH                          36476
KANSAS                        36469
MISSISSIPPI                   33085
ARKANSAS                      32533
NEVADA                        27545
NEBRASKA                      24571
NEW MEXICO                    22485
WEST VIRGINIA                 21149
IDAHO                         20596
NEW HAMPSHIRE                 19121
MAINE                         18213
HAWAII                        16304
RHODE ISLAND                  13802
DELAWARE                      12025
MONTANA                       11661
SOUTH DAKOTA                  11111
VERMONT                       10082
ALASKA                         9770
NORTH DAKOTA                   9614
WYOMING                        7457
DIST OF COLUMBIA               6487
;
run;

*****  calculating the cumulative frequencies by hand  ;


proc sql;
    create table dummy_3 as
        select

            state,
            enrollment,
            sum(enrollment) as total_enroll,
            enrollment / calculated total_enroll as percent_total

    from    state_counts

    order by percent_total desc ;
quit;    



data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;

Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.

Is there any way to do this programmatically? I'd eventually like to create a flag variable for selecting states.

Thanks...


Solution

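You can compute the percentages with PROC FREQ and a WEIGHT statement, and then flag the states in the first third using the LAG function. ORDER=DATA keeps the states in their input order, so this assumes the table is already sorted by descending enrollment, as it is here. Note that PERCENT in the output data set is on a 0-100 scale: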

proc freq data=state_counts noprint order=data;
   tables state / out=state_counts2;
   weight enrollment;
run;

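data top3rd;
   set state_counts2;
   cum_percent+percent;   /* running total of PERCENT (0-100 scale) */
   /* flag a state if the total before it was still under one third, */
   /* so the state that crosses the threshold is included            */
   if lag(cum_percent)<100/3 then top_third=1;
   else top_third=0;
run;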
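The PROC FREQ approach above relies on the input already being in descending order. A minimal sketch of a variant that sorts explicitly and lets PROC FREQ produce the cumulative percentage itself through the OUTCUM option is shown here. The data set names state_sorted, state_pct, and top3rd_sorted and the variable prev_cum_pct are made up for this illustration.

```sas
/* Sort explicitly so the result does not depend on the input order */
proc sort data=state_counts out=state_sorted;
   by descending enrollment;
run;

/* OUTCUM adds CUM_FREQ and CUM_PCT (0-100 scale) to the output data set */
proc freq data=state_sorted noprint order=data;
   tables state / out=state_pct outcum;
   weight enrollment;
run;

data top3rd_sorted;
   set state_pct;
   /* cumulative percentage reached before this state;        */
   /* missing on the first row, and missing < 100/3 is true   */
   prev_cum_pct = lag(cum_pct);
   top_third = (prev_cum_pct < 100/3);   /* 1 = in top third, 0 = not */
   drop prev_cum_pct;
run;
```

With the sample data this should flag the same five states listed in the question, since Illinois is the state whose row pushes the cumulative share past one third.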

OTHER TIPS

It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.

proc format;
    value pctile
    low-0.33333 = 'top third'
    0.33333<-.4 = 'next bit'
    0.4<-high = 'the rest'
    ;
run;
options fmtsearch=(work);

Then add a statement at the end of your data step:

pctile_flag = put(cum_percent,pctile.);
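One caveat: applied directly to cum_percent, the format labels the state that crosses the one-third line (Illinois in the sample data) as 'next bit', because its own cumulative value is already above 0.33333. If that state should count as part of the top third, one possible sketch is to format the previous row's cumulative value instead. This assumes dummy_4 from the question and the pctile. format above already exist; dummy_5 and prev_cum are names invented for this example.

```sas
data dummy_5;
   set dummy_4;
   /* cumulative share reached before the current state */
   prev_cum = lag(cum_percent);
   /* the first row has no predecessor; LOW in a numeric format */
   /* does not cover missing values, so substitute 0            */
   if prev_cum = . then prev_cum = 0;
   pctile_flag = put(prev_cum, pctile.);
   drop prev_cum;
run;
```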

Rewrite your last data step like this:

data dummy_4(drop=found);
    set dummy_3;
    retain cum_percent 0 found 0;
    cum_percent + percent_total;
    /* still below one third: part of the top third */
    if cum_percent < (1/3) then do;
        top_third = 1;
    end;
    /* first state to reach one third: include it, then stop */
    else if ^found then do;
        top_third = 1;
        found = 1;
    end;
    else
        top_third = 0;
run;

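Note: the first. syntax in your data step is incorrect. The first. and last. variables exist only when the data step has a BY statement. Without one, first.percent_total is just an uninitialized variable, so the IF condition is never true. You still get the right values in CUM_PERCENT because the sum statement cum_percent + percent_total; initializes cum_percent to 0 and retains it across rows.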
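For comparison, here is a minimal sketch of how first. is normally used, with a BY statement on sorted data. The grouping variable initial (the first letter of the state name) is not part of the original problem; it is introduced only to have a BY group to demonstrate with, and the data set names are likewise invented.

```sas
/* Derive an illustrative grouping variable from the state name */
data state_initial;
   set state_counts;
   length initial $1;
   initial = substr(state, 1, 1);
run;

/* BY-group processing requires the data to be sorted by the BY variable */
proc sort data=state_initial;
   by initial descending enrollment;
run;

data running_by_initial;
   set state_initial;
   by initial;
   /* first.initial is 1 on the first row of each group, 0 otherwise */
   if first.initial then group_total = 0;
   group_total + enrollment;   /* running total that restarts per group */
run;
```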

I am not aware of a PROC that will do this for you.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow