Question

I have the following data.frame which reports various data on countries over various years. The data is disaggregated by urban/rural, urban slum/urban non-slum, and capital city/other urban centres. Sadly the data is a bit patchy so not every country has data reported in the same year, and across all indicators.

I am trying to subset the data to produce some plots comparing the most recent data from every country. I've created a column in the data.frame called 'latest' which reports whether a row is the most recent year. However, when I try to compare, say, slum/non-slum, the most recent year does not always have data. I'd therefore like to create a subset that checks whether data is present in a given row and, if it isn't, selects data from the next most recent year.

I have a feeling this could be achieved by using the order of the factored variable 'Year' but no idea how to go about this. I can select only the rows with data in them, but this gives me multiple entries for each country as follows:

fever[(fever$Non.slum!='NA'),]

Produces this:

      COUNTRY   Year    Urban    Rural    Total Capital.City Other.Cities..towns Non.slum Slum latest
NA       <NA>   <NA>       NA       NA       NA           NA                  NA       NA   NA   <NA>
NA.1     <NA>   <NA>       NA       NA       NA           NA                  NA       NA   NA   <NA>
3    Ethiopia   2011 14.78709 16.03735 15.87641     11.86713            15.28213     10.6 15.4      y
4    Ethiopia   2005 16.00000 18.90000 16.90000     15.70000            16.10000     15.0 16.1      n
5    Ethiopia   2000 22.38637 25.18128 24.90000     19.49970            22.86689     19.9 22.6      n
6       Kenya 2008/9 20.71574 22.58868 22.20000     16.99561            22.39136     19.2 21.8      y
7       Kenya   2003 39.78713 40.75334 40.56866     38.45388            40.49664     31.7 42.8      n
8       Kenya   1998 41.67932 42.44481 42.30155     38.79310            43.27112     36.3 43.7      n
NA.2     <NA>   <NA>       NA       NA       NA           NA                  NA       NA   NA   <NA>
10    Lesotho   2009 12.93654 16.22281 17.90000     13.25208            12.69136     11.7 13.6      y

So what I need is a function to select only those rows where data exists in the Slum/Non.Slum column, but only a single entry per COUNTRY based on the most recent data available.

I've searched through the forum to try and find an answer but not getting very far:(

Can anyone offer any handy advice?

Thanks

p.s. here's my data:

structure(list(COUNTRY = structure(c(1L, 2L, 3L, 3L, 3L, 4L, 
4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 
8L, 8L, 9L, 9L, 9L, 9L, 9L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 
12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L), .Label = c("Comoros", 
"Eritrea", "Ethiopia", "Kenya", "Lesotho", "Madagascar", "Malawi", 
"Namibia", "Rwanda", "Swaziland", "Tanzania", "Uganda", "Zambia", 
"Zimbabwe"), class = "factor"), Year = structure(c(5L, 12L, 25L, 
16L, 9L, 22L, 13L, 7L, 2L, 23L, 15L, 22L, 14L, 6L, 1L, 24L, 15L, 
9L, 1L, 13L, 6L, 19L, 9L, 1L, 24L, 21L, 16L, 9L, 1L, 19L, 24L, 
21L, 15L, 8L, 5L, 1L, 18L, 10L, 4L, 20L, 11L, 5L, 1L, 24L, 17L, 
8L, 3L), .Label = c("1992", "1993", "1994", "1995", "1996", "1997", 
"1998", "1999", "2000", "2000/1", "2001/2", "2002", "2003", "2003/4", 
"2004", "2005", "2005/6", "2006", "2006/7", "2007", "2007/8", 
"2008/9", "2009", "2010", "2011"), class = "factor"), Urban = c(47.8, 
24.2, 14.7870851371451, 16, 22.3863741902043, 20.7157413622361, 
39.7871349722997, 41.6793203690612, 35.8582154033059, 12.9365423294414, 
19.8478428266605, 11.9207464780274, 18.5676950229307, 27.6081260825543, 
22.9, 30.7, 29.8676525754328, 28.8769863995411, 36.7350808997634, 
23.6685395495197, 43.5924904552921, 15.3818829930695, 20.9, 28.4927963558185, 
16.7, 18.1130004296917, 25.3, 19, 32.2, 17.6, 29.7, 20.7, 22.5, 
29.6313219134818, 30.5, 31.6001273852453, 25, 32.9, 35.2, 16.3, 
33.1, 38.1, 33.7, 8.65178666948846, 7.3, 22.6, 34.5), Rural = c(47.6, 
32.7, 16.0373484732733, 18.9, 25.181276309133, 22.5886832681651, 
40.7533401938621, 42.4448145032958, 38.5298174751626, 16.2228067346473, 
26.465049129342, 8.41898094643249, 19.2257425400682, 29.635864119259, 
27.7, 35.1, 38.283749104983, 37.0204386868532, 40.4553536902836, 
23.6050855848523, 38.7593744908809, 16.2668541968914, 18.7, 36.7752452450324, 
15.6, 20.1615269604521, 26.4, 31, 42.1, 30.3, 21.2, 18.4, 24.9, 
31.338473181485, 30.3, 26.3897272106662, 43, 45.3, 47.8, 18.5, 
47.6, 41.4, 51.8, 10.1757289609584, 7.6, 27.3, 41.5), Total = c(47.6, 
29.8, 15.8764113925424, 16.9, 24.9, 22.2, 40.5686598193627, 42.3015496943942, 
38.2, 17.9, 25.5279161214695, 8.8, 19.1, 29.2, 27.1, 34.5, 37.1294935260729, 
36, 40.0371752616418, 23.6, 39.8, 15.9, 19.4, 34.0357824734553, 
15.8, 19.9, 26.2, 29.1, 41.6, 27.5, 22.9, 18.8, 24.4, 31, 30.3, 
27.5, 40.9, 43.9, 46.3, 17.8, 43.1, 40.1, 43.2, 9.72279457486365, 
7.5, 25.8, 39.7), Capital.City = c(62.5, 19.3, 11.8671319871973, 
15.7, 19.4996995263646, 16.99560676463, 38.4538776537224, 38.7931034482758, 
34.1584158415842, 13.2520773874409, 12.7659574468085, 15.6873992936943, 
14.9619843565563, 20.3036710627491, 19.5, NA, 32.011454861578, 
26.1111111111111, 33.2046332046333, 27.514648271213, 43.2946409100591, 
17.6134198692098, 21.2, 28.4927963558185, 17.4, 16.0904522908004, 
26.6, 22.7, 31.5, 13.4, NA, NA, 26.1, 29.1331564646512, 29.4, 
33.3949166628871, 18.9, 29.8, 30.5, 11, 32.3, 38.7, 26.3, 7.0408031903801, 
9.8, 26.7, 38.5), Other.Cities..towns = c(43.7, 27.7, 15.2821312519876, 
16.1, 22.8668864784677, 22.3913621285115, 40.4966412598569, 43.2711212401314, 
36.8054151596036, 12.6913635462951, 25.601742942828, 9.93629083208327, 
20.5991643631925, 31.5220710266449, NA, NA, 28.6811759983293, 
30.211847684812, 38.1705931383969, 22.8969512882811, 43.6420373588114, 
13.7990084372002, 20.4, 28.6992700604113, NA, 19.6243357194177, 
24.4, 17.5, 33, 22.1, NA, NA, 21.5, 29.7793073823722, 30.9, 31.1616047201402, 
29.9, 35.6, 39.4, 18.5, 33.4, 37.7, 36.3, 10.2081343223745, 5.3, 
18.9, 30), Non.slum = c(NA, NA, 10.6, 15, 19.9, 19.2, 31.7, 36.3, 
NA, 11.7, 18.8, NA, NA, NA, NA, NA, 24.7, 25.3, 33.7, 13.6, 35.1, 
15.6, 19.4, 36.8, NA, 15.4, 21.1, NA, 21.6, 18.1, NA, NA, 21.6, 
35.7, 24, 32, 15.9, 30.8, 28.4, 13.9, 28.5, 35.7, 29.3, 7.6, 
7.2, 21.8, 29.7), Slum = c(NA, NA, 15.4, 16.1, 22.6, 21.8, 42.8, 
43.7, NA, 13.6, 21.3, NA, NA, NA, NA, NA, 31.4, 31, 37, 24.2, 
44.8, 15.1, 23, 34, NA, 18.8, 27, NA, 34.9, 17.1, NA, NA, 22.8, 
26.5, 32.3, 31.5, 28.5, 34.2, 36, 17.9, 38.6, 39.3, 35.7, 10.1, 
7.4, 31.4, 41.1), latest = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L), .Label = c("n", "y"), class = "factor")), .Names = c("COUNTRY", 
"Year", "Urban", "Rural", "Total", "Capital.City", "Other.Cities..towns", 
"Non.slum", "Slum", "latest"), row.names = c(NA, -47L), class = "data.frame")

Solution

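You can use subset to select only those rows where both Non.slum and Slum are not NA. This subset, tmp, can then be used to remove rows with a duplicated COUNTRY. duplicated keeps the first row it meets for each country, and in your data the first remaining row of each country is its most recent one, so only the rows for the latest Year with data will remain.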

tmp <- subset(fever, !is.na(Non.slum) & !is.na(Slum))
res <- tmp[!duplicated(tmp$COUNTRY), ]

This returns:

     COUNTRY   Year     Urban    Rural     Total Capital.City Other.Cities..towns Non.slum Slum latest
3   Ethiopia   2011 14.787085 16.03735 15.876411    11.867132            15.28213     10.6 15.4      y
6      Kenya 2008/9 20.715741 22.58868 22.200000    16.995607            22.39136     19.2 21.8      y
10   Lesotho   2009 12.936542 16.22281 17.900000    13.252077            12.69136     11.7 13.6      y
17    Malawi   2004 29.867653 38.28375 37.129494    32.011455            28.68118     24.7 31.4      n
22   Namibia 2006/7 15.381883 16.26685 15.900000    17.613420            13.79901     15.6 15.1      y
26    Rwanda 2007/8 18.113000 20.16153 19.900000    16.090452            19.62434     15.4 18.8      n
30 Swaziland 2006/7 17.600000 30.30000 27.500000    13.400000            22.10000     18.1 17.1      y
33  Tanzania   2004 22.500000 24.90000 24.400000    26.100000            21.50000     21.6 22.8      n
37    Uganda   2006 25.000000 43.00000 40.900000    18.900000            29.90000     15.9 28.5      y
40    Zambia   2007 16.300000 18.50000 17.800000    11.000000            18.50000     13.9 17.9      y
44  Zimbabwe   2010  8.651787 10.17573  9.722795     7.040803            10.20813      7.6 10.1      y
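
The result above depends on the row order of the data. If the rows were not already listed most recent first, one option would be to sort before dropping duplicates. The following is only a sketch on a small, deliberately shuffled subset of the question's values. It assumes the levels of Year are in chronological order (set explicitly here), so that the integer codes of the factor can be used for sorting.

```r
# Toy subset of the question's data, with the rows shuffled on purpose
fever <- data.frame(
  COUNTRY  = c("Kenya", "Ethiopia", "Kenya", "Ethiopia", "Kenya", "Malawi", "Malawi"),
  Year     = factor(c("2003", "2005", "2008/9", "2011", "1993", "2010", "2004"),
                    # Assumption: levels are listed in chronological order
                    levels = c("1993", "2003", "2004", "2005", "2008/9", "2010", "2011")),
  Non.slum = c(31.7, 15.0, 19.2, 10.6, NA, NA, 24.7),
  Slum     = c(42.8, 16.1, 21.8, 15.4, NA, NA, 31.4)
)

# Keep only rows with data in both slum columns
tmp <- subset(fever, !is.na(Non.slum) & !is.na(Slum))

# Sort by country, then by factor level code of Year, most recent first
tmp <- tmp[order(tmp$COUNTRY, -as.integer(tmp$Year)), ]

# The first row of each country is now its most recent row with data
res <- tmp[!duplicated(tmp$COUNTRY), ]
print(res)
```

With the sort in place, the outcome no longer depends on how the rows happen to be arranged in the input.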

OTHER TIPS

Here are some directions you could follow. Since you asked for "a function to select only those rows where data exists in the Slum/Non.Slum column, but only a single entry per COUNTRY based on the most recent data available", I suggest:

  1. consider the complete.cases function in R to select only those rows where data exist in the Slum/Non.slum columns. To do this, first subset your data to these two columns (into a new data frame, say), and then apply complete.cases. The result is a vector of TRUE/FALSE that can be used to subset the complete dataset.

  2. since you want a single entry per COUNTRY from the subsetted data frame in step #1 above, and that single entry is based on the most recent year, you can first sort your data frame by the Year column in decreasing order (treating the year numbers as numeric) within each COUNTRY, and then select the first entry of each COUNTRY to guarantee you've selected the most recent year. One issue is that, in the Year sorting part, entries such as 2008/9 cannot be converted to a number directly. Be careful with these entries.
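
The two steps above could be sketched roughly as follows. This is an illustration on a few values taken from the question, not a tested solution for the full data. The start.year column is a helper introduced here, and ranking "2008/9" by its first four digits is an assumption about how such entries should be treated.

```r
# Toy subset of the question's data
fever <- data.frame(
  COUNTRY  = c("Kenya", "Ethiopia", "Kenya", "Ethiopia", "Malawi", "Malawi"),
  Year     = c("2003", "2005", "2008/9", "2011", "2010", "2004"),
  Non.slum = c(31.7, 15.0, 19.2, 10.6, NA, 24.7),
  Slum     = c(42.8, 16.1, 21.8, 15.4, NA, 31.4),
  stringsAsFactors = TRUE
)

# Step 1: TRUE where both slum columns hold data
keep <- complete.cases(fever[, c("Non.slum", "Slum")])
sub  <- fever[keep, ]

# Step 2: numeric year for sorting.
# Assumption: "2008/9" is ranked by its starting year, 2008.
sub$start.year <- as.numeric(substr(as.character(sub$Year), 1, 4))

# Most recent year first within each country
sub <- sub[order(sub$COUNTRY, -sub$start.year), ]

# Take the first row of each country
latest <- do.call(rbind, lapply(split(sub, sub$COUNTRY), head, n = 1))
print(latest)
```

Taking the first four characters sidesteps the conversion problem with entries like 2008/9, at the cost of ignoring the second year of the survey period.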

I suspect you want something more generic than the specific problem you asked about. Here is my attempt, which will also work on other variables (df below is your data frame, fever in the question).

Make a new ID by pasting country and year; we call it country.year

df$country.year<-paste(df$COUNTRY, df$Year, sep="-")

You said you don't want NAs in Slum and Non.slum, so we remove them to get a REDUCED dataset

df.red<-df[ !is.na(df$Slum) & !is.na(df$Non.slum) ,]

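We construct a lookup table that holds the most recent year of every country in the reduced dataset (since you don't want NAs in Slum/Non.slum). Year is a factor, so as.numeric returns its level codes rather than the years themselves; because the levels are in chronological order, the largest code still marks the most recent year

lookup<-tapply(as.numeric(df.red$Year), df.red$COUNTRY, max)

We reformat the lookup table into country-year IDs, translating each level code back to its year label

lookup<-paste(rownames(lookup), levels(df.red$Year)[lookup], sep="-")

Now we keep only those rows of the REDUCED dataset whose country-year appears in the lookup

df.reconstructed<-df.red[  df.red$country.year %in% lookup  , ]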
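
One point worth checking when working with Year: it is stored as a factor, and as.numeric on a factor returns the position of each value among the levels, not the year printed on screen. A small sketch of the difference, using a made-up vector of four years from the question:

```r
# Four year labels from the question, stored as a factor
Year <- factor(c("2011", "2005", "2000", "2008/9"))

# Level codes: positions among the sorted levels, not years
codes <- as.numeric(Year)
print(codes)

# The labels themselves
labels <- as.character(Year)
print(labels)

# Converting the labels to numbers fails for "2008/9" (gives NA with a warning)
years <- suppressWarnings(as.numeric(labels))
print(years)

# Label of the highest level code; this is the most recent year
# only if the levels are in chronological order
most.recent <- levels(Year)[max(as.integer(Year))]
print(most.recent)
```

The codes are still useful for ordering, but they need to be mapped back through levels() before being compared with the printed years.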

Hope it helps.

Licensed under: CC-BY-SA with attribution