I have environmental data with missing values. The measurements of some of these variables started in different years.

With the script "sapply(df, function(x) sum(is.na(x)))" I get the number of missing values for each column. However, I want to count missing values only from the point in time when at least one measurement was available. For example, for o3 the count should be only 3, counted from when the measurement of o3 started. In addition, I want to extract the first date on which a measurement is available (for example, temp on 01-03-1990 and o3 on 09-03-1990). In short, I want to:

1.  Extract the first date of available measurement for each column.
2.  Count the number of missing values after at least one measurement is available.

Sample data follows

> dput(df)
structure(list(date = structure(c(7364, 7365, 7366, 7367, 7368, 
7369, 7370, 7371, 7372, 7373, 7374, 7375, 7376, 7377, 7378, 7379, 
7380, 7381, 7382, 7383, 7384), class = "Date"), no2 = c(51.7008334795634, 
33.8999998569489, 29.7854166030884, 29.0558333396912, 28.5108333031336, 
31.9637500842412, 36.1283330917358, 24.6608331998189, 33.2682609558105, 
NA, NA, NA, 53.1133330663045, 54.1575004259745, 43.7712502479553, 
31.0166666905085, 31.9995832443237, 33.3491666316986, NA, NA, 
35.5604347353396), temp = c(1.12583327293396, 0.230416655540466, 
-0.415833324193954, 3.50333333015442, 4.88708353042603, 3.54916667938232, 
2.15291666984558, 6.84916687011719, 3.79416656494141, 1.50416672229767, 
0.736666679382324, 3.33291673660278, -0.466250002384186, 1.47374999523163, 
6.84124994277954, 9.93249988555908, NA, NA, NA, 6.88000011444092, 
6.19999980926514), humidity = c(NA, 75.1428604125977, 64.375, 
NA, 82.125, 61.375, 71.5, 68.25, NA, 74, 82.375, 82.5, 60.875, 
80, 82.625, 88.75, 78.5, 73.125, 68.5, 49.2811088562012, 79.8091659545898
), o3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 63.0712509155273, 69.6487503051758, 
60.903751373291, NA, 72.942497253418, NA, NA, 66.2587509155273, 
78.3262481689453, 101.066246032715, 112.137496948242, 77.0224990844727, 
68.5950012207031)), .Names = c("date", "no2", "temp", "humidity", 
"o3"), row.names = c("60", "61", "62", "63", "64", "65", "66", 
"67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", 
"78", "79", "80"), class = "data.frame")

Solution

To get the row index of the first non-missing value in each column, and the date it corresponds to:

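# row index of the first non-NA entry in each column
first <- sapply(df, function(x) which(!is.na(x))[1])
# date corresponding to that row, one per column (in column order)
dateOfFirst <- df$date[first]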

The number of NAs after the initial run of NAs is then the total number of NAs in the column minus the length of that initial run, which is first - 1:

numberOfMissing <- sapply(df, function(x) sum(is.na(x))) - (first-1)
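As a quick sanity check, the two steps can be run on a small made-up data frame where the expected answers are easy to see by eye. This is only a sketch: the `toy` data frame and the `vars` helper are hypothetical names introduced here, and the use of `setNames()` to label the dates by column is an addition, not part of the answer above. It also assumes every measurement column has at least one non-NA value; for an all-NA column `first` would be NA.

```r
# Toy data (hypothetical, not the question's data): 'b' starts on the third day
toy <- data.frame(
  date = as.Date("1990-03-01") + 0:5,
  a = c(1, NA, 3, 4, NA, 6),
  b = c(NA, NA, 5, NA, 7, 8)
)

# Measurement columns only; the date column itself is skipped
vars <- setdiff(names(toy), "date")

# Row index of the first non-NA value in each measurement column
first <- sapply(toy[vars], function(x) which(!is.na(x))[1])

# Date of that first measurement, labelled by column name
dateOfFirst <- setNames(toy$date[first], vars)

# NAs counted only from the first measurement onwards:
# total NAs minus the length of the leading run of NAs
numberOfMissing <- sapply(toy[vars], function(x) sum(is.na(x))) - (first - 1)

print(dateOfFirst)
print(numberOfMissing)
```

Here `a` is measured from the first row and has two gaps, so its count is 2; `b` starts on 1990-03-03 and has one gap afterwards, so its count is 1 even though the column contains three NAs in total.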
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow