Count missing values after first recorded measurement

https://stackoverflow.com/questions/22557231

18-06-2023
Question

I have environmental data with missing values. Measurement of some of these variables started in different years.

With the script "sapply(df, function(x) sum(is.na(x)))" I get the number of missing values for each column. What I want, however, is to count the missing values only from the point at which at least one measurement is available. For example, for o3 there should be only 3 missing values, counted from when the o3 measurement started. In addition, I want to extract the first date on which a measurement is available (for example, temp starts on 01-03-1990 and o3 on 09-03-1990). In short, I want to:

1.  Extract the first date of available measurement for each column.
2.  Count the number of missing values after at least one measurement is available.

Sample data follows

> dput(df)
structure(list(date = structure(c(7364, 7365, 7366, 7367, 7368, 
7369, 7370, 7371, 7372, 7373, 7374, 7375, 7376, 7377, 7378, 7379, 
7380, 7381, 7382, 7383, 7384), class = "Date"), no2 = c(51.7008334795634, 
33.8999998569489, 29.7854166030884, 29.0558333396912, 28.5108333031336, 
31.9637500842412, 36.1283330917358, 24.6608331998189, 33.2682609558105, 
NA, NA, NA, 53.1133330663045, 54.1575004259745, 43.7712502479553, 
31.0166666905085, 31.9995832443237, 33.3491666316986, NA, NA, 
35.5604347353396), temp = c(1.12583327293396, 0.230416655540466, 
-0.415833324193954, 3.50333333015442, 4.88708353042603, 3.54916667938232, 
2.15291666984558, 6.84916687011719, 3.79416656494141, 1.50416672229767, 
0.736666679382324, 3.33291673660278, -0.466250002384186, 1.47374999523163, 
6.84124994277954, 9.93249988555908, NA, NA, NA, 6.88000011444092, 
6.19999980926514), humidity = c(NA, 75.1428604125977, 64.375, 
NA, 82.125, 61.375, 71.5, 68.25, NA, 74, 82.375, 82.5, 60.875, 
80, 82.625, 88.75, 78.5, 73.125, 68.5, 49.2811088562012, 79.8091659545898
), o3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 63.0712509155273, 69.6487503051758, 
60.903751373291, NA, 72.942497253418, NA, NA, 66.2587509155273, 
78.3262481689453, 101.066246032715, 112.137496948242, 77.0224990844727, 
68.5950012207031)), .Names = c("date", "no2", "temp", "humidity", 
"o3"), row.names = c("60", "61", "62", "63", "64", "65", "66", 
"67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", 
"78", "79", "80"), class = "data.frame")

Solution

To get the row index of the first non-missing value in each column, and from it the date of the first measurement:

first <- sapply(df, function(x) which(!is.na(x))[1])
dateOfFirst <- df$date[first]
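
As a small self-contained check of this indexing step, here is a sketch on a toy data frame. The `toy` data, its values, and the names `first_idx` and `first_date` are made up for illustration and are not the question's data. It skips the date column and keeps the column names on the result, which is a convenience choice rather than part of the original answer.

```r
# Toy data, invented for illustration; not the data from the question.
toy <- data.frame(
  date = as.Date("1990-03-01") + 0:5,
  a    = c(1.5, NA, 2.0, 3.1, NA, 4.2),
  b    = c(NA, NA, 7.0, NA, 8.5, 9.1)
)

# Row index of the first non-missing value in each measurement column.
# The date column is skipped because it is the lookup key, not a measurement.
first_idx <- sapply(toy[-1], function(x) which(!is.na(x))[1])

# Index the date column by those positions and keep the column names.
first_date <- setNames(toy$date[first_idx], names(first_idx))
print(first_date)
```

If a column contains no measurements at all, `which(!is.na(x))[1]` is NA, so the corresponding date is NA as well.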

The number of NAs after the first measurement is then the total number of NAs minus the length of the initial run of NAs, which is first - 1:

numberOfMissing <- sapply(df, function(x) sum(is.na(x))) - (first-1)
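
The subtraction can be cross-checked against a direct count. The sketch below, again on invented toy data, compares it with an alternative that is not from the original answer: masking with `cumsum(!is.na(x)) > 0`, which is TRUE from the first observed value onwards. The names `by_subtraction` and `by_mask` are illustrative only.

```r
# Toy data, invented for illustration; not the data from the question.
toy <- data.frame(
  a = c(1.5, NA, 2.0, 3.1, NA, 4.2),   # no leading NAs, 2 gaps
  b = c(NA, NA, 7.0, NA, 8.5, 9.1),    # 2 leading NAs, 1 gap
  c = rep(NA_real_, 6)                 # never measured
)

# Approach from the answer: total NAs minus the length of the leading run.
first_idx <- sapply(toy, function(x) which(!is.na(x))[1])
by_subtraction <- sapply(toy, function(x) sum(is.na(x))) - (first_idx - 1)

# Alternative: count only the NAs that occur once a value has been seen.
# cumsum(!is.na(x)) > 0 is TRUE from the first observation onwards.
by_mask <- sapply(toy, function(x) sum(is.na(x) & cumsum(!is.na(x)) > 0))

print(by_subtraction)  # NA for a column with no measurements
print(by_mask)         # 0 for a column with no measurements
```

The two approaches agree whenever a column has at least one measurement. They differ only for a column that is entirely NA, where the subtraction yields NA and the mask yields 0; which is preferable depends on how such columns should be reported.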

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow