Data munging in R: Subsetting and arranging vectors of uneven length

https://stackoverflow.com/questions/23103493

04-07-2023

Question

I am sorry I could not make a more specific title. I am trying to wean myself off of spreadsheets for the more difficult tasks and this one is giving me particular trouble - I can do it in Excel but I don't really know how to begin in R. It is somewhat hard to describe. I imagine a mix of techniques could be involved here so I hope this is of use to others.

I have data that comes in the following form from a spreadsheet:

Data:

1   GOEK, WOWP, PEOL, WJRN, KENC, QPOE, JFPG, PWKR, PWEOR, JFOKE, POQK, LSPF, PEKF,PFOW, VCNS, ALAO, LFOD
2   KFDL, LFOD, WOWP, PWEO, PWEOR, PRCP, ALPQ, JFOKE, ALLF, VCNS, CNIR
3   KJTJ, FKOF, VCNS, FLEP
4   FKKF, EPTR
5   QPOE,  PEOL, WJRN, VCNS, PEKF, PFPW

And this data is associated with the following key:

Key:

Items   A   B   C
ALAO    NA  0.12246503  0.137902549
ALLF    0.016262491 0.557522799 0.622560763
ALPQ    0.409770566 0.770904525 NA
CNIR    NA  0.38075281  0.698236443
EPTR    0.718354484 0.290028597 0.525661861
FKKF    0.801489091 0.878405308 0.645004844
FKOF    0.643251028 0.131643544 NA
FLEP    0.018262707 0.211220859 0.457302727
GOEK    0.902121539 NA  NA
JFOKE   0.808410498 0.301443669 0.575188395
JFPG    NA  NA  0.343824191
KENC    0.882285296 0.372821865 0.593742731
KFDL    0.077569421 0.076497291 NA
KJTJ    0.249613609     0.227241864 NA
LFOD    NA  0.000343115 0.329546051
LSPF    0.088451014 0.65148309  0.267490643
PEKF    0.645309773 NA  0.116601451
PEOL    0.626916187 0.093812247 0.152577881
PFOW    0.86690534  0.596673645 NA
PFPW    NA  0.018869604 NA
POQK    0.683221579 NA  0.472456955
PRCP    0.486488748 0.860947689 0.097916066
PWEO    0.665854791 0.814111848 0.026085774
PWEOR   0.611034332 0.17254104  0.212386401
PWKR    NA  NA  0.357298987
QPOE    0.815885005 0.083834541 NA
VCNS    0.394817612 0.250760686 0.419539549
WJRN    0.403002388 0.705142265 0.768961818
WOWP    0.794250738 NA  0.967405211

Here is the general approach:

Each row shown in data comes from one cell of a spreadsheet so it would be interpreted by R as one string if imported directly. Split the string for each row into a form that can be stored as a vector in R.
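As a minimal sketch of this first step, assuming one cell has already been read into R as a single string (the variable names `cell` and `items` are only illustrative), the split could look like this in base R:

```r
# One spreadsheet cell arrives as a single string (row 5 of the data)
cell <- "QPOE,  PEOL, WJRN, VCNS, PEKF, PFPW"

# Split on commas, then trim the whitespace left around each item
items <- trimws(strsplit(cell, ",", fixed = TRUE)[[1]])
items
```

`strsplit` returns a list with one element per input string, so the same call applied to a vector of 400 cells would give a list of 400 character vectors.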

Filter each row's items into three categories (A, B, and C) according to the key: an item belongs to a category if its value in that column of the key is not NA. For example, the 5th row of data contains QPOE, PEOL, WJRN, VCNS, PEKF, PFPW. Looking these up in the key gives three subsets:

A   QPOE PEOL WJRN VCNS PEKF
B   QPOE PEOL WJRN VCNS PFPW
C   PEOL WJRN VCNS PEKF

Now that we have divided up row 5 of our data into its respective categories, we can make a separate table for this row that includes the associated value:

A   0.815885005 0.626916187 0.403002388 0.394817612 0.645309773
B   0.083834541 0.093812247 0.705142265 0.250760686 0.018869604
C   0.152577881 0.768961818 0.419539549 0.116601451
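The filtering described above can be sketched in base R for row 5 alone. The data frame `key5` below is an assumed stand-in that holds only the six key entries for that row (values copied from the key), and `items_by_cat` and `values_by_cat` are illustrative names:

```r
# Key entries for the six items in row 5 (values copied from the key above)
key5 <- data.frame(
  Items = c("QPOE", "PEOL", "WJRN", "VCNS", "PEKF", "PFPW"),
  A = c(0.815885005, 0.626916187, 0.403002388, 0.394817612, 0.645309773, NA),
  B = c(0.083834541, 0.093812247, 0.705142265, 0.250760686, NA, 0.018869604),
  C = c(NA, 0.152577881, 0.768961818, 0.419539549, 0.116601451, NA),
  stringsAsFactors = FALSE
)

# For each category column, keep the items (and values) that are not NA
items_by_cat  <- lapply(key5[c("A", "B", "C")], function(v) key5$Items[!is.na(v)])
values_by_cat <- lapply(key5[c("A", "B", "C")], function(v) v[!is.na(v)])

items_by_cat$C   # items that have a value in category C
values_by_cat$C  # the matching values, in the same order
```

Both results are plain lists with elements named A, B, and C, so their differing lengths are not a problem at this stage.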

So we have a kind of hash table... sort of. Now I want to store these values in one table. It would essentially look something like this in the final form (shown for row 5 of data only):

Cat  A Item  A Value      B Item  B Value      C Item  C Value
5    QPOE    0.815885005  QPOE    0.083834541  PEOL    0.152577881
5    PEOL    0.626916187  PEOL    0.093812247  WJRN    0.768961818
5    WJRN    0.403002388  WJRN    0.705142265  VCNS    0.419539549
5    VCNS    0.394817612  VCNS    0.250760686  PEKF    0.116601451
5    PEKF    0.645309773  PFPW    0.018869604  NA      NA
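One way such a padded table could be assembled for a single row is sketched below. This is only an illustration under stated assumptions: `key5` is a stand-in holding the six key entries for row 5, and `pad`, `cols`, and `wide` are hypothetical names.

```r
# Key entries for the six items in row 5 (values copied from the key above)
key5 <- data.frame(
  Items = c("QPOE", "PEOL", "WJRN", "VCNS", "PEKF", "PFPW"),
  A = c(0.815885005, 0.626916187, 0.403002388, 0.394817612, 0.645309773, NA),
  B = c(0.083834541, 0.093812247, 0.705142265, 0.250760686, NA, 0.018869604),
  C = c(NA, 0.152577881, 0.768961818, 0.419539549, 0.116601451, NA),
  stringsAsFactors = FALSE
)
cats  <- c("A", "B", "C")
n_max <- max(colSums(!is.na(key5[cats])))  # length of the longest category

# Pad a vector with NA up to length n
pad <- function(x, n) c(x, rep(NA, n - length(x)))

# Build an "Item"/"Value" column pair per category, padded to n_max rows
cols <- lapply(cats, function(k) {
  keep <- !is.na(key5[[k]])
  out <- data.frame(pad(key5$Items[keep], n_max),
                    pad(key5[[k]][keep], n_max),
                    stringsAsFactors = FALSE)
  names(out) <- paste(k, c("Item", "Value"))
  out
})

# Bind the pairs side by side and prepend the row number
wide <- cbind(Cat = 5, do.call(cbind, cols))
wide
```

The NA padding in the last row of the C columns is exactly the filler that a wide layout forces on the shorter categories.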

In reality, I have 400 rows of "Cat" in data not just 5.

Is this the best way to store the data for easy reference? Would a nested list be preferred like so?

Cat Row 1
- A Items
  - Values
- B Items
  - Values
- C Items
  - Values
Cat Row 2...
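For comparison, a nested list of this shape might be built as follows. This is a rough sketch for row 3 only: `key3` is an assumed stand-in holding the four key entries for that row, and storing each category as a named numeric vector (item name mapped to value) is one possible design, not the only one.

```r
# Key entries for the four items in row 3 (values copied from the key above)
key3 <- data.frame(
  Items = c("KJTJ", "FKOF", "VCNS", "FLEP"),
  A = c(0.249613609, 0.643251028, 0.394817612, 0.018262707),
  B = c(0.227241864, 0.131643544, 0.250760686, 0.211220859),
  C = c(NA, NA, 0.419539549, 0.457302727),
  stringsAsFactors = FALSE
)
cats <- c("A", "B", "C")

# One named numeric vector per category: names are items, NAs are dropped
row3 <- lapply(setNames(cats, cats), function(k) {
  keep <- !is.na(key3[[k]])
  setNames(key3[[k]][keep], key3$Items[keep])
})

# Outer list keyed by the row number of the data
nested <- list("3" = row3)

names(nested[["3"]]$C)     # items with a C value in row 3
nested[["3"]]$A[["FKOF"]]  # look up a single value by item name
```

No padding is needed here because each category keeps its own length, at the cost of needing `lapply`-style code for anything that spans rows.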

I am just hesitant to make data frames for this data because there is so much variability in the length of the rows in my original data when divided into A, B, and C. The shortest ones would have to have NA's to fill up to the length of the longest ones to fit in the data frame. Something about this just makes me uncomfortable.

I can always look up the functions used in the answer and figure it out, so an in-depth explanation is not necessary unless you are feeling particularly generous! Thank you for your time.

Solution

I think this is what I'd do, although it returns the answer in a slightly different form than you asked for. My approach is to avoid ragged arrays (ones with different column lengths) by keeping everything in long format: one row per item and category, with no NA padding.

Start with your data:

d <- c("GOEK, WOWP, PEOL, WJRN, KENC, QPOE, JFPG, PWKR, PWEOR, JFOKE, POQK, LSPF, PEKF,PFOW, VCNS, ALAO, LFOD",
"KFDL, LFOD, WOWP, PWEO, PWEOR, PRCP, ALPQ, JFOKE, ALLF, VCNS, CNIR",
"KJTJ, FKOF, VCNS, FLEP", "FKKF, EPTR", "QPOE,  PEOL, WJRN, VCNS, PEKF, PFPW")

key <- structure(list(Items = c("ALAO", "ALLF", "ALPQ", "CNIR", "EPTR",
"FKKF", "FKOF", "FLEP", "GOEK", "JFOKE", "JFPG", "KENC", "KFDL",
"KJTJ", "LFOD", "LSPF", "PEKF", "PEOL", "PFOW", "PFPW", "POQK",
"PRCP", "PWEO", "PWEOR", "PWKR", "QPOE", "VCNS", "WJRN", "WOWP"
), A = c(NA, 0.016262491, 0.409770566, NA, 0.718354484, 0.801489091,
0.643251028, 0.018262707, 0.902121539, 0.808410498, NA, 0.882285296,
0.077569421, 0.249613609, NA, 0.088451014, 0.645309773, 0.626916187,
0.86690534, NA, 0.683221579, 0.486488748, 0.665854791, 0.611034332,
NA, 0.815885005, 0.394817612, 0.403002388, 0.794250738), B = c(0.12246503,
0.557522799, 0.770904525, 0.38075281, 0.290028597, 0.878405308,
0.131643544, 0.211220859, NA, 0.301443669, NA, 0.372821865, 0.076497291,
0.227241864, 0.000343115, 0.65148309, NA, 0.093812247, 0.596673645,
0.018869604, NA, 0.860947689, 0.814111848, 0.17254104, NA, 0.083834541,
0.250760686, 0.705142265, NA), C = c(0.137902549, 0.622560763,
NA, 0.698236443, 0.525661861, 0.645004844, NA, 0.457302727, NA,
0.575188395, 0.343824191, 0.593742731, NA, NA, 0.329546051, 0.267490643,
0.116601451, 0.152577881, NA, NA, 0.472456955, 0.097916066, 0.026085774,
0.212386401, 0.357298987, NA, 0.419539549, 0.768961818, 0.967405211
)), .Names = c("Items", "A", "B", "C"), class = "data.frame", row.names = c(NA, -29L))

# Split each string on commas, as you suggest
d <- strsplit(d, ",")
d <- lapply(d, gsub, pattern = " ", replacement = "") # Remove all spaces (the item codes contain none)

# Convert key to a long data.frame with no NAs
library(reshape2)
# One row per (Items, letter) pair; you might have a better name than "letter"
key <- melt(key, id.vars = "Items", variable.name = "letter")
key <- key[complete.cases(key), ]

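# Extract the matching key rows for each row of data (one data.frame per row)
result <- lapply(d, function(x) key[key$Items %in% x, ])
result[[5]] # e.g. the items, letters and values for row 5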
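If a single table is wanted rather than a list, the per-row pieces can be tagged with their row number and stacked. The sketch below is self-contained and makes several assumptions: it uses toy stand-ins for rows 3 and 4 only, builds the long key in base R instead of with `melt`, and the names `cat_id`, `long_key`, `pieces`, and `long` are illustrative.

```r
# Toy stand-ins: the split items for rows 3 and 4, and their key entries
d <- list(c("KJTJ", "FKOF", "VCNS", "FLEP"), c("FKKF", "EPTR"))
cat_id <- c(3L, 4L)  # which row of the original data each element came from
key <- data.frame(
  Items = c("EPTR", "FKKF", "FKOF", "FLEP", "KJTJ", "VCNS"),
  A = c(0.718354484, 0.801489091, 0.643251028, 0.018262707, 0.249613609, 0.394817612),
  B = c(0.290028597, 0.878405308, 0.131643544, 0.211220859, 0.227241864, 0.250760686),
  C = c(0.525661861, 0.645004844, NA, 0.457302727, NA, 0.419539549),
  stringsAsFactors = FALSE
)

# Base-R version of the long key: one row per (item, letter) pair, NAs dropped
long_key <- data.frame(
  Items  = rep(key$Items, times = 3),
  letter = rep(c("A", "B", "C"), each = nrow(key)),
  value  = c(key$A, key$B, key$C),
  stringsAsFactors = FALSE
)
long_key <- long_key[!is.na(long_key$value), ]

# Subset per data row, tag each piece with its row number, then stack
pieces <- lapply(seq_along(d), function(i) {
  hit <- long_key[long_key$Items %in% d[[i]], ]
  hit$Cat <- rep(cat_id[i], nrow(hit))  # also safe when nothing matches
  hit[c("Cat", "Items", "letter", "value")]
})
long <- do.call(rbind, pieces)

# Look up everything in category C for row 3
long[long$Cat == 3 & long$letter == "C", ]
```

Because every row of `long` is a complete (Cat, Items, letter, value) record, rows of very different lengths sit in one data frame without any NA filler.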

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow