Data set containing columns of unequal length to long form in R

https://stackoverflow.com/questions/22615406

r
reshape

20-06-2023

Question

Edited to clarify that I would like NAs removed in the final data frame. The NAs were only added upon import to avoid dealing with blanks. They do not have any significance beyond that.

I have a data set (CSV file) consisting of columns of character vectors, each of a different length. I would like to combine them into long form. (I believe "long form" is the correct term in my case, but please correct me if I am wrong.) Below is a simple example to illustrate what I want.

When I imported my data, I filled the missing spaces with NA to avoid dealing with blanks which have caused me problems in the past. The following code simulates how the data would look upon import after filling the NAs:

Set1 <- c("A", "F", "R", "G", NA, NA, NA, NA)
Set2 <- c("G", "Q", "U", "I", "G", "D", "K", "B")
Set3 <- c("V", "S", "M", "J", "K", "L", NA, NA)
dat <- data.frame(Set1, Set2, Set3)

Which gives the following R console output:

  Set1 Set2 Set3
1    A    G    V
2    F    Q    S
3    R    U    M
4    G    I    J
5 <NA>    G    K
6 <NA>    D    L
7 <NA>    K <NA>
8 <NA>    B <NA>

I would like the data to appear in two-column format with the NAs removed. The first column will contain the number of the column in which the letter appears. The second column will contain the original columns stacked on top of each other. It would look like this:

   Col Char
1    1    A
2    1    F
3    1    R
4    1    G
5    2    G
6    2    Q
7    2    U
8    2    I
9    2    G
10   2    D
11   2    K
12   2    B
13   3    V
14   3    S
15   3    M
16   3    J
17   3    K
18   3    L

I have managed to make this work by a combination of the stack function, removing NAs, and a bit of code to count the number of occurrences to put them into the first column. This seems overly cumbersome and I would like to know if there is a better way to do this or a better way to handle the kind of data I have to deal with. A data frame does not seem to be the best way since the columns are different lengths but I do not know of any suitable alternatives.

The reason I need the data in this format is so I can plot it in ggplot2. There are actually corresponding numerical values for each letter that I left out of the example above for simplicity. The final result with my actual dataset will be a dot plot with the column number on the x axis, the numerical value on the y axis, and points color coded by the character values.

Thank you for your help.

Solution 3

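Here are some approaches that produce the two-column output shown in the question, given dat. The column names differ from Col and Char: stack returns values and ind (in that order), and reshape returns time and Set1.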

stack

transform(na.omit(stack(lapply(dat, as.character))), ind = as.numeric(ind))
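To see what the one-liner does, it may help to split it into steps. The following is only a sketch, assuming dat as built in the question; the intermediate name long, the name result, and the final renaming to Col and Char are additions for illustration and not part of the original answer.

```r
# Recreate the example data from the question
Set1 <- c("A", "F", "R", "G", NA, NA, NA, NA)
Set2 <- c("G", "Q", "U", "I", "G", "D", "K", "B")
Set3 <- c("V", "S", "M", "J", "K", "L", NA, NA)
dat <- data.frame(Set1, Set2, Set3, stringsAsFactors = FALSE)

# as.character() guards against factor columns, which stack() would ignore
long <- stack(lapply(dat, as.character))  # columns: values, ind (factor of column names)
long <- na.omit(long)                     # drop the rows that hold padding NAs

# The factor's integer codes (1, 2, 3) give the column number;
# rename and reorder to match the layout requested in the question
result <- data.frame(Col = as.integer(long$ind),
                     Char = long$values,
                     stringsAsFactors = FALSE)
```

Building a fresh data frame at the end also resets the row names, which na.omit leaves with gaps.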

reshape

na.omit(reshape(dat, dir = "long", varying = list(names(dat)))[1:2])
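The reshape call above relies on partial argument matching (dir for direction) and on the default output names. A more explicit variant might look like the sketch below. It assumes dat from the question and that the names Col and Char are wanted; v.names, timevar and idvar are documented arguments of reshape, while the id column name "row" and the name result are arbitrary choices made here.

```r
# Recreate the example data from the question
Set1 <- c("A", "F", "R", "G", NA, NA, NA, NA)
Set2 <- c("G", "Q", "U", "I", "G", "D", "K", "B")
Set3 <- c("V", "S", "M", "J", "K", "L", NA, NA)
dat <- data.frame(Set1, Set2, Set3, stringsAsFactors = FALSE)

long <- reshape(dat,
                direction = "long",
                varying   = list(names(dat)),  # all columns feed one stacked variable
                v.names   = "Char",            # name of the stacked value column
                timevar   = "Col",             # position of the source column (1, 2, 3)
                idvar     = "row")             # arbitrary id column name, dropped below

# Select by name rather than position, then drop the padding NAs
result <- na.omit(long[c("Col", "Char")])
rownames(result) <- NULL
```

Selecting the columns by name avoids depending on where reshape places the generated id column.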

Other tips

Here's another option, if you put your input into a list first.

sets <- list(Set1 = c("A", "F", "R", "G"),
             Set2 = c("G", "Q", "U", "I", "G", "D", "K", "B"),
             Set3 = c("V", "S", "M", "J", "K", "L"))

data.frame(Col=rep(seq_along(sets), sapply(sets, length)), Char=unlist(sets))
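If the data already sits in a padded data frame such as dat, the list does not have to be typed by hand; it can be derived from the data frame. A minimal sketch, assuming dat from the question (the name result is illustrative). lengths() is the base R shorthand for sapply(sets, length) and needs R 3.2.0 or later.

```r
# Recreate the padded data frame from the question
Set1 <- c("A", "F", "R", "G", NA, NA, NA, NA)
Set2 <- c("G", "Q", "U", "I", "G", "D", "K", "B")
Set3 <- c("V", "S", "M", "J", "K", "L", NA, NA)
dat <- data.frame(Set1, Set2, Set3, stringsAsFactors = FALSE)

# Strip the padding NAs column by column; the result is a ragged named list
sets <- lapply(dat, function(x) x[!is.na(x)])

# Repeat each column index once per remaining element, then flatten the list
result <- data.frame(Col  = rep(seq_along(sets), lengths(sets)),
                     Char = unlist(sets, use.names = FALSE),
                     stringsAsFactors = FALSE)
```

A list is also a natural container for columns of unequal length, since its elements need not share a length the way data frame columns do.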

n <- 3 # Number of Set1, Set2, ... vectors in the workspace. They are expected to be unpadded (no NAs).
# If the count is not known in advance (e.g. another user supplies the vectors),
# derive it from the workspace instead:
# n <- max(as.integer(gsub('Set', '', ls()[grepl('^Set[0-9]+$', ls())])))
dat <- do.call(rbind, lapply(seq_len(n), function(ind) {
  set <- get(paste0("Set", ind)) # Fetch SetX where X is the current index
  set <- set[!is.na(set)] # remove NAs just in case. Delete this line if no Sets have any
  data.frame(Col = rep.int(ind, length(set)), Char = set)
}))
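The loop over get() can also be written with mget(), which fetches several workspace objects at once as a named list. This is a sketch of an alternative, not the answer's own method; it assumes vectors named Set1 to Set3 exist in the current environment, and set_names and result are illustrative names.

```r
# Example vectors; in practice these would already exist in the workspace
Set1 <- c("A", "F", "R", "G")
Set2 <- c("G", "Q", "U", "I", "G", "D", "K", "B")
Set3 <- c("V", "S", "M", "J", "K", "L")

set_names <- paste0("Set", 1:3)
sets <- mget(set_names, envir = environment())  # named list: Set1, Set2, Set3
sets <- lapply(sets, function(x) x[!is.na(x)])  # strip NAs in case any were padded

# One row per element, tagged with the index of the vector it came from
result <- data.frame(Col  = rep(seq_along(sets), lengths(sets)),
                     Char = unlist(sets, use.names = FALSE),
                     stringsAsFactors = FALSE)
```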

Licensed under: CC-BY-SA with attribution
