Question

I have a data set with this structure:

  region1 region2 region3
1      10       5       5
2       8      10       8
3      13      15      12
4       3      17      11
5      17               9
6      12              15
7       4              
8      18              
9       1               

I need:

   item region1 region2 region3
1     1       1       0       0
2     3       1       0       0
3     4       1       0       0
4     5       0       1       1
5     8       1       0       1
6     9       0       0       1
7    10       1       1       0
8    11       0       0       1
9    12       1       0       1
10   13       1       0       0
11   15       0       1       1
12   17       1       1       0
13   18       1       0       0

My plan was to get a distinct list of items, left join each region as its own column, and replace matches with 1 and missing values with 0. I must be missing a key point about R's merge, though, because the result drops the main column of interest. Any advice is greatly appreciated! I'd prefer an R solution, but my next step would be to look into the sqldf package.

#read in data
regions <- read.csv("c:/data/regions.csv")

#get unique list of items from all regions
items <- na.omit(unique(stack(regions)[1]))

#merge distinct items with each region, replace matches with 1, missings with 0
merge.test <- merge(items,regions,by.x="values", by.y=c("region1"), all=TRUE)

Solution 2

The existing answers are fine, but they seem too complicated. Just try stack + table instead (dat is the list constructed under OTHER TIPS below):

table(stack(dat))
#       ind
# values region1 region2 region3
#     1        1       0       0
#     3        1       0       0
#     4        1       0       0
#     5        0       1       1
#     8        1       0       1
#     9        0       0       1
#     10       1       1       0
#     11       0       0       1
#     12       1       0       1
#     15       0       1       1
#     17       1       1       0
#     18       1       0       0
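To get from that table to the layout asked for in the question (an item column followed by one 0/1 column per region), something along these lines should work. It is only a sketch: it assumes dat is the list used in this answer, and the names counts, flags and result are made up for illustration. Clamping with > 0 also guards against an item appearing twice in one region.

```r
dat <- list(region1 = c(10, 8, 3, 17, 12, 4, 18, 1),
            region2 = c(5, 10, 15, 17),
            region3 = c(5, 8, 12, 11, 9, 15))

# plain integer matrix of counts: one row per item, one column per region
counts <- unclass(table(stack(dat)))

# turn counts into 0/1 indicators (dimnames are kept)
flags <- (counts > 0) * 1L

# move the item values out of the row names into a proper column
result <- data.frame(item = as.numeric(rownames(flags)), flags, row.names = NULL)
result
```

The row names of the table are character strings, hence the as.numeric() when building the item column.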

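I'm also going to go out on a limb and say that, considering your current approach, you actually have a data.frame, not a list. Padding every vector with NA to a common length reproduces that situation:

# pad each vector with NA up to the length of the longest one,
# then combine the padded vectors into a data.frame
DAT <- dat
Len <- max(sapply(DAT, length))
DAT <- data.frame(lapply(DAT, function(x) { length(x) <- Len; x }))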

In that case, the solution is no different:

table(stack(DAT))
#       ind
# values region1 region2 region3
#     1        1       0       0
#     3        1       0       0
#     4        1       0       0
#     5        0       1       1
#     8        1       0       1
#     9        0       0       1
#     10       1       1       0
#     11       0       0       1
#     12       1       0       1
#     15       0       1       1
#     17       1       1       0
#     18       1       0       0
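The output is the same because table() leaves NA out by default (useNA = "no"), so the padding never shows up as a row. A small check on a made-up two-column data.frame (padded, stacked and tab are hypothetical names used only here):

```r
# region2 is one value short, so it is padded with NA
padded <- data.frame(region1 = c(10, 8, 3),
                     region2 = c(5, 10, NA))

stacked <- stack(padded)
anyNA(stacked$values)   # TRUE: the NA survives stack()

tab <- table(stacked)
rownames(tab)           # "3" "5" "8" "10": no NA row in the table
```

If you ever do want the NA row, table() accepts useNA = "ifany".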

OTHER TIPS

It helps to provide a reproducible example (i.e. give us an easy copy-paste command to construct your sample data).

You didn't say, so I'll guess that your data is in a list:

dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
            region2=c(5,10,15,17),
            region3=c(5,8,12,11,9,15))

First find all the items (there is perhaps no need to sort; I only did it because yours is sorted):

ids <- sort(unique(unlist(dat)))

Then, for each region, check which of the unique IDs appear in that region, coercing the logical TRUE/FALSE to 1/0 (you could leave them as TRUE/FALSE if that would do for you):

data.frame(ids,
    region1=as.integer(ids %in% dat$region1),
    region2=as.integer(ids %in% dat$region2),
    region3=as.integer(ids %in% dat$region3))

If you have just 3 regions that's OK; if you have more, you might want to automate the typing:

cols <- lapply(dat, function (region) as.integer(ids %in% region))
cols$id <- ids
df <- do.call(data.frame, cols)

where do.call calls the data.frame function with the list cols as its (named) arguments, i.e. it just does

data.frame(region1=..., region2=..., region3=..., id=...)

(id ends up as the last column because it was appended to cols after the regions).
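If you would rather have the item column first, one possible variant is to build the indicator columns as a matrix with vapply and prepend the ids. This is a sketch rather than the approach above; flags and df are illustrative names, and dat is the same sample list.

```r
dat <- list(region1 = c(10, 8, 3, 17, 12, 4, 18, 1),
            region2 = c(5, 10, 15, 17),
            region3 = c(5, 8, 12, 11, 9, 15))

ids <- sort(unique(unlist(dat)))

# one integer column per region; vapply checks that every column
# has exactly length(ids) entries
flags <- vapply(dat,
                function(region) as.integer(ids %in% region),
                integer(length(ids)))

# ids first, then the region columns (named after the list elements)
df <- data.frame(item = ids, flags)
df
```

vapply takes the column names from names(dat), so this scales to any number of regions without extra typing.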

If your original dat came from a CSV and the shorter columns contain NA values, you might want to insert na.omit as appropriate.
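For the CSV case, a rough sketch might look like this. It assumes the file was read into a data.frame in which the shorter columns are padded with NA (read.csv treats blank numeric fields as NA); here the data.frame is typed in by hand from the question's table so the example runs on its own, and result is an illustrative name.

```r
# stand-in for read.csv("c:/data/regions.csv"), values copied from the question
regions <- data.frame(
  region1 = c(10, 8, 13, 3, 17, 12, 4, 18, 1),
  region2 = c(5, 10, 15, 17, NA, NA, NA, NA, NA),
  region3 = c(5, 8, 12, 11, 9, 15, NA, NA, NA)
)

# drop the NA padding column by column, giving a list of vectors
dat <- lapply(regions, function(col) col[!is.na(col)])

ids  <- sort(unique(unlist(dat)))
cols <- lapply(dat, function(region) as.integer(ids %in% region))

result <- data.frame(item = ids, as.data.frame(cols))
result
```

With the question's data this includes item 13, which the shorter sample list in this answer leaves out.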

Using @mathematical.coffee's example and qdap:

dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
            region2=c(5,10,15,17),
            region3=c(5,8,12,11,9,15))

library(qdap)
matrix2df(t(mtabulate(dat)), "item")

If a vector can contain the same item more than once, the counts may exceed 1; in that case you may need to convert them to 0/1 indicators:

FUN <- function(x) as.numeric(x > 0)
matrix2df(apply(t(mtabulate(dat)), 2, FUN), "item")

Licensed under: CC-BY-SA with attribution