Convert "select all that apply" to binary choices

Question 1

Here are a couple of ways to do this, one with plyr and one with data.table.

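# Collect every distinct option from both columns.
# as.character() guards against factor columns, which strsplit() rejects.
all_ethnicities <- unique(c(
    unlist(strsplit(as.character(df$ethnicity), " ")),
    unlist(strsplit(as.character(df$ethnicity_other), " "))
    ))

# Row identifier to group by
df$id <- seq_len(nrow(df))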

library(plyr)

ddply(df, .(id), function(x)
      table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")),
                   levels = all_ethnicities)))

##    id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0

library(data.table)

DT <- data.table(df)

DT[, {
    as.list(
        table(
            factor(
                unlist(strsplit(paste(ethnicity, ethnicity_other), " ")),
                levels = all_ethnicities)
            )
        )
}, by = id]

##     id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
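If you would rather avoid both packages, the same idea can be sketched in base R. This is only a minimal sketch: the sample df below is my reconstruction from the outputs shown above (an assumption, since the original data is not reproduced here), and the names combined, choices, indicators and result are made up for illustration.

```r
# Sample data reconstructed from the outputs above (an assumption).
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga",
                "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# Combine both columns, treating a missing "other" entry as empty.
combined <- paste(df$ethnicity,
                  ifelse(is.na(df$ethnicity_other), "", df$ethnicity_other))

# One character vector of selected options per respondent.
# trimws() drops the trailing space left when "other" is empty.
choices <- strsplit(trimws(combined), " +")

all_ethnicities <- unique(unlist(choices))

# For each respondent, flag which options were selected;
# t() puts respondents in rows and options in columns.
indicators <- t(sapply(choices, function(x) as.integer(all_ethnicities %in% x)))
colnames(indicators) <- all_ethnicities

result <- cbind(df["age"], indicators)
result
```

Because the NA entries are replaced with an empty string before splitting, no stray "NA" token ever reaches the list of options.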

Question 2

Here is how I would do it:

First, you need something to hold each participant's ethnicities. I would build a list with one character vector per participant:

ethnicities <- strsplit(as.character(df$ethnicity), " ")

For your particular example, we would have:

> ethnicities
[[1]]
[1] "ngoni"

[[2]]
[1] "bemba"

[[3]]
[1] "lozi"  "tonga"

[[4]]
[1] "bemba" "tonga" "other"

[[5]]
[1] "bemba" "tongi"

Then iterate over that list to fill in your data.frame df:

# Assumes the ethnicity_* columns already exist in df (initialized to NA).
for (i in seq_along(ethnicities)) {
  for (eth in ethnicities[[i]]) {
    df[[paste0("ethnicity_", eth)]][i] <- 1
  }
}

The resulting value for df should be:

> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1

There are other ways to do it. You could also fold the two for loops into sapply calls, but I suspect the resulting code would be no faster and would be harder to read.
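For comparison, here is roughly what a sapply version might look like. It is only a sketch: the ethnicities list is written out by hand so the snippet runs on its own, and levels_seen and dummies are names I made up for illustration. Note that it builds a separate 0/1 matrix rather than filling df in place.

```r
# The list built above, written out by hand so this snippet is self-contained.
ethnicities <- list("ngoni", "bemba", c("lozi", "tonga"),
                    c("bemba", "tonga", "other"), c("bemba", "tongi"))

# Every distinct option that appears in at least one response.
levels_seen <- unique(unlist(ethnicities))

# Inner vapply: one TRUE/FALSE per respondent for a given option.
# Outer sapply: binds those vectors into a respondent-by-option matrix.
dummies <- sapply(levels_seen, function(opt) {
  vapply(ethnicities, function(resp) opt %in% resp, logical(1))
})

dummies <- dummies * 1L  # TRUE/FALSE -> 1/0
colnames(dummies) <- paste0("ethnicity_", levels_seen)
dummies
```

The nesting is the part I find harder to read than the two plain loops.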

Does this help?

Edit:

By the way, if you want 0 instead of NA in your data.frame, just change the code that initializes the added columns:

for (elm in z) {
  df[paste0("ethnicity_", elm)] <- 0  # instead of NA
}
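If the columns have already been filled in, another option is to replace the remaining NA values afterwards. A minimal sketch, using a small made-up df and a made-up dummy_cols variable for illustration:

```r
# Small hypothetical data frame with NA where an option was not selected.
df <- data.frame(age = c(24, 28),
                 ethnicity_ngoni = c(1, NA),
                 ethnicity_bemba = c(NA, 1))

# Restrict the replacement to the indicator columns so that
# genuine missing values elsewhere (e.g. in age) are left alone.
dummy_cols <- grep("^ethnicity_", names(df), value = TRUE)
df[dummy_cols][is.na(df[dummy_cols])] <- 0
df
```

Be careful with the prefix match if your data still contains a free-text column such as ethnicity_other, since it would be caught by the same pattern.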

Question 3

Here's an approach using concat.split.expanded from my "splitstackshape" package:

## Combine your "ethnicity" and "ethnicity_other" column
df$ethnicity <- paste(df$ethnicity, 
                      ifelse(is.na(df$ethnicity_other), "", 
                             as.character(df$ethnicity_other)))
## Drop the original "ethnicity_other" column
df$ethnicity_other <- NULL

## Split up the new "ethnicity" column
library(splitstackshape)
concat.split.expanded(df, "ethnicity", sep=" ", 
                      type="character", fill=0, drop=TRUE)
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1

The fill argument can be set to any value you like. It defaults to NA; I've set it to 0 here since I think that's what you're looking for.