Using R to parse out Surveymonkey csv files

https://stackoverflow.com/questions/7881766

11-02-2021
|

Pregunta

I'm trying to analyse a large survey created with surveymonkey which has hundreds of columns in the CSV file and the output format is difficult to use as the headers run over two lines.

Has anybody found a simple way of managing the headers in the CSV file so that the analysis is manageable ?
How do other people analyse results from Surveymonkey?

Thanks!

Solución 2

What I did in the end was print out the headers using libreoffice labeled as V1,V2, etc. then I just read in the file as

 m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)

and then just did the analysis against m1$V10, m1$V23 etc...

To get around the mess of multiple columns I used the following little function

# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
    # e.g. mcols(df, c(14:18))
        exp <- paste('df[,', cols, ']', sep='', collapse=',' )
        # this creates something like...
        # "df[,14],df[,15],df[,16],df[,17],df[,18]"
        # now we just want to do a paste of this expression...
        nexp <- paste(" paste(", exp, ", sep=' ')")
        # so now nexp looks something like...
        # " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep='')"
        # now we just need to parse this text... and eval() it...
        newcol <- eval(parse(text=nexp))
        newcol <- gsub('  *', ' ', newcol) # replace duplicate spaces by a single one
        newcol <- gsub('^ *', '', newcol) # remove leading spaces
        gsub(' *$', '', newcol) # remove trailing spaces
}
# mcols(df, c(14:18))

No doubt somebody will be able to clean this up!

To tidy up Likert-like scales I used:

# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
tidylik4 <- function(x) {
  xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
  y <- ifelse(x == '', NA, x)
  ordered(y, levels=xlevels)
}

for (i in 44:52) {
  m2[,i] <- tidylik4(m2[,i])
}

Feel free to comment as no doubt this will come up again!

Otros consejos

You can export it in a convenient form that fits R from Surveymonkey, see download responses in 'Advanced Spreadsheet Format'

surveymonkey export

As of November 2013, the webpage layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You'll get raw data as first row = question headers / each following row = 1 response, possibly split between multiple files if you have many responses / questions.

enter image description here

I have to deal with this pretty frequently, and having the headers on two columns is a bit painful. This function fixes that issue so that you only have a 1 row header to deal with. It also joins the multipunch questions so you have top: bottom style naming.

#' @param x The path to a surveymonkey csv file
fix_names <- function(x) {
  rs <- read.csv(
    x,
    nrows = 2,
    stringsAsFactors = FALSE,
    header = FALSE,
    check.names = FALSE, 
    na.strings = "",
    encoding = "UTF-8"
  )

  rs[rs == ""] <- NA
  rs[rs == "NA"] <- "Not applicable"
  rs[rs == "Response"] <- NA
  rs[rs == "Open-Ended Response"] <- NA

  nms <- c()

  for(i in 1:ncol(rs)) {

    current_top <- rs[1,i]
    current_bottom <- rs[2,i]

    if(i + 1 < ncol(rs)) {
      coming_top <- rs[1, i+1]
      coming_bottom <- rs[2, i+1]
    }

    if(is.na(coming_top) & !is.na(current_top) & (!is.na(current_bottom) | grepl("^Other", coming_bottom)))
      pre <- current_top

    if((is.na(current_top) & !is.na(current_bottom)) | (!is.na(current_top) & !is.na(current_bottom)))
      nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")

    if(!is.na(current_top) & is.na(current_bottom))
      nms[i] <- current_top

  }


  nms
}

If you note, it returns the names only. I typically just read.csv with ...,skip=2, header = FALSE, save to a variable and overwrite the names of the variable. It also helps ALOT to set your na.strings and stringsAsFactor = FALSE.

nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms

The issue with the headers is that columns with "select all that apply" will have a blank top row, and the column heading will be the row below. This is only an issue for those types of questions.

With this in mind, I wrote a loop to go through all columns and replace the column names with the value from the second row if the column name was blank- which has a character length of 1.

Then, you can kill the second row of the data and have a tidy data frame.

for(i in 1:ncol(df)){
newname <- colnames(df)[i]
if(nchar(newname) < 2){
colnames(df)[i] <- df[1,i]
} 

df <- df[-1,]

Coming to the party late, but this is still an issue and the best workaround I've found is using a function to paste the column names and sub-column names together, based on repeating values.

For instance, if exporting to .csv, the repeated column names will automatically be replaced with an X in RStudio. If exporting to .xlsx, the repeated value will be ....

Here's a base R solution:

sm_header_function <- function(x, rep_val){
  
  orig <- x
  
  sv <- x
  sv <- sv[1,]
  sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
  sv <- t(sv)
  sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
  names(sv)[1] <- "name"
  names(sv)[2] <- "value"
  sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
  sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
  sv$new_value <- paste0(sv$new_value, " ", sv$value)
  new_names <- as.character(sv$new_value)
  colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
  orig <- orig[-c(1),]
  return(orig)
}

sm_header_function(df, "X")
sm_header_function(df, "...")

With some sample data, the change in column names would look like this:

Original export from SurveyMonkey:

> colnames(sample)
 [1] "Respondent ID"                                 "Please provide your contact information:"      "...11"                                        
 [4] "...12"                                         "...13"                                         "...14"                                        
 [7] "...15"                                         "...16"                                         "...17"                                        
[10] "...18"                                         "...19"                                         "I wish it would have snowed more this winter."

Cleaned export from SurveyMonkey:

> colnames(sample_clean)
 [1] "Respondent ID"                                            "Please provide your contact information: Name"           
 [3] "Please provide your contact information: Company"         "Please provide your contact information: Address"        
 [5] "Please provide your contact information: Address 2"       "Please provide your contact information: City/Town"      
 [7] "Please provide your contact information: State/Province"  "Please provide your contact information: ZIP/Postal Code"
 [9] "Please provide your contact information: Country"         "Please provide your contact information: Email Address"  
[11] "Please provide your contact information: Phone Number"    "I wish it would have snowed more this winter. Response"

Sample data:

structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621, 
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin", 
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale", 
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's", 
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2", 
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia", 
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa", 
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.", 
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104", 
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country", 
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com", 
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com", 
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646", 
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944", 
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response", 
"Strongly disagree", "Strongly agree", "Neither agree nor disagree", 
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

How about the following: use read.csv() with header=FALSE. Make two arrays, one with the two lines of headings and one with the answers to the survey. Then paste() the two rows/sentences of together. Finally, use colnames().

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow