Question

R version: 2.12.2 / OS: Windows XP

Data that I need to query and manipulate is stored in a SPSS .sav file (size ~ 112MB) and I have a very basic knowledge of SPSS syntax. I am much more comfortable with R (and generally prefer it for data manipulation and analysis), so I saved the SPSS file as a tab-delimited file (size ~ 85MB) and have attempted to query it using functionality from the RODBC package.

Using the 'Data Sources (ODBC)' application within the Administrative Tools group on Windows Control Panel, I setup a database (Windows folder) which has tables (tab-delimited files) with the Microsoft Text Driver.

The current table of interest has a 'Date' field (called 'DateTime') where dates are formatted '%Y/%m/%d'. When I first tried to query the database with conditions on this field, I was returned a result set with zero observations. So, I changed 'DateTime' to datatype 'Character' and tried my query again (to no avail).

I am able to successfully query the database, i.e.

sqlQuery(channel = cn, query = "SELECT * FROM large_table.txt", as.is = TRUE, stringsAsFactors = FALSE, max = 20)

returns results. The following queries return result sets with zero observations (I have verified that there is at least one record with the the value '2011/04/01' in its 'DateTime' field).

sqlQuery(channel = cn, query = "SELECT DateTime FROM large_table.txt WHERE DateTime = {d '2011-04-01'}", as.is = TRUE, stringsAsFactors = FALSE)

sqlQuery(channel = cn, query = "SELECT DateTime FROM large_table.txt WHERE DateTime = '2011/04/01'", as.is = TRUE, stringsAsFactors = FALSE)

Can anyone tell me why the result set has no observations?

Any suggestions to improve my database connection or query are greatly appreciated. I prefer to use the RODBC package instead of another database-query package, but am willing to try something new if necessary.

Was it helpful?

Solution

I'd use the 'memisc' package for this. set up spss.system.file() importers for your 3 SPSS files, subset using the memisc version to load in to R just the variables that are needed, including handling user-missing levels.

There are some oddities around how SPSS files are set up - it is fairly frequent for SPSS files with numeric variables to have user-missings set, which may need sorting at the R end into an interval scaling with appropriate NAs.

OTHER TIPS

You can save yourself a lot of hassle by reading the SPSS file directly into R.

library(foreign)
dat <- read.spss("filename.sav", to.data.frame=TRUE)

This will return a data.frame that you can analyse using all of the normal R tools.

As a side benefit, you will also have the questionnaire text in the attributes of your data.frame.

See ?read.spss for more details.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top