Question

I bring in a data set using the following command:

rbc <- read.csv("rbc hgb.csv", header = TRUE)
data <- rbc[rbc$Result_Value_After != "NULL",]

For some reason rbc$Result_Value_After gets treated as a factor, so I issue the following command:

data$Result_Value_After <- as.numeric(data$Result_Value_After)

str(data) tells me the column is now of type num, but the values are wrong. The original values, stored as factor levels, were decimals such as 7.2. After the conversion, 7.2 becomes 72, which is way off. Any ideas on how to fix this?


Solution

Here's a possible workaround for the issue of column classification upon calling read.csv.

Say I don't want to mess around with changing classes after reading data into R. If I want one column to be character and the others as the default class, I can use readLines to quickly read the first line of the .csv (i.e. the column header line, if present) and set up a vector to be passed to the colClasses argument of read.csv.

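Here's a simple function:

col.classes <- function(csv, col, class) {
    # Read only the first line of the file (the header line)
    header <- readLines(csv, n = 1)
    # Split it into column names, dropping any quotes around them
    n <- gsub('"', '', unlist(strsplit(header, ",")), fixed = TRUE)
    # Requested class for the named column(s), NA (default) for the rest
    ifelse(n %in% col, class, NA)
}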

To show how this works, suppose I have a file named cats.csv (and it just so happens that I do), and I know I want the weight column to be class character and the rest of the columns as the default class. Keep in mind that colClasses can be a character vector, and for elements that are NA, the class of the corresponding column is worked out automatically, as if the file were read without colClasses.
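
Since cats.csv itself is not available to the reader, here is a minimal sketch using a made-up stand-in file (the rows are invented for illustration). It passes read.csv the kind of vector the function returns, written out by hand, to show how the NA entries behave:

```r
# Hypothetical stand-in for cats.csv, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("cats,colour,length,weight,mu",
             "1,black,30,4,2",
             "2,white,28,5,3"), tmp)

# NA entries leave the class to read.csv; only weight is forced
cc <- c(NA, NA, NA, "character", NA)
cats <- read.csv(tmp, colClasses = cc)

# Inspect the resulting class of every column
sapply(cats, class)
```

On R 4.0.0 or later the colour column should come back as character rather than factor, because stringsAsFactors now defaults to FALSE.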

View the names of the columns in the file

> names(read.csv('cats.csv'))
## [1] "cats"   "colour" "length" "weight" "mu"

View the default classes from read.csv

> sapply(read.csv('cats.csv'), class)
##      cats    colour    length    weight       mu 
## "integer"  "factor" "integer" "integer" "integer" 

Sample Runs:

(1) Class the length column as numeric upon calling read.csv, while leaving others as their respective defaults

> cc1 <- col.classes('cats.csv', 'length', 'numeric')
> rr1 <- read.csv('cats.csv', colClasses = cc1)
> sapply(rr1, class)
##      cats    colour    length    weight       mu
## "integer"  "factor" "numeric" "integer" "integer"

(2) Similarly, class the weight column as character

> cc2 <- col.classes('cats.csv', 'weight', 'character')
> rr2 <- read.csv('cats.csv', colClasses = cc2)
> sapply(rr2, class)
##        cats      colour      length      weight         mu
##   "integer"    "factor"   "integer" "character"   "integer"
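
As a possible alternative to the helper, read.csv also accepts a named colClasses vector, matching the names against the column names and leaving unnamed columns at their defaults. This is a different route from the function above, sketched here against an invented stand-in file:

```r
# Hypothetical stand-in for cats.csv, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("cats,colour,length,weight,mu",
             "1,black,30,4,2",
             "2,white,28,5,3"), tmp)

# Named colClasses: names are matched to column names,
# columns that are not named keep their default class
rr <- read.csv(tmp, colClasses = c(length = "numeric", weight = "character"))

sapply(rr, class)
```

This form also allows a different class per column in a single call, which the helper does only when the same class is wanted for every listed column.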

Not sure if that helps you at all. I find it useful when I want a mixture of column classes that might otherwise be clunky and frustrating to change once the data has already been loaded into R.
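
For the data in the question, the same idea might look roughly like the sketch below. The file contents and the ID column are invented stand-ins for "rbc hgb.csv"; only the Result_Value_After column and the NULL marker come from the question. The last lines show the usual repair for a column that has already been read as a factor:

```r
# Hypothetical stand-in for "rbc hgb.csv"; the ID column is invented
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Result_Value_After",
             "1,7.2",
             "2,NULL",
             "3,10.5"), tmp)

# Treat the literal string NULL as missing and force the column to numeric
cc <- c(NA, "numeric")
rbc <- read.csv(tmp, header = TRUE, na.strings = "NULL", colClasses = cc)

# Drop the rows that had NULL; no conversion step is needed afterwards
data <- rbc[!is.na(rbc$Result_Value_After), ]
data$Result_Value_After

# If a column is already a factor, convert via character first:
# as.numeric() applied directly to a factor returns the internal level codes
f <- factor(c("7.2", "10.5", "7.2"))
as.numeric(f)                # level codes, not the original values
as.numeric(as.character(f))  # the original values
```

The level codes are what make a value like 7.2 turn into an unrelated number such as 72 in the question: as.numeric returns each value's position among the factor levels, not the number itself.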

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow