WK,MND,CS,SHP,RevCY,RevLY,TCY,TLY,ACY,ALY

"2,JAN,GER,""Victoria's Secrets"",29307,25419,841,768,2320,1755"

2,JAN,KAP,Brand Shop,2027,-,95,0,175,-0

2,JAN,KAP,Kapp‚ Drugstore West,89768,78824,3309,3052,6197,5634

2,JAN,KAP,Kapp‚ P&C Centraal,680019,640951,8709,8116,19450,18385

2,JAN,KAP,Kapp‚ Sunglasses Centraal,49216,43940,464,421,550,478

2,JAN,KAP,Kapp‚ Sunglasses Schengen,25721,26592,306,318,333,378

2,JAN,KAP,Kapp‚ Sunglasses West,50280,53089,477,510,566,_78

I always seem to struggle with getting data into the right structure. My data looks like the sample above (the file has over 10K rows). When loading it, I want the columns to have specific classes.

When I type:

RIS <- read.table("RIS.txt", sep=",", header=T, fill=T, 
    colClasses=c("integer", "character", "factor", "factor", rep("numeric",6)))

I get an error message:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
    scan() expected 'an integer', got '"2'

I think this is because the WK column contains stray characters, and the same may be true of other columns.

Can anyone help me get this data loaded correctly and clean the dataset so that each column ends up with the right class?


Solution

You have a typical data cleansing problem. In my experience, 80% of the project time for a typical analytical task is consumed by data preparation.

Given your data sample, try the following:

  • Use read.csv() with the argument quote="". This disables quote handling, so quote marks are read as ordinary characters; you may have to remove them later.
  • Use a regular expression to remove any garbage characters in the numeric columns (e.g. " or _), then coerce those columns to numeric.
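As a quick check of the first point, the sketch below counts the fields per line when quotes are treated as ordinary characters. It uses base R's count.fields(); the three-line vector `lines` is a hypothetical subset of the question's data, not the real file.

```r
# Hypothetical three-line subset in the same shape as the question's file.
lines <- c(
  "WK,MND,CS,SHP,RevCY,RevLY,TCY,TLY,ACY,ALY",
  "\"2,JAN,GER,\"\"Victoria's Secrets\"\",29307,25419,841,768,2320,1755\"",
  "2,JAN,KAP,Brand Shop,2027,-,95,0,175,-0"
)

# With quote = "" the quote marks are not special, so every line is split
# on commas only and should yield the same number of fields as the header.
n_fields <- count.fields(textConnection(lines), sep = ",", quote = "")
print(n_fields)
```

If every entry matches the header's field count, the file can be read with quote="" and the stray quote marks cleaned up afterwards.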

Try this:

data <- "
WK,MND,CS,SHP,RevCY,RevLY,TCY,TLY,ACY,ALY
\"2,JAN,GER,\"\"Victoria's Secrets\"\",29307,25419,841,768,2320,1755\"
2,JAN,KAP,Brand Shop,2027,-,95,0,175,-0
2,JAN,KAP,Kapp‚ Drugstore West,89768,78824,3309,3052,6197,5634
2,JAN,KAP,Kapp‚ P&C Centraal,680019,640951,8709,8116,19450,18385
2,JAN,KAP,Kapp‚ Sunglasses Centraal,49216,43940,464,421,550,478
2,JAN,KAP,Kapp‚ Sunglasses Schengen,25721,26592,306,318,333,378
2,JAN,KAP,Kapp‚ Sunglasses West,50280,53089,477,510,566,_78
"

Now read the data:

x <- read.csv(text=data, quote="", header=TRUE)

Start the cleaning process:

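numericCols <- c(1, 5:10)  # WK plus the six value columns
# Strip stray -, _ and " characters, then coerce to numeric.
# Note: this also removes minus signs, so it assumes no negative values.
x[numericCols] <- lapply(x[numericCols], function(col) as.numeric(gsub("[-_\"]", "", col)))
x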

The result:

  WK MND  CS                       SHP  RevCY  RevLY  TCY  TLY   ACY   ALY
1  2 JAN GER    ""Victoria's Secrets""  29307  25419  841  768  2320  1755
2  2 JAN KAP                Brand Shop   2027     NA   95    0   175     0
3  2 JAN KAP      Kapp‚ Drugstore West  89768  78824 3309 3052  6197  5634
4  2 JAN KAP        Kapp‚ P&C Centraal 680019 640951 8709 8116 19450 18385
5  2 JAN KAP Kapp‚ Sunglasses Centraal  49216  43940  464  421   550   478
6  2 JAN KAP Kapp‚ Sunglasses Schengen  25721  26592  306  318   333   378
7  2 JAN KAP     Kapp‚ Sunglasses West  50280  53089  477  510   566    78
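The quote marks are still present in SHP, and the columns do not yet have the classes asked for in the question. A minimal sketch of those last two steps follows. It assumes a shortened, ASCII-only sample (the string `raw` and the third shop name are illustrative, not the real file) and repeats the cleaning step so that it runs on its own.

```r
# Shortened, ASCII-only sample in the shape of the question's data (illustrative).
raw <- "WK,MND,CS,SHP,RevCY,RevLY,TCY,TLY,ACY,ALY
\"2,JAN,GER,\"\"Victoria's Secrets\"\",29307,25419,841,768,2320,1755\"
2,JAN,KAP,Brand Shop,2027,-,95,0,175,-0
2,JAN,KAP,Sunglasses West,50280,53089,477,510,566,_78"

x <- read.csv(text = raw, quote = "", header = TRUE, stringsAsFactors = FALSE)

# Same numeric cleaning as above: drop -, _ and " before coercing.
numericCols <- c(1, 5:10)
x[numericCols] <- lapply(x[numericCols],
                         function(col) as.numeric(gsub("[-_\"]", "", col)))

# Remove the leftover quote marks from the shop names.
x$SHP <- gsub("\"", "", x$SHP, fixed = TRUE)

# Apply the classes requested in the question:
# integer, character, factor, factor, then numeric for the rest.
x$WK  <- as.integer(x$WK)
x$MND <- as.character(x$MND)
x$CS  <- factor(x$CS)
x$SHP <- factor(x$SHP)

str(x)
```

Converting to factor only after the text has been cleaned avoids ending up with factor levels that still contain the stray quote marks.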
Licensed under CC-BY-SA with attribution