You have a typical data cleansing problem - in my experience, 80% of the project time for a typical analytical task gets consumed by data preparation.
Given your data sample, try the following:
- Use read.csv() with the argument quote="". This will ignore all of your quote marks, but of course you may have to remove these later.
- Use a regular expression to remove any garbage characters in numeric columns (e.g. " or _) and then coerce into numeric.
For example, start with your sample as a string:
data <- "
WK,MND,CS,SHP,RevCY,RevLY,TCY,TLY,ACY,ALY
\"2,JAN,GER,\"\"Victoria's Secrets\"\",29307,25419,841,768,2320,1755\"
2,JAN,KAP,Brand Shop,2027,-,95,0,175,-0
2,JAN,KAP,Kapp‚ Drugstore West,89768,78824,3309,3052,6197,5634
2,JAN,KAP,Kapp‚ P&C Centraal,680019,640951,8709,8116,19450,18385
2,JAN,KAP,Kapp‚ Sunglasses Centraal,49216,43940,464,421,550,478
2,JAN,KAP,Kapp‚ Sunglasses Schengen,25721,26592,306,318,333,378
2,JAN,KAP,Kapp‚ Sunglasses West,50280,53089,477,510,566,_78
"
Now read the data:
x <- read.csv(text=data, quote="", header=TRUE)
Start the cleaning process:
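# Columns that should be numeric: WK and RevCY through ALY
numericCols <- c(1, 5:10)
# Strip stray -, _ and " characters, then coerce; a lone "-" becomes NA (minus signs are dropped too)
x[numericCols] <- lapply(x[numericCols], function(col) as.numeric(gsub("[-_\"]", "", col)))
x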
The result:
  WK MND  CS                       SHP  RevCY  RevLY  TCY  TLY   ACY   ALY
1  2 JAN GER    ""Victoria's Secrets""  29307  25419  841  768  2320  1755
2  2 JAN KAP                Brand Shop   2027     NA   95    0   175     0
3  2 JAN KAP      Kapp‚ Drugstore West  89768  78824 3309 3052  6197  5634
4  2 JAN KAP        Kapp‚ P&C Centraal 680019 640951 8709 8116 19450 18385
5  2 JAN KAP Kapp‚ Sunglasses Centraal  49216  43940  464  421   550   478
6  2 JAN KAP Kapp‚ Sunglasses Schengen  25721  26592  306  318   333   378
7  2 JAN KAP     Kapp‚ Sunglasses West  50280  53089  477  510   566    78
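
If the leftover quote characters in SHP bother you, or if a minus sign can be meaningful in your numeric columns, something along these lines may help. It is only a sketch: the two-row data frame is a stand-in for x, and cleanNumeric is a hypothetical helper name, not a base R function. Unlike a character class that deletes every -, it keeps a leading minus sign and still turns a lone - into NA.

```r
# Two-row stand-in for the data frame x read above (illustrative values only)
x <- data.frame(
  SHP = c("\"\"Victoria's Secrets\"\"", "Brand Shop"),
  ALY = c("1755\"", "-0"),
  stringsAsFactors = FALSE
)

# Remove the literal double quotes that quote="" left in the text column
x$SHP <- gsub("\"", "", x$SHP, fixed = TRUE)

# Hypothetical helper: keep digits, the decimal point and a leading minus sign
cleanNumeric <- function(v) {
  v <- gsub("[^0-9.-]", "", as.character(v))         # drop quotes, underscores, etc.
  negative <- grepl("^-", v)                         # remember a leading minus
  out <- as.numeric(gsub("-", "", v, fixed = TRUE))  # "" becomes NA without a warning
  out[negative] <- -out[negative]
  out
}

x$ALY <- cleanNumeric(x$ALY)
x
```

Your sample has no genuinely negative values, so the simpler character class works there; the helper only matters if a minus sign can be real data.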