Question

I know many posts have already answered similar questions like mine, but I've tried to figure it out for 2 days now and it seems as if I'm not seeing the picture here...

I got this csv file looking like this:

Werteformat:                wertabh. (Q)
Werte:  
01.01.76 00:00  0,363
02.01.76 00:00  0,464
...
31.12.10 00:00  1,03
01.01.11 00:00  Lücke

I wanna create a timeline with the data, but I can't import the csv properly.

I've tried this so far:

data<-read.csv2(file, 
            header = FALSE, 
            sep = ";", 
            quote="\"", 
            dec=",", 
            col.names=c("Datum", "Abfluss"), 
            skip=2, 
            nrows=length(strs)-2, 
            colClasses=c("date","numeric"))

But then I get

"Fehler in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() erwartete 'a real', bekam 'L�cke'"

So I deleted the colClasses argument and it works; I got rid of all the unwanted rows. But everything is stored as factors, so I use as.numeric:

Abfluss1<-as.numeric(data$Abfluss)

Now I can calculate with Abfluss1, but the values are totally different from those in the original csv...

Abfluss1
    [1]   99  163  250  354  398  773  927  844  796  772 1010 1468 1091  955  962  933  881  844  803  772  773  803 1006  969  834  779  755
   [28]  743  739 

Where did I go wrong?! I really would appreciate some helpful hints. By the way, the files I'm working on can be downloaded here: http://ehyd.gv.at/#

Just click on one of these blue-ish triangles and download "Q-Tagesmittel"


Solution

First of all, there seems to be a problem with the file encoding. The downloaded file is apparently Latin-1 encoded, which is not recognized correctly; that is why the message says L�cke and not Lücke:

encoding = "latin1"
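As a small illustration of the encoding point (a sketch that uses hand-written bytes rather than the actual file), the letter "ü" is the single byte 0xFC in Latin-1. Converting those bytes to UTF-8 with iconv recovers the intended character; the variable names here are made up for the example.

```r
# Bytes of "Lücke" as they would be stored in a Latin-1 file (0xFC is "ü").
raw_bytes <- as.raw(c(0x4c, 0xfc, 0x63, 0x6b, 0x65))
latin1_text <- rawToChar(raw_bytes)

# Tell iconv the input is Latin-1 and ask for UTF-8 output.
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# Unicode code points after conversion; 252 is U+00FC, i.e. "ü".
utf8ToInt(utf8_text)
# 76 252 99 107 101
```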

Secondly, your example is not reproducible: the variable strs is not declared. From what I understand, you want to skip the first 28 lines (I may be wrong) and leave out the last one, so in total:

nrows = length( readLines( file ) ) - 29

Finally, you bumped into this common R issue: "How to convert a factor to an integer/numeric without loss of information?". The entire column is interpreted as a character vector because not all elements could be interpreted as numeric, and when a character vector is added to a data.frame it is by default (in R versions before 4.0.0) cast to a factor column. Calling as.numeric on a factor returns the internal level codes, not the printed values, which is why your numbers look so different. If you specify the correct range of lines this is not necessary, but you can avoid the conversion with

stringsAsFactors = FALSE
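To make the factor issue concrete, here is a minimal sketch with made-up values in the same comma-decimal style; "gap" is only a stand-in for the non-numeric marker in the file. It contrasts as.numeric on the factor itself with a conversion that goes through as.character first.

```r
# A text column that ended up as a factor; "gap" is a placeholder marker.
abfluss <- factor(c("0,464", "1,03", "0,363", "gap"))

# as.numeric() on a factor yields the integer level codes, not the values.
codes <- as.numeric(abfluss)

# Go through character, swap the decimal comma for a point, then convert.
# The non-numeric marker becomes NA; the coercion warning is suppressed here.
values <- suppressWarnings(
  as.numeric(sub(",", ".", as.character(abfluss), fixed = TRUE))
)

codes   # small integers indexing levels(abfluss)
values  # the actual measurements, with NA for the marker
```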

So in total:

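# Read all lines once so the total line count is known
f <- readLines("Q-Tagesmittel-204586.csv")
df <- read.csv2(
  text   = f,
  header = FALSE,
  sep    = ";",
  quote  = "\"",
  dec    = ",",
  skip   = 28,                      # skip the 28 lines before the data
  col.names = c("Datum", "Abfluss"),
  nrows  = length(f) - 29,          # 28 skipped lines plus the last line
  encoding = "latin1",
  stringsAsFactors = FALSE
)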
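A possible alternative, not part of the approach above and only a sketch: read.csv2 accepts an na.strings argument, so a gap marker can be declared as missing instead of being cut off via nrows. The inline data and the ASCII marker "Luecke" are assumptions chosen to keep the example independent of file encoding and locale.

```r
# Inline stand-in for the data part of the file; "Luecke" replaces "Lücke".
txt <- c(
  "30.12.2010 00:00:00;1,03",
  "31.12.2010 00:00:00;0,98",
  "01.01.2011 00:00:00;Luecke"
)

# Treat the marker as NA so the column can be parsed as numeric directly.
df_na <- read.csv2(
  text = txt,
  header = FALSE,
  col.names = c("Datum", "Abfluss"),
  na.strings = "Luecke",
  stringsAsFactors = FALSE
)

str(df_na)
```

With the real file, the marker passed to na.strings would have to match the text exactly as R sees it after decoding.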

Oh, and in case you want to convert the Datum column to a date-time object as a next step, one way to do this would be

df$Datum <- strptime( df$Datum, "%d.%m.%Y %H:%M:%S" )

str(df)
'data.frame':   12784 obs. of  2 variables:
 $ Datum  : POSIXlt, format: "1976-01-01" "1976-01-02" "1976-01-03" "1976-01-04" ...
 $ Abfluss: num  0.691 0.799 0.814 0.813 0.795 0.823 0.828 0.831 0.815 0.829 ...
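As a variation on the date conversion (a sketch with hand-typed timestamps, not the real column), as.POSIXct takes the same format string as strptime but returns a compact numeric-based class, which tends to be more convenient inside a data.frame than POSIXlt. The second call assumes the two-digit-year layout shown in the question's excerpt; the tz = "UTC" setting is also an assumption.

```r
# Timestamps in the layout that the format string above expects.
datum <- c("01.01.1976 00:00:00", "02.01.1976 00:00:00")
parsed <- as.POSIXct(datum, format = "%d.%m.%Y %H:%M:%S", tz = "UTC")
format(parsed, "%Y-%m-%d")

# Two-digit year and no seconds, as in the excerpt from the question.
# %y maps 69-99 to 19xx and 00-68 to 20xx.
short <- as.POSIXct("01.01.76 00:00", format = "%d.%m.%y %H:%M", tz = "UTC")
format(short, "%Y-%m-%d")
```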
Licensed under: CC-BY-SA with attribution