Question

I am trying to read an ASCII file in R using read.table. The file looks like the following:

  DAILY MAXIMUM TEMPARATURE  
  YEAR DAY MT DT   LAT. 66.5   67.5   68.5   69.5   70.5
  1969 001 01 01   6.5  99.90  99.90  31.90  99.90  99.90 
  1969 001 01 01   7.5  99.90  20.90  99.90  99.90  23.90
  1969 001 01 01   8.5  99.90  99.90  30.90  99.90  18.90
  .....
  ..... 
  YEAR DAY MT DT   LAT. 66.5   67.5   68.5   69.5   70.5
  1969 001 01 02   6.5  21.90  99.90  99.90  99.90  99.90 
  1969 001 01 02   7.5  99.90  33.90  99.90  99.90  99.90
  1969 001 01 02   8.5  99.90  99.90  15.90  99.90  99.90
  .....
  .....
  YEAR DAY MT DT   LAT. 66.5   67.5   68.5   69.5   70.5
  1969 001 01 03   6.5  99.90  99.90  99.90  99.90  99.90 
  1969 001 01 03   7.5  99.90  99.90  99.90  99.90  99.90
  1969 001 01 03   8.5  99.90  99.90  99.90  99.90  99.90
  .....
  .....

I read it using:

inp=read.table("MAXT1969.TXT",skip=1,header=T)

The file has been read and the contents are in the variable inp.

I have 2 questions -

I. Accessing a value in the first 5 columns prints some extra information along with the desired output. For example, inp[1,5] gives the following output:

> inp[1,5]
  "[1] 6.5
  33 Levels: 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 ... LAT."

I don't want the extra info, only the value. Where am I going wrong?

II. After every 32 rows there is a header line (YEAR DAY ....). How can I skip these headers that recur at regular intervals?


Solution

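Try comment.char="Y". read.table then discards everything from a "Y" to the end of the line, which drops the repeated header lines because they begin with YEAR and no data line contains a Y. stringsAsFactors=FALSE prevents strings from being converted to factors.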

inp <- read.table("MAXT1969.TXT", skip = 1, header=FALSE, comment.char="Y", stringsAsFactors=FALSE )

# Read just the header row, as text, to get the column names
cols <- read.table("MAXT1969.TXT", header=FALSE, skip=1, nrows=1, colClasses="character")
names(inp) <- unlist(cols, use.names=FALSE)

inp
##   YEAR DAY MT DT LAT. 66.5 67.5 68.5 69.5 70.5
## 1 1969   1  1  1  6.5 99.9 99.9 31.9 99.9 99.9
## 2 1969   1  1  1  7.5 99.9 20.9 99.9 99.9 23.9
## 3 1969   1  1  1  8.5 99.9 99.9 30.9 99.9 18.9
## 4 1969   1  1  2  6.5 21.9 99.9 99.9 99.9 99.9
## 5 1969   1  1  2  7.5 99.9 33.9 99.9 99.9 99.9
## 6 1969   1  1  2  8.5 99.9 99.9 15.9 99.9 99.9
## 7 1969   1  1  3  6.5 99.9 99.9 99.9 99.9 99.9
## 8 1969   1  1  3  7.5 99.9 99.9 99.9 99.9 99.9
## 9 1969   1  1  3  8.5 99.9 99.9 99.9 99.9 99.9

# With the repeated header lines skipped, each column holds only numbers and is read as numeric.
inp[1, 5]
## [1] 6.5

OTHER TIPS

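Question 1: This means that your value has been read as a factor, i.e. a categorical variable, because the repeated header lines put text such as LAT. into the column. Convert the column with as.numeric(as.character(x)); calling as.numeric directly on a factor returns the internal level codes, not the values. Alternatively, once the header lines are out of the way, you can use the colClasses argument to read.table to specify the column types directly.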
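
To illustrate the factor conversion, here is a minimal sketch. The vector lat is a made-up stand-in for the LAT. column as it might look when header text has been mixed into the numbers; it is not taken from the actual file.

```r
# Hypothetical stand-in for a column that picked up the header text "LAT."
lat <- factor(c("6.5", "7.5", "LAT.", "8.5"))

# as.numeric() on the factor itself returns the internal level codes,
# not the numbers that are printed
codes <- as.numeric(lat)

# Going through character first recovers the values;
# the header text becomes NA (the warning is suppressed here)
values <- suppressWarnings(as.numeric(as.character(lat)))

print(codes)
print(values)
```

The NA entries mark the rows that came from header lines, so they can be dropped afterwards, for example with na.omit.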

Question 2: You can read the lines using readLines, find the lines that start with YEAR using grep, delete those, and read the edited output into a data.frame using read.table(textConnection(edited_data)). I would use @geektrader's solution instead, but I wanted to add this for completeness' sake.
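
The readLines approach could be sketched roughly as follows. The sample lines and the temporary file are assumptions that stand in for MAXT1969.TXT so the snippet runs on its own. It uses the text argument of read.table, which does the same job as wrapping the lines in textConnection.

```r
# Hypothetical sample standing in for MAXT1969.TXT
sample_lines <- c(
  "  DAILY MAXIMUM TEMPARATURE  ",
  "  YEAR DAY MT DT   LAT. 66.5   67.5   68.5   69.5   70.5",
  "  1969 001 01 01   6.5  99.90  99.90  31.90  99.90  99.90 ",
  "  1969 001 01 01   7.5  99.90  20.90  99.90  99.90  23.90",
  "  YEAR DAY MT DT   LAT. 66.5   67.5   68.5   69.5   70.5",
  "  1969 001 01 02   6.5  21.90  99.90  99.90  99.90  99.90 "
)
path <- tempfile(fileext = ".TXT")
writeLines(sample_lines, path)

# Read the raw lines and drop the title line
raw <- readLines(path)[-1]

# Flag every repeated header line (leading blanks, then YEAR)
is_header <- grepl("^[[:space:]]*YEAR", raw)

# Take the column names from the first header line
col_names <- strsplit(trimws(raw[is_header][1]), "[[:space:]]+")[[1]]

# Parse only the data lines; check.names = FALSE keeps names such as "66.5"
inp <- read.table(text = raw[!is_header], header = FALSE,
                  col.names = col_names, check.names = FALSE)

str(inp)
```

Because no header text reaches read.table, every column is parsed as a number and inp[1, 5] returns a plain numeric value.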

Another solution is to convert every column to numeric, which turns the repeated header rows into NAs (with a warning), and then omit those rows:

inp = as.data.frame(na.omit(apply(apply(inp, 2, as.character), 2, as.numeric)))
Licensed under: CC-BY-SA with attribution