How do I read information from text files?

https://stackoverflow.com//questions/23038367

21-12-2019
|

Question

I have hundreds of text files with the following information in each file:

*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24

Now, I want to read all these files, and "collect" Sen's Statistics from each file (eg. .24) and compile into one file along with the corresponding file names. I have to do it in R.

I have worked with CSV files but not sure how to use text files.

This is the code I am using now:

require(gtools)
GG <- grep("*.txt", list.files(), value = TRUE)
GG<-mixedsort(GG)
S <- sapply(seq(GG), function(i){
X <- readLines(GG[i])
grep("SEN SLOPE", X, value = TRUE)
})
spl <- unlist(strsplit(S, ".*[^.0-9]"))
SenStat <- as.numeric(spl[nzchar(spl)])
SenStat<-data.frame( SenStat,file = GG)
write.table(SenStat, "sen.csv",sep = ", ",row.names = FALSE)

The current code is not able to read all values correctly and giving this error:

Warning message:
NAs introduced by coercion

Also I am not getting the file names the other column of Output. Please help!

Diagnosis 1

The code is reading the = sign as well. This is the output of print(spl)

 [1] ""       "5.55"   ""       "-.18"   ""       "3.08"   ""       "3.05"   ""       "1.19"   ""       "-.32"  
[13] ""       ".22"    ""       "-.22"   ""       ".65"    ""       "1.64"   ""       "2.68"   ""       ".10"   
[25] ""       ".42"    ""       "-.44"   ""       ".49"    ""       "1.44"   ""       "=-1.07" ""       ".38"   
[37] ""       ".14"    ""       "=-2.33" ""       "4.76"   ""       ".45"    ""       ".02"    ""       "-.11"  
[49] ""       "=-2.64" ""       "-.63"   ""       "=-3.44" ""       "2.77"   ""       "2.35"   ""       "6.29"  
[61] ""       "1.20"   ""       "=-1.80" ""       "-.63"   ""       "5.83"   ""       "6.33"   ""       "5.42"  
[73] ""       ".72"    ""       "-.57"   ""       "3.52"   ""       "=-2.44" ""       "3.92"   ""       "1.99"  
[85] ""       ".77"    ""       "3.01"

Diagnosis 2

Found the problem I think. The negative sign is a bit tricky. In some files it is

SEN SLOPE =-1.07
SEN SLOPE = -.11

Because of the gap after =, I am getting NAs for the first one, but the code is reading the second one. How can I modify the regex to fix this? Thanks!

Solution

Assume "text.txt" is one of your text files. Read into R with readLines, you can use grep to find the line containing SEN SLOPE. With no further arguments, grep returns the index number(s) for the element where the regular expression was found. Here we find that it's the 11th line. Add the value = TRUE argument to get the line as it reads.

x <- readLines("text.txt")
grep("SEN SLOPE", x)
## [1] 11
( gg <- grep("SEN SLOPE", x, value = TRUE) )
## [1] "SEN SLOPE =  .24"

To find all the .txt files in the working directory we can use list.files with a regular expression.

list.files(pattern = "*.txt")
## [1] "text.txt"

LOOPING OVER MULTIPLE FILES

I created a second text file, text2.txt with a different SEN SLOPE value to illustrate how I might apply this method over multiple files. We can use sapply, followed by strsplit, to get the spl values that are desired.

GG <- list.files(pattern = "*.txt")
S <- sapply(seq_along(GG), function(i){
    X <- readLines(GG[i])
    ifelse(length(X) > 0, grep("SEN SLOPE", X, value = TRUE), NA)
    ## added 04/23/14 to account for empty files (as per comment)
})
spl <- unlist(strsplit(S, split = ".*((=|(\\s=))|(=\\s|\\s=\\s))"))
## above regex changed to capture up to and including "=" and 
## surrounding space, if any - 04/23/14 (as per comment)
SenStat <- as.numeric(spl[nzchar(spl)])

Then we can put the results into a data frame and send it to a file with write.table

( SenStatDf <- data.frame(SenStat, file = GG) )
##   SenStat      file
## 1    0.46 text2.txt
## 2    0.24  text.txt

We can write it to a file with

write.table(SenStatDf, "myFile.csv", sep = ", ", row.names = FALSE)

UPDATED 07/21/2014:

Since the result is being written to a file, this can be made much more simple (and faster) with

( SenStatDf <- cbind(
      SenSlope = c(lapply(GG, function(x){
          y <- readLines(x)
          z <- y[grepl("SEN SLOPE", y)]
          unlist(strsplit(z, split = ".*=\\s+"))[-1]
          }), recursive = TRUE),
      file = GG
 ) )
#      SenSlope file       
# [1,] ".46"   "test2.txt"
# [2,] ".24"   "test.txt"

And then written and read into R with

write.table(SenStatDf, "myFile.txt", row.names = FALSE)
read.table("myFile.txt", header = TRUE)
#   SenSlope      file
# 1     1.24 test2.txt
# 2     0.24  test.txt

OTHER TIPS

First make a sample text file:

cat('*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24',file='samp.txt')

Then read it in:

tf <- readLines('samp.txt')

Now extract the appropriate line:

sen_text <- grep('SEN SLOPE',tf,value=T)

And then get the value past the equals sign:

sen_value <- as.numeric(unlist(strsplit(sen_text,'='))[2])

Then combine these results for each of your files (no file structure mentioned in the original question)

If you're text files are always of that format, (eg Sen Slope is always on line 11) and the text is identical over all your files you can do what you need in just two lines.

char_vector <- readLines("Path/To/Document/sample.txt")
statistic <- as.numeric(strsplit(char_vector[11]," ")[[1]][5])

That will give you 0.24.

You then iterate over all your files via an apply statement or a for loop.

For clarity:

> char_vector[11]
[1] "SEN SLOPE =  .24"

and

> strsplit(char_vector[11]," ")
[[1]]
[1] "SEN"   "SLOPE" "="     ""      ".24"

Thus you want [[1]] [5] of the result from strsplit.

Step1: Save complete fileNames in a single variable:

fileNames <- dir(dataDir,full.names=TRUE)

Step2: Lets read and process one of the files, and ensure that it is giving correct results:

data.frame(
  file=basename(fileNames[1]), 
  SEN.SLOPE= as.numeric(tail(
    strsplit(grep('SEN SLOPE',readLines(fileNames[1]),value=T),"=")[[1]],1))
  )

Step3: Do this on all the fileNames

do.call(
  rbind,
  lapply(fileNames, 
         function(fileName) data.frame(
           file=basename(fileName), 
           SEN.SLOPE= as.numeric(tail(
             strsplit(grep('SEN SLOPE',
                           readLines(fileName),value=T),"=")[[1]],1)
             )
           )
         )
  )

Hope this helps!!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow