Read the data in from "myfile.dat", say (or just start from L below if you have already read it in as separate lines). Now extract the lines that begin with digits followed by a dot and a space, that contain the word Location:, or that start with ID:. Then remove everything in those lines up to and including the last space, giving v2. Create a grouping vector g which identifies the group to which each component of v2 belongs. (We have used the fact that the first field of each group starts with a non-digit and the other fields start with a digit.) Then split v2 into those groups, giving s. Expand short components of s by inserting an NA in the appropriate position, assuming that if a component is short then Location: is the field that is missing. (We assume the first field and the ID: field cannot be missing.) Finally, transpose the result so that the fields are in columns and the cases in rows.
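L <- readLines("myfile.dat")
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)  # keep only the three kinds of line
v2 <- sub(".* ", "", v)  # drop everything up to and including the last space
g <- cumsum(regexpr("^\\D", v2) > 0)  # a leading non-digit starts a new group
s <- split(v2, g)
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)  # pad a missing Location
t(m)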
Using the sample data in the post we get this from the last line:
  [,1]           [,2]      [,3]
1 "ZFP112"       "19q13.2" "7771"
2 "SEP15"        "1p31"    "9403"
3 "MLL4"         "19q13.1" "9757"
4 "LOC100509547" NA        "100509547"
5 "LOC100509587" NA        "100509587"
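To try the steps without the file, here is a minimal self-contained sketch. The lines in L are a hypothetical stand-in for the contents of "myfile.dat": the record layout, including the "Chromosome:" prefix, is an assumption and not the sample data from the question. The second record deliberately has no Location: entry, so the NA padding is exercised.

```r
# Hypothetical input; the real file layout may differ.
L <- c(
  "1. ZFP112",
  "Chromosome: 19; Location: 19q13.2",
  "ID: 7771",
  "",
  "2. LOC100509547",
  "Chromosome: 7",              # no Location: entry in this record
  "ID: 100509547"
)

# Same steps as above, applied to the in-memory lines.
v   <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2  <- sub(".* ", "", v)
g   <- cumsum(regexpr("^\\D", v2) > 0)
s   <- split(v2, g)
m   <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
res <- t(m)
res
```

Note that sapply only simplifies to a matrix when every component ends up with length 3, so a record with extra matching lines, or one missing its ID: line, would need separate handling.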