What is the best way to parse this flat text format in R?

https://stackoverflow.com/questions/16025342

04-04-2022
|

Question

1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Other Aliases: ZNF112, ZNF228
Other Designations: zfp-112; zinc finger protein 112; zinc finger protein 228
Chromosome: 19; Location: 19q13.2
Annotation: Chromosome 19NC_000019.9 (44830706..44860856, complement)
ID: 7771

2. SEP15
15 kDa selenoprotein[Homo sapiens]
Chromosome: 1; Location: 1p31
Annotation: Chromosome 1NC_000001.10 (87328128..87380107, complement)
MIM: 606254
ID: 9403

3. MLL4
myeloid/lymphoid or mixed-lineage leukemia 4[Homo sapiens]
Other Aliases: HRX2, KMT2B, MLL2, TRX2, WBP7
Other Designations: KMT2D; WBP-7; WW domain binding protein 7; WW domain-binding protein 7; histone-lysine N-methyltransferase MLL4; lysine N-methyltransferase 2B; lysine N-methyltransferase 2D; mixed lineage leukemia gene homolog 2; myeloid/lymphoid or mixed-lineage leukemia protein 4; trithorax homolog 2; trithorax homologue 2
Chromosome: 19; Location: 19q13.1
Annotation: Chromosome 19NC_000019.9 (36208921..36229779)
MIM: 606834
ID: 9757

37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547

43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587

I want to get the gene name (ZFP112, SEP15, MLL4), the Location field (if present), the ID field, and skip the other stuff. All the string utilities like scan() seem geared toward more regular data. The blank line between records is effectively the record separator. I can write this to disk and read it back in with readLines() but I'd prefer to do it from memory since I downloaded it over HTTP.

Solution

Read the data in from "myfile.dat", say, (or just start from L below if you have previously read it in as separate lines). Now extract those lines that begin with digits followed by a dot followed by a space or that contain the word Location: or start with ID:. Then remove everything in those lines up to and including the last space. Create a group vector g which identifies the group to which each component of v2 belongs. (We have used the fact that the beginning field of each group starts with a non-digit and the other fields start with a digit.) Then split v2 into those groups . Expand short components of s by appropriately inserting an NA assuming that if its short that Location: is missing. (We assume the first field and the ID fields cannot be missing.) Finally transpose it so that the fields are in columns and the cases in rows.

L <- readLines("myfile.dat")
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)

g <- cumsum(regexpr("^\\D", v2) > 0)
s <- split(v2, g)
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
t(m)

Using the sample data in the post we get this from the last line:

  [,1]           [,2]      [,3]       
1 "ZFP112"       "19q13.2" "7771"     
2 "SEP15"        "1p31"    "9403"     
3 "MLL4"         "19q13.1" "9757"     
4 "LOC100509547" NA        "100509547"
5 "LOC100509587" NA        "100509587"

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow