Importazione dei dati associati a un determinato numero di CAS dal sito Web NIST WebBook in r

https://stackoverflow.com//questions/20004779

20-12-2019
|

Domanda

Vorrei recuperare le informazioni associate a un determinato numero di registro CAS (servizio di astratto chimico NR) dal sito Web NIST Webbook in R, utilizzando l'API fornita.

E.G. per CAS NR. "19431-79-9" (Caryophylladienolo II), http:// Webbook. nist.gov/cgi/cbook.cgi?id=19431-79-9&units=si&mask=2000#GAS-Chrom Sono arrivato fino a

casno = "19431-79-9"
casno2 = gsub("-", "", casno)
raw=readLines(paste('http://webbook.nist.gov/cgi/cbook.cgi?ID=',casno,'&Units=SI&Mask=2000#Gas-Chrom', sep=""))

# mass spec, empty here, but not e.g. for casno2="630035" 
casno2="630035"
jcampfile = readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?JCAMP=C",casno2,"&Index=0&Type=Mass",sep=""))
if (jcampfile[[1]]=="##TITLE=Spectrum not found.") jcampfile=NA              

casno2 = gsub("-", "", casno)
# molecular stucture
molfile2d=readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?Str2File=C",casno2,sep=""))
if (molfile2d==character(0)) molfile2d=NA
molfile3d=readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?Str3File=C",casno2,sep=""))
if (molfile3d==character(0)) molfile3d=NA

Dai seguenti bit dell'output grezzo, vorrei quindi estrarre le seguenti variabili ed elenchi:

"name=\" Top \">Caryophylladienol II</a></h1>" 
-> name="Caryophylladienol II"

"Formula</a>:</strong> C<sub>15</sub>H<sub>24</sub>O</li>\n \n \n<li><strong>" 
-> formula="C15H24O"

"Molecular weight</a>:</strong> 220.3505</li>\n \n \n<li>" 
-> MW=220.3505

"IUPAC Standard InChI:</strong>\n \n<br /><table>\n<tr><td>\n<ul style=\" list-style-type: circle;\">\n<li><tt>InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1</tt></li>\n" 
-> InChI="InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1" 

"IUPAC Standard InChIKey:</strong>\n<tt>CIIYOYPOMGIECX-JXQTWKCFSA-N</tt>" 
-> InChiKey="CIIYOYPOMGIECX-JXQTWKCFSA-N"

"Stereoisomers:....<strong>
-> stereoisomers=XXX (list of stereoisomers)

"Other names:...\n"
-> synonyms=XXX (list of synonyms)

"Normal alkane RI..."
-> list of measured RIs plus on which column they were measured
e.g. here RIs=c(1637,1631,1627,1656,1615,1638,1628,1602,1611,1635,1622,1622,1627); columns=c("HP-5 MS","DB-5","RTX-1","Col-Elite 5MS","DB-5","DB-5","DB-5","DB-1","DB-5","CP Sil 5 CB","BP-1","RTX-1","DB-5")

Qualche idea su come vorrei meglio a quest'ultimo tipo di analisi? Idealmente questo dovrebbe essere tutto avvolto in una funzione che effettua un elenco di CAS NRS come input, lo annota utilizzando informazioni dal Webbook NIST e scrive in un file di testo. Ma non c'è bisogno di averlo così lucido - qualsiasi cosa per farmi iniziare avrebbe aiutato davvero!

Modifica: ho cercato di analizzare il file HTML utilizzando HTMLTreeParsese in Package XML, ma non sono abbastanza riuscito. Qualcuno con un po 'più di esperienza con quella funzione sia in grado di aiutarmi un po' per caso?

Modifica: ho capito una soluzione per importare i dati in Mathematica, vedere https://mathematica.stackexchange.com/questions/37091/Look-up-info-associated-with-a-Given-Cas-ChEmical -Identifier -tal-the-nist-webbo . Se qualcuno avrebbe l'abilità per portare quel codice a r per favore fammi sapere!

Soluzione

Per la prima stringa URL nella tua domanda, prova

casno = "19431-79-9"
url <- paste('http://webbook.nist.gov/cgi/cbook.cgi?ID=',casno,'&Units=SI&Mask=2000#Gas-Chrom', sep="")
doc <- htmlParse(url)

name <- xpathSApply(doc, "//a[@id='Top']", xmlValue)
name
[1] "Caryophylladienol II"

Afferra tutte le liste con un titolo audace (un po 'di uscita troncata per display)

x <- xpathSApply(doc, "//li/strong/..", xmlValue)
x

[1] "Formula: C15H24O" 
[2] "Molecular weight: 220.3505" 
[3] "IUPAC Standard InChI:\n\n\nInChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2, ...
[4] "IUPAC Standard InChIKey:\nCIIYOYPOMGIECX-JXQTWKCFSA-N" 
[5] "CAS Registry Number: 19431-79-9"  
[6] "Chemical structure: \nThis structure is also available as a 2d Mol file\n
[7] "Species with the same structure:\nCaryophylla-4(14), 8(15)-dien-5-ol\n\n"
[8] "Stereoisomers:\nCaryophylladienol I\nCaryophylla-3(15),7(14)-dien-6-ol\n«alpha»-Caryophylladienol\nExo methylene ...
[9] "Other names:\nCaryophylla-4(14),8(15)-dien-5«alpha»-ol;\nCaryophylla-2(12),6(13)-dien-5-«alpha»-ol;\nCaryophylla ...
[10] "Information on this page:\nGas Chromatography\nReferences\nNotes / Error Report\n\n"
[11] "Options:\nSwitch to calorie-based units\n\n"

Se stai scrivendo solo a un file, è possibile risolvere l'elenco delimitato nell'elemento 8 (sostituire le nuove linee con punto e virgola) e rimuovere le nuove linee restanti.

x <- gsub(":\n", ": ", x) 
x[8] <- gsub("\n+", ";", x[8])
x <- gsub("\n", "", x)
x <- gsub("Download the identifier in a file.", "", x)

Utilizzare readhtmltable per tables

y <-readHTMLTable(doc, stringsAsFactors=FALSE)

Quindi contare le righe per trovare la tabella corretta e ottenere valori

sapply(y, nrow)
NULL NULL NULL NULL NULL NULL 
   1    1    5   13    6    1 

y[[4]][,2:3]
    Active phase     I
1        HP-5 MS 1637.
2        DB-5 MS 1631.
3          RTX-1 1627.
4  Col-Elite 5MS 1656.
5           DB-5 1615.
...

ri <- paste0(gsub(".", "", y[[4]][,3], fixed=TRUE), "=", y[[4]][,2], collapse=";")
ri
[1] "1637=HP-5 MS;1631=DB-5 MS;1627=RTX-1;1656=Col-Elite 5MS;1615=DB-5;1638=DB-5;1628=DB-5;1602=DB-1;1611=DB-5;1635=CP Sil 5 CB;1622=BP-1;1622=RTX-1;1627=DB-5"

Infine, combina e scrivi a un file

cas <- c(paste("Name:", name), x[c(1:5,7:9)], paste("RI:", ri) )
write( cas, file="cas.out")

Ci sono altri modi per afferrare i valori in elenchi non ordinati, ad esempio, per ottenere tutti gli stereoisomeri come vettore ...

stereo <- xpathSApply(doc, "//li/strong[text()='Stereoisomers:']/../ul/li/a", xmlValue)
 [1] "Caryophylladienol I"                       "Caryophylla-3(15),7(14)-dien-6-ol"         "«alpha»-Caryophylladienol"                
 [4] "Exo methylene isomer of Caryophyllenol I"  "«beta»-Caryophylla-4(14),8(15)-dien-5-ol"  "Caryophylla-4(12),8(13)-dien-5-«beta»-ol" 
 [7] "Caryophylla-4,8-dien-5-ol"                 "Caryophylla-4(12),8(13) diene 5 «beta»-ol" "Caryophyla-4(14),8(15)-dien-5-ol"         
[10] "Caryophylla-4(12).8(13)-diene-5«beta»-ol"  "2(12),6(13)-Caryophylladien-5-ol"

e quindi scrivere più righe su un file invece.

paste("Stereoisomer:", stereo)

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow