Question

I have successfully fetched a PubMed results page in XML format and written its contents to a local file, "Publications.xml". The problem is that simplexml_load_file("Publications.xml") fails, and I can't figure out why.

<?php
$feed = 'http://www.ncbi.nlm.nih.gov/pubmed?term=carl&sort=pubdate&report=xml';
$local = 'Publications.xml';
$curtime = time();
// Refresh the local copy if it is missing or older than 24 hours (86400 seconds)
if( (!file_exists($local)) || ($curtime - filemtime($local)) > 86400 )
{
    $contents = file_get_contents($feed);
    $fp = fopen($local,"w");
    fwrite($fp, $contents);
    fclose($fp);
}
$xml = simplexml_load_file($local) or die("Can't");
?>

The parser fails on the second-to-last line and I get the message "Can't". I have double-checked the XML file and it appears to be in good shape.

If anyone knows of a workaround, I would be very grateful. Here's a copy of the XML file that the PHP script above tries to read (http://pastebin.com/U0fEKmZL):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<pre>
&lt;PubmedArticle&gt;
    &lt;MedlineCitation Status="Publisher" Owner="NLM"&gt;
        &lt;PMID Version="1"&gt;23314841&lt;/PMID&gt;
        &lt;DateCreated&gt;
            &lt;Year&gt;2013&lt;/Year&gt;
            &lt;Month&gt;1&lt;/Month&gt;
            &lt;Day&gt;14&lt;/Day&gt;
        &lt;/DateCreated&gt;
        &lt;Article PubModel="Print-Electronic"&gt;
            &lt;Journal&gt;
                &lt;ISSN IssnType="Electronic"&gt;1432-0932&lt;/ISSN&gt;
                &lt;JournalIssue CitedMedium="Internet"&gt;
                    &lt;PubDate&gt;
                        &lt;Year&gt;2013&lt;/Year&gt;
                        &lt;Month&gt;Jan&lt;/Month&gt;
                        &lt;Day&gt;12&lt;/Day&gt;
                    &lt;/PubDate&gt;

 ... (too long, see link)

Solution

For some reason, the PubMed server returns the XML wrapped in an HTML file: a single <pre> tag whose text content is the entity-escaped XML. That content also consists of multiple XML fragments (there are several <PubmedArticle> elements and no container around them). Clearly this is intended to be processed by some wacky custom code.

You could "unwrap" the XML by calling SimpleXML twice, like so:

$outer_xml = simplexml_load_file($local);
$inner_xml = simplexml_load_string('<dummyContainer>' . (string)$outer_xml . '</dummyContainer>');
foreach ( $inner_xml->PubmedArticle as $article )
{
    // etc
}

To explain:

  • the outer "XML document" is the HTML, which has a single outer element of <pre>
  • casting that to string (which I've done explicitly with (string) for clarity and good habit) will give you the contents of that <pre> tag, i.e. all the <PubmedArticle> elements
  • wrapping that content in a <dummyContainer> tag will give you a valid XML document, with each of the <PubmedArticle> elements as a top-level child in the document
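
The steps above can be tried without touching the network. What follows is only a minimal sketch: the $downloaded string is a hypothetical, shortened stand-in for Publications.xml (no DOCTYPE, two tiny articles, and the second PMID is a made-up placeholder), and the error handling via libxml_use_internal_errors() is an addition that is not part of the original answer.

```php
<?php
// Hypothetical stand-in for the downloaded file: an outer <pre> element
// whose text content is entity-escaped XML fragments with no common root.
$downloaded = <<<DOC
<pre>
&lt;PubmedArticle&gt;&lt;PMID Version="1"&gt;23314841&lt;/PMID&gt;&lt;/PubmedArticle&gt;
&lt;PubmedArticle&gt;&lt;PMID Version="1"&gt;99999999&lt;/PMID&gt;&lt;/PubmedArticle&gt;
</pre>
DOC;

// Collect parser errors instead of emitting warnings, so they can be printed.
libxml_use_internal_errors(true);

// Step 1: parse the outer document; its root element is <pre>.
$outer = simplexml_load_string($downloaded);
if ($outer === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
    exit(1);
}

// Step 2: the string cast yields the unescaped text inside <pre>;
// wrapping it in a single container element makes it well-formed XML.
$inner = simplexml_load_string('<dummyContainer>' . (string)$outer . '</dummyContainer>');
if ($inner === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
    exit(1);
}

// Step 3: every <PubmedArticle> is now a direct child of the container.
$pmids = [];
foreach ($inner->PubmedArticle as $article) {
    $pmids[] = (string)$article->PMID;
}
echo implode(', ', $pmids), PHP_EOL;
```

Comparing the result with === false, rather than relying on "or", keeps the failure check independent of how a SimpleXMLElement converts to a boolean.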

Other tips

Try urlencoding.

Note:

Libxml 2 unescapes the URI, so if you want to pass e.g. b&c as the URI parameter a, you have to call simplexml_load_file(rawurlencode('http://example.com/?a=' . urlencode('b&c'))). Since PHP 5.1.0 you don't need to do this because PHP will do it for you.

(Source: PHP manual, simplexml_load_file)
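
To make the encoding in that note concrete, here is a small sketch of what the two calls produce for the note's own example values (example.com and b&c). Note that the script in the question fetches the remote URL with file_get_contents() and only passes a local path to simplexml_load_file(), so this tip may not apply to it directly.

```php
<?php
// Values taken from the manual note quoted above.
$value = 'b&c';

// urlencode() escapes the ampersand so it is not read as a parameter separator.
$encodedValue = urlencode($value);

// The URL with the encoded parameter value.
$url = 'http://example.com/?a=' . $encodedValue;

// rawurlencode() on the whole URL is the extra step the note describes
// for PHP versions before 5.1.0; it also escapes the existing percent sign.
$doubleEncoded = rawurlencode($url);

echo $encodedValue, PHP_EOL;
echo $url, PHP_EOL;
echo $doubleEncoded, PHP_EOL;
```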

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow