Extract text between multilevel repetitive XML tags using PHP

https://stackoverflow.com//questions/24022500

21-12-2019
Question

I am trying to extract text between multilevel XML tags.
This is the data file:
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv> NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995 </WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack> <TermSet> <Term>BRCA1[tiab]</Term> . . . </TranslationStack>
</eSearchResult>
I just want to extract the ten IDs from the <Id></Id> tags enclosed inside <IdList></IdList>. My regex gets me only the first of the ten values:
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids)
The XML data is stored in the $temp_str variable, and I am trying to get the values into $pids. Any other suggestions for how to go about this?

Solution

Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches the digits within an <Id> tag. The trickiest part (I think) is the foreach loop, where I iterate over $out[1]. This is because, according to the manual page linked above, the PREG_PATTERN_ORDER flag:

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

<?php
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
   "<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
$out,PREG_PATTERN_ORDER);
foreach ($out[1] as $o){
      echo $o;
      echo "\n";
}
?>
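
To make the layout of the result array concrete, here is a minimal sketch that runs the same regex with both ordering flags. The two-element input and the variable names $byPattern and $bySet are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical two-element sample, shortened from the question's data.
$xml = "<IdList> <Id>24887359</Id> <Id>24884828</Id> </IdList>";
$pattern = '/<Id>\s*(\d+)\s*<\/Id>/';

// PREG_PATTERN_ORDER (the default): results are grouped by subpattern.
// $byPattern[0] holds the full matches, $byPattern[1] the captured digits.
preg_match_all($pattern, $xml, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: results are grouped by match instead.
// $bySet[0] is [full match, captured digits] for the first <Id>.
preg_match_all($pattern, $xml, $bySet, PREG_SET_ORDER);

print_r($byPattern[1]);
print_r($bySet[0]);
```

With PREG_PATTERN_ORDER the IDs are all in index 1, which is why the loop above iterates $out[1] rather than $out itself.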

OTHER TIPS

You should use PHP's XPath capabilities, as explained here:

http://www.w3schools.com/php/func_simplexml_xpath.asp

Example:

<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>

XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
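
Since the question keeps the XML in a string rather than a file, the same query can probably be run with simplexml_load_string. The sketch below assumes a shortened, well-formed stand-in for $temp_str; the truncated sample in the question would not parse as-is.

```php
<?php
// Shortened stand-in for the question's $temp_str (assumption: well-formed XML).
$temp_str = '<eSearchResult><Count>7117</Count><IdList>'
          . '<Id>24887359</Id><Id>24884828</Id><Id>24884718</Id>'
          . '</IdList><TranslationSet/></eSearchResult>';

$xml = simplexml_load_string($temp_str);
if ($xml === false) {
    exit("Could not parse XML\n");
}

// xpath() returns an array of SimpleXMLElement objects;
// cast each one to a string to get the plain ID text.
$pids = [];
foreach ($xml->xpath('IdList/Id') as $id) {
    $pids[] = (string) $id;
}

print_r($pids);
```

Casting to string matters here: print_r on the raw xpath() result shows SimpleXMLElement objects, not plain values.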

Use the pattern (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\> with the g (global) option. PHP has no g modifier; preg_match_all already matches globally.
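
A minimal sketch of how this pattern behaves in PHP, assuming a made-up input in which extra <Id> elements sit outside <IdList>. The \G anchor only matches where the previous match ended, so the chain starts at <IdList> and stops at the first gap.

```php
<?php
// Hypothetical sample: two <Id> elements sit outside <IdList>
// to show that the pattern skips them.
$temp_str = '<eSearchResult><Id>999</Id><IdList> <Id>24887359</Id> '
          . '<Id>24884828</Id> </IdList><Id>111</Id></eSearchResult>';

// Same pattern as above, without the redundant escapes on < and >.
// Each match must start at <IdList> or directly after the previous match.
$pattern = '~(?:<IdList>|\G)\s*<Id>(\d+)</Id>~';

preg_match_all($pattern, $temp_str, $m);

print_r($m[1]);
```

Note that \G also matches at the very start of the subject, so an <Id> element at offset 0 would be picked up as well; that cannot happen with the question's data, which begins with <eSearchResult>.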

Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath to fetch parts of an XML DOM.

If you want any Id element in the first IdList of the eSearchResult

/eSearchResult/IdList[1]/Id

As you can see Xpath "knows" about the actual structure of an XML document. PCRE does not.

You need to create an XPath object for a DOM document:

$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXpath($dom);

$result = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
  $result[] = trim($id->nodeValue);
}
var_dump($result);
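
To illustrate the point that XPath follows the document structure, here is a self-contained sketch using a made-up input in which an <Id> also appears under a hypothetical <Other> element. It also shows that evaluate() can return a scalar, such as a node count, for expressions that are not node sets.

```php
<?php
// Hypothetical sample: the <Id> under <Other> should not be selected.
$xmlString = '<eSearchResult><IdList><Id> 24887359 </Id><Id>24884828</Id></IdList>'
           . '<Other><Id>111</Id></Other></eSearchResult>';

$dom = new DOMDocument();
$dom->loadXML($xmlString);
$xpath = new DOMXPath($dom);

// Only Id elements that are children of the first IdList are returned.
$ids = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
    $ids[] = trim($id->nodeValue);
}

// For a scalar expression, evaluate() returns a typed value (a float here).
$count = $xpath->evaluate('count(/eSearchResult/IdList[1]/Id)');

var_dump($ids, $count);
```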

Licensed under: CC-BY-SA with attribution
