Read specific words out of complex XML

Question

Instead of an array, you can also approach this with an Iterator that encapsulates the logic for traversing the HTML in the description element to find the meals. It is simple to use because it hides the complexity of the parsing.

Here is an example followed by the output:

$uri = 'http://www.studentenwerk-berlin.de/speiseplan/rss/htw_wilhelminenhof/tag/lang/0000000000000000000000000';
$rss = simplexml_load_file($uri);
$meals = new MealIterator($rss->channel->item->description, 'Salate');
foreach ($meals as $entry) {
    vprintf("%s - %s\n", $entry);
}

Output:

Große Salatschüssel mit gekochtem Ei - EUR 1.55 / 2.50 / 3.25
Kleine Salatschale - EUR 0.55 / 0.90 / 1.15
Doppelt-Große Salatschale - EUR 2.95 / 4.70 / 6.20
Große Salatschale - EUR 1.55 / 2.50 / 3.25

The iterator makes use of PHP's built-in DOM functionality, namely DOMDocument and DOMXPath. The first step is to obtain the table rows, each of which contains one meal. This already happens in the constructor, with an XPath query:

public function __construct($html, $meal)
{
    $doc   = $this->createHtmlDoc($html);
    $xpath = new DOMXPath($doc);
    $expr  = sprintf('//th[.=%s]/../../following-sibling::tr', $this->xpathString($meal));
    $items = $xpath->query($expr);
    if ($items === FALSE) {
        throw new UnexpectedValueException('Failed to query the HTML document');
    }
    parent::__construct($items);
}

The key tool here is XPath. The query returns a node list in which each <tr> element contains one meal.
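To make the query concrete, here is a minimal sketch that runs the same expression against a small HTML fragment. The markup is an assumption made up for illustration (a th inside a thead, followed by plain tr rows); the real feed may be structured differently. The XML declaration prefix is a common workaround to make loadHTML() treat the input as UTF-8, not something taken from the original code.

```php
<?php
// Hypothetical markup: the real description HTML may look different.
$html = <<<'HTML'
<table>
  <thead><tr><th>Salate</th></tr></thead>
  <tr><td>Kleine Salatschale</td><td>EUR 0.55 / 0.90 / 1.15</td></tr>
  <tr><td>Große Salatschale</td><td>EUR 1.55 / 2.50 / 3.25</td></tr>
</table>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings about sloppy HTML instead of emitting them.
$previous = libxml_use_internal_errors(true);
// Assumed workaround: the prefix hints that the input is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// th -> its tr -> the tr's parent (thead here) -> all following sibling rows.
$rows = $xpath->query("//th[.='Salate']/../../following-sibling::tr");

printf("%d rows\n", $rows->length);
foreach ($rows as $row) {
    // Print the first cell of each row, which holds the meal name in this sample.
    echo trim($row->getElementsByTagName('td')->item(0)->textContent), "\n";
}
```

The two parent steps only match when the header cell's grandparent is a sibling of the data rows, so the expression has to be adapted if the table is nested differently.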

The data of each meal still needs to be extracted. This is done in the iterator's current() method:

public function current()
{
    $entry = parent::current();
    $tds   = $entry->getElementsByTagName('td');
    $name  = $this->childTextContent($tds->item(0));
    $price = trim($tds->item(1)->textContent);
    return compact("name", "price");
}

This uses only DOMElement traversal methods (documented in the manual). Because the name cell was a bit harder to parse, a quickly written helper method fetches only the content of the cell's direct child text nodes for the name of the meal:

private function childTextContent(DOMNode $node)
{
    $buffer = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $buffer .= $child->textContent;
        }
    }
    return trim($buffer);
}
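The difference between textContent and the helper is easiest to see on a cell that contains nested markup. The following sketch uses a standalone copy of the helper and a made-up cell with a link inside; the cell content is an assumption for illustration only.

```php
<?php
// Standalone copy of the helper so the snippet runs on its own.
function childTextContent(DOMNode $node): string
{
    $buffer = '';
    foreach ($node->childNodes as $child) {
        // Only direct text children count; nested elements are skipped.
        if ($child instanceof DOMText) {
            $buffer .= $child->textContent;
        }
    }
    return trim($buffer);
}

$doc = new DOMDocument();
// Hypothetical cell: the name is followed by markup that should not end up in the name.
$doc->loadHTML('<table><tr><td>Kleine Salatschale <a href="#">Info</a></td></tr></table>');
$td = $doc->getElementsByTagName('td')->item(0);

echo $td->textContent, "\n";       // text of the cell including the link text
echo childTextContent($td), "\n";  // only the direct text of the cell
```

textContent concatenates the text of all descendants, while the helper keeps only the text nodes that are direct children of the cell.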

(You can see the full code of the iterator.)

Key points in this solution:

Encapsulate the parsing in an iterator - if the source changes, the parsing might have to change as well, but not the whole program.
Re-use existing libraries like SimpleXML and its sister library DOMDocument.
Solve the problem by dividing it from big into small parts.
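The first key point can be illustrated in isolation. The calls to parent::__construct() and parent::current() in the snippets above suggest that the class extends IteratorIterator; that is an assumption, as is everything in the sketch below, including the class name PipeEntryIterator and its "name | price" input format. It only shows the pattern: the parsing lives in current(), and the calling code sees finished entries.

```php
<?php
// Hypothetical class: wraps raw lines and parses each one on access.
class PipeEntryIterator extends IteratorIterator
{
    public function __construct(array $lines)
    {
        parent::__construct(new ArrayIterator($lines));
    }

    // Turn the raw line of the inner iterator into a name/price entry.
    public function current(): array
    {
        [$name, $price] = array_map('trim', explode('|', parent::current(), 2));
        return compact('name', 'price');
    }
}

$meals = new PipeEntryIterator([
    'Kleine Salatschale | EUR 0.55 / 0.90 / 1.15',
    'Doppelt-Grosse Salatschale | EUR 2.95 / 4.70 / 6.20',
]);

foreach ($meals as $entry) {
    vprintf("%s - %s\n", $entry);
}
```

If the input format changes, only current() needs to be touched; the foreach loop stays the same.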

If you now say you want an array instead of an iterator, you are already pretty close. Just convert the iterator into an array:

print_r(iterator_to_array($meals, false));

Array
(
    [0] => Array
        (
            [name] => Große Salatschüssel mit gekochtem Ei
            [price] => EUR 1.55 / 2.50 / 3.25
        )

    [1] => Array
        (
            [name] => Kleine Salatschale
            [price] => EUR 0.55 / 0.90 / 1.15
        )

    [2] => Array
        (
            [name] => Doppelt-Große Salatschale
            [price] => EUR 2.95 / 4.70 / 6.20
        )

    [3] => Array
        (
            [name] => Große Salatschale
            [price] => EUR 1.55 / 2.50 / 3.25
        )

)
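The second argument of iterator_to_array() controls whether the iterator's keys are preserved. Passing false re-indexes the result, which is the safe choice whenever keys might repeat. A small sketch with a made-up generator (not part of the original code) shows the effect:

```php
<?php
// Hypothetical generator: "yield from" keeps the inner keys, so key 0 occurs twice.
function meals(): Generator
{
    yield from ['Kleine Salatschale'];
    yield from ['Doppelt-Grosse Salatschale'];
}

// With preserved keys the second entry overwrites the first one.
$keyed = iterator_to_array(meals(), true);
// Without preserved keys both entries survive under new numeric keys.
$list = iterator_to_array(meals(), false);

printf("%d vs. %d entries\n", count($keyed), count($list));
```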

The routine to create an XPath string is from: Mitigating XPath Injection Attacks in PHP
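The referenced routine is not reproduced in this answer. As a rough idea of what such a helper does, here is a minimal sketch of the usual technique for XPath 1.0, which has no escape character inside string literals: pick the quote type that does not occur in the value, and fall back to concat() when both occur. It is written as a standalone function and is an assumption, not the code from the article.

```php
<?php
// Sketch of XPath 1.0 string quoting; not the routine from the referenced article.
function xpathString(string $value): string
{
    // No single quote inside: wrap in single quotes.
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // No double quote inside: wrap in double quotes.
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: split on single quotes and join the pieces with concat().
    $parts = array_map(
        function (string $part): string {
            return "'" . $part . "'";
        },
        explode("'", $value)
    );
    return 'concat(' . implode(", \"'\", ", $parts) . ')';
}

echo xpathString('Salate'), "\n";
echo xpathString("Tom's \"Salat\""), "\n";
```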