DOMDocument: access the next following tag in PHP

https://stackoverflow.com/questions/18237921

24-06-2022

Question

I have installed a JSON plugin and retrieved the content of an HTML page. Now I want to parse it and find a particular table that has only a class and no id. I parse the page with the PHP class DOMDocument. My idea is to access the tag just before the table and then somehow reach the next following tag (my table) using DOMDocument. Example:

<a name="Telefonliste" id="Telefonliste"></a>
<table class="wikitable">

So I first get the <a>, and after that I get the <table>.

So far I fetch all the tables with getElementsByTagName() and then access item(2), which is where my table is:

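$dom = new DOMDocument();

// discard white space (set before loading so it can take effect)
$dom->preserveWhiteSpace = false;

// load the HTML source; loadHTML() returns a bool, so the result is not kept
$dom->loadHTML($myHtml);

// all tables by tag name; item(2) is the third table on the page
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(2)->getElementsByTagName('tr');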

This works, but I want to make it more general. Right now I know that the table is at item(2), but that position can change, e.g. if a new table is added to the HTML page before mine; my table would then be at item(3) instead. So I want to parse the page in a way that still reaches this table without changing my code. Can I do that using DOMDocument as a DOM parser?

Solution

You can use DOMXPath, and make the expression as general as you need it.

For example:

$dom = new DOMDocument();

//discard white space
$dom->preserveWhiteSpace = false;

//load html source
$dom->loadHTML($myHtml);

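$domxpath = new DOMXPath($dom);
// XPath positions are 1-based; item(0) returns the first matching table
$table = $domxpath->query('//table[@class="wikitable" and not(@id)]')->item(0);
// previousSibling can be a whitespace text node rather than the <a> element
$elementBeforeTable = $table->previousSibling;
$rows = $table->getElementsByTagName('tr');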
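
If several tables share the class, the expression can also be anchored on the <a> element from the question, so the lookup no longer depends on how many tables come first. The following is only a minimal sketch: the sample markup and cell values in $myHtml are made up for illustration, and it assumes the anchor keeps id="Telefonliste" and that the wanted table is the first table after it in document order.

```php
<?php
// Hypothetical sample page, modelled on the snippet in the question.
$myHtml = <<<HTML
<html><body>
<table class="wikitable"><tr><td>some other table</td></tr></table>
<a name="Telefonliste" id="Telefonliste"></a>
<table class="wikitable">
<tr><td>Alice</td><td>123</td></tr>
<tr><td>Bob</td><td>456</td></tr>
</table>
</body></html>
HTML;

$dom = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($myHtml);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// First <table> that follows the anchor in document order.
// XPath positions are 1-based, hence [1].
$table = $xpath
    ->query('//a[@id="Telefonliste"]/following::table[1]')
    ->item(0);

// Collect the trimmed cell text of every row; stays empty if nothing matched.
$rows = [];
if ($table !== null) {
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
}
```

If the table is always a direct sibling of the anchor, following-sibling::table[1] is a stricter alternative to following::table[1]. Either way, item(0) returns null when nothing matches, so it is worth checking before reading rows.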

Other tips

I've started writing a simple extension of this for the purpose of web scraping. I'm not 100% on the direction I want to take with it yet, but you can see an example of how to get the original HTML back in the response of the search rather than just raw text.

https://github.com/WolfeDev/PageScraper

EDIT: I plan on implementing basic table parsing soon.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow