Question

At the moment, I'm trying to scrape forms from some sites using the following query:

select * from html 
where url="http://somedomain.com" 
and xpath="//form[@action]"

This returns a result like so:

{
    form: {
        action: "/some/submit",
        id: "someId",
        div: {
            input: [
               ... some input elements here
            ]
        },
        fieldset: {
            div: {
                input: [
                    ... some more input elements here
                ]
            }
        }
    }
}

On some sites the nesting could go many levels deep, so I'm not sure how to begin filtering the unwanted elements out of the result. If I could filter them out in the query itself, my back-end code would be much simpler. Basically, I'd just like the form and any label, input, select (and option) and textarea descendants.

Here's an XPath query I tried:

//form[@action]/descendant-or-self::*[self::form or self::input or self::select or self::textarea or self::label]

It returns only the elements I want, but they are no longer nested under the divs and other elements beneath the form. Since the element hierarchy is not maintained, this might cause a problem if there are multiple forms on the page.

Solution

I don't think this is possible with a plain query like the ones you have tried.

However, it would not be too much work to create a new data table containing some JavaScript that does the filtering you're looking for.

Data table

A small <execute> block might look something like the following.

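// url, xpath and filter are assumed to be defined elsewhere in the table
// (for example as inputs); filter is an XPath selecting the descendants to keep.
var elements = y.query("select * from html where url=@u and xpath=@x", {u: url, x: xpath}).results.elements();
var results = <url url={url}></url>;
for each (var element in elements) {
    // Copy the matched element (keeping its attributes), then empty the copy.
    var result = element.copy();
    result.setChildren("");
    result.normalize();
    // Append only the descendants matched by the filter expression.
    for each (var descendant in y.xpath(element, filter)) {
        result.node += descendant;
    }
    results.node += result;
}
response.object = results;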
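
If writing a data table is not an option, the same idea (keep each form, drop the wrappers, keep the fields) could also be applied in back-end code to the JSON result. The following is only a rough sketch in plain JavaScript. It assumes the JSON shape shown in the question, where a single child element appears as an object and repeated elements appear as an array, and it treats string-valued keys on the form as attributes. The names WANTED, collectFields and filterForm are made up for this example and are not part of YQL.

```javascript
// Tag names to keep; option elements stay nested inside their select.
var WANTED = ["label", "input", "select", "textarea"];

// Walk a YQL-style JSON tree and gather wanted elements into flat lists,
// one list per tag name. Wrapper elements (div, fieldset, ...) are skipped.
function collectFields(node, found) {
    if (node === null || typeof node !== "object") {
        return found; // attribute values and text carry no child elements
    }
    if (Array.isArray(node)) {
        node.forEach(function (item) { collectFields(item, found); });
        return found;
    }
    Object.keys(node).forEach(function (key) {
        var value = node[key];
        if (WANTED.indexOf(key) !== -1) {
            // concat() accepts both a single object and an array of objects.
            found[key] = (found[key] || []).concat(value);
        } else {
            collectFields(value, found);
        }
    });
    return found;
}

// Build a filtered copy of one form: its attributes plus the wanted fields.
function filterForm(form) {
    var result = {};
    Object.keys(form).forEach(function (key) {
        if (typeof form[key] === "string") {
            result[key] = form[key]; // assumed to be an attribute (action, id, ...)
        }
    });
    var fields = collectFields(form, {});
    Object.keys(fields).forEach(function (key) { result[key] = fields[key]; });
    return result;
}

// Hypothetical sample data modelled on the result in the question.
var results = {
    form: [
        {
            action: "/some/submit",
            id: "someId",
            div: { input: [{ name: "a" }, { name: "b" }] },
            fieldset: { div: { input: { name: "c" }, label: { content: "C" } } }
        },
        {
            action: "/other",
            fieldset: { select: { name: "s", option: [{ value: "1" }, { value: "2" }] } }
        }
    ]
};

// [].concat() copes with both a single form (object) and several (array).
var forms = [].concat(results.form).map(filterForm);
console.log(JSON.stringify(forms, null, 2));
```

Each form keeps its own fields, so multiple forms on a page stay separate. The trade-off is that fields are grouped by tag name, so the original document order across different tags is not preserved.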

» See the full example data table.

Example query

use "store://VNZVLxovxTLeqYRH6yQQtc" as example;
select * from example where url="http://www.yahoo.com"

» See this query in the YQL console

Hopefully the above is a step in the right direction, and doesn't look too daunting.

OTHER TIPS

This is how I would filter specific nodes but still allow the parent tag with all attributes to show:

//form[@name]/@* | //form[@action]/descendant-or-self::node()[name()='input' or name()='select' or name()='textarea' or name()='label']

If there are multiple form tags on the page, their fields should be grouped under the corresponding parent form rather than all merged together with no way to tell them apart.

You could also reverse the order of the union if that better matches how you'd like the nodes to appear:

//form[@action]/descendant-or-self::node()[name()='input' or name()='select' or name()='textarea' or name()='label'] | //form[@name]/@*
Licensed under: CC-BY-SA with attribution