How to look up a URL on a page
Question
I'm new to regular expressions and things like that. I know only a little, and I think my current problem involves them.
I have a webpage that contains text. I want to get links from the webpage that are only in SPANs that have class="img".
I go through these steps:
- Grab all the SPANs tagged with the "img" class (this is the hard step that I'm asking about)
- Move those SPANs to a new variable
- Parse the variable to get an array with the links (each SPAN has only 1 link, so this will be easy)
I'm using PHP, but the language doesn't matter; I'm looking for how to deal with the first step. Does anyone have a suggestion? Thanks :D
Solution
Use PHP's DOMDocument class in combination with the DOMXPath class to navigate to the elements you need, like this:
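<?php
// Collect libxml parse warnings instead of printing them; real-world HTML is often malformed.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML(file_get_contents('http://foo.bar'));

// Select every <a> nested inside a <span> whose class attribute is exactly "img".
$xpath = new DOMXPath($dom);
$elements = $xpath->query("/html/body//span[@class='img']//a");

foreach ($elements as $a)
{
    echo $a->getAttribute('href'), "\n";
}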
You can learn more about the XPath Language on the W3C page.
OTHER TIPS
A pattern like <span.* class="img".*>([^<]*)</span> should work fine, assuming your code looks something like:
<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>
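<?php
// Sample HTML from above; in practice this would be the fetched page source.
$subject = <<<HTML
<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>
HTML;
$pattern = '@<span.* class="img".*>([^<]*)</span>@i';
preg_match_all($pattern, $subject, $matches);
print_r($matches); // $matches[1] holds the captured text of each matching SPAN
?>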
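One caveat with the pattern above: `.*` is greedy, so it relies on every SPAN sitting on its own line. The following is a minimal sketch in Python, not part of the original answer, that compares it with a variant using `[^>]*`, which cannot run past the end of the opening tag. The `extract_img_spans` helper and the single-line `sample` string are made up for illustration. Like the original, it captures the text inside each SPAN rather than an href.

```python
import re

# The pattern from the answer above, unchanged (greedy .*).
GREEDY = re.compile(r'<span.* class="img".*>([^<]*)</span>', re.IGNORECASE)

# Assumed variant: [^>]* stays inside one opening tag, and \s requires
# whitespace before the class attribute.
IMG_SPAN = re.compile(
    r'<span\b[^>]*\sclass="img"[^>]*>([^<]*)</span>',
    re.IGNORECASE,
)

def extract_img_spans(html):
    """Return the text content of every <span ... class="img" ...> in html."""
    return IMG_SPAN.findall(html)

# Hypothetical input: the same kind of markup, but all on one line.
sample = (
    '<span>not an img class</span>'
    '<span class="img">http://www.img.com/img.jpg</span>'
    '<span alt="yada" class="img" title="still works">link.txt</span>'
)

print(GREEDY.findall(sample))      # ['link.txt'] (one over-long match)
print(extract_img_spans(sample))   # ['http://www.img.com/img.jpg', 'link.txt']
```

Even the tighter pattern only handles SPANs whose content is plain text with class="img" written exactly that way; for nested tags or multiple class names a real HTML parser is the safer choice.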
"I'm using PHP, but the language doesn't matter; I'm looking for how to deal with the first step. Does anyone have a suggestion?"
We-e-ell...
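# Python 3 with BeautifulSoup 4 (the bs4 package)
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://foo.bar'  # placeholder: the page to scan
html = urlopen(url).read()
# Parse only the <span class="img"> elements and their contents
sieve = SoupStrainer(name='span', attrs={'class': 'img'})
tag_soup = BeautifulSoup(html, 'html.parser', parse_only=sieve)
for link in tag_soup('a'):
    print(link['href'])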
(That's Python, using BeautifulSoup; it should work on most documents, well-formed or not.)
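If installing BeautifulSoup is not an option, roughly the same result can be sketched with Python's standard-library html.parser. This is a different technique from the one above, offered only as an illustration: the `ImgSpanLinkParser` class, the `img_span_links` helper, and the sample markup are all hypothetical names and data. It also treats class as a space-separated list, so a SPAN with class="img big" counts as a match.

```python
from html.parser import HTMLParser

class ImgSpanLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <span class="img">."""

    def __init__(self):
        super().__init__()
        self.span_stack = []  # one bool per open <span>: is it an "img" span?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'span':
            classes = (attrs.get('class') or '').split()
            self.span_stack.append('img' in classes)
        elif tag == 'a' and any(self.span_stack) and attrs.get('href'):
            # We are somewhere inside at least one "img" span.
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'span' and self.span_stack:
            self.span_stack.pop()

def img_span_links(html):
    """Return the hrefs of all links found inside "img" spans."""
    parser = ImgSpanLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical sample markup.
sample = '''
<span class="img"><a href="/a.png">A</a></span>
<span><a href="/skip.html">skip</a></span>
<span class="img big"><b><a href="/b.png">B</a></b></span>
'''
print(img_span_links(sample))  # ['/a.png', '/b.png']
```

The stack of booleans is there so that nested SPANs are tracked correctly: a link counts only while at least one enclosing SPAN carries the "img" class.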