Question

I'm new to regular expressions and things like that. I have only a little knowledge of them, and I think my current problem involves them.

I have a webpage that contains text. I want to get the links from the webpage that are inside SPANs with class="img".

I plan to go through these steps:

  1. Grab all the SPANs tagged with the "img" class (this is the hard step that I need help with)
  2. Move those SPANs to a new variable
  3. Parse the variable to get an array with the links (each SPAN has only one link, so this will be easy)

I'm using PHP, but the language doesn't matter; I'm looking for how to deal with the first step. Does anyone have a suggestion? Thanks :D

Solution

Use PHP's DOMDocument class in combination with the DOMXPath class to navigate to the elements you need, like this:

<?php
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents('http://foo.bar'));
$xpath = new DOMXPath($dom);

// Select every <a> nested inside a <span class="img">.
$elements = $xpath->query("/html/body//span[@class='img']//a");
foreach ($elements as $a) {
    echo $a->getAttribute('href'), "\n";
}

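You can learn more about the XPath language on the W3C page. Note that [@class='img'] matches only when the class attribute is exactly "img"; a span carrying several classes needs a test such as contains(concat(' ', normalize-space(@class), ' '), ' img ').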
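For readers who aren't on PHP, here is a minimal sketch of the same XPath idea using Python's standard-library ElementTree. It assumes the markup is well-formed (ElementTree is an XML parser and supports only a subset of XPath); the sample markup and the function name links_in_img_spans are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the asker's page).
SAMPLE = """<html><body>
<span class="img"><a href="http://example.com/a.png">A</a></span>
<span class="other"><a href="http://example.com/skip.png">skip</a></span>
<div><span class="img"><b><a href="http://example.com/b.png">B</a></b></span></div>
</body></html>"""


def links_in_img_spans(markup):
    """Return the href of every <a> nested inside a <span class="img">."""
    root = ET.fromstring(markup)
    # Mirrors the //span[@class='img']//a query from the PHP answer.
    anchors = root.findall(".//span[@class='img']//a")
    return [a.get("href") for a in anchors]


if __name__ == "__main__":
    for href in links_in_img_spans(SAMPLE):
        print(href)
```

For real-world pages that are not well-formed, a lenient HTML parser is the safer choice; this sketch is only meant to show that the XPath expression itself carries over.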

OTHER TIPS

A pattern like <span.* class="img".*>([^<]*)</span> should work fine, assuming your markup looks something like this:

<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>


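<?php

// Note: .* is greedy, so this pattern assumes at most one span per line.
$pattern = '@<span.* class="img".*>([^<]*)</span>@i';

// The sample HTML from above.
$subject = <<<'HTML'
<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>
HTML;

// $matches[1] holds the text captured inside each matching span.
preg_match_all($pattern, $subject, $matches);

print_r($matches);

?>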
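One caveat with the pattern above: because .* is greedy, two matching spans on the same line collapse into one match and only the last span's text is captured. Here is a hedged sketch of a tighter variant, written with Python's re module; it swaps .* for [^>]* so each match stays inside a single opening tag. The names PATTERN and img_span_contents are made up for illustration, and like any regex approach it is not a full HTML parser.

```python
import re

# The sample markup from the answer above, one span per line.
SUBJECT = """<span class="img">http://www.img.com/img.jpg</span>
<span alt="yada" class="img">animage.png</span>
<span alt="yada" class="img" title="still works">link.txt</span>
<span>not an img class</span>"""

# [^>]* cannot run past the end of the opening tag, unlike a greedy .*
PATTERN = re.compile(
    r'<span\b[^>]*\bclass="img"[^>]*>([^<]*)</span>',
    re.IGNORECASE,
)


def img_span_contents(html):
    """Return the text inside every <span class="img"> that has no child tags."""
    return PATTERN.findall(html)


if __name__ == "__main__":
    for text in img_span_contents(SUBJECT):
        print(text)
```

This still only captures spans whose content is plain text, as in the sample; spans that wrap an <a> tag are better handled with a parser.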

I'm using PHP, but the language doesn't matter; I'm looking for how to deal with the first step. Does anyone have a suggestion?

We-e-ell...

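# Python 3 with the bs4 package (BeautifulSoup 4)
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

url = 'http://foo.bar'  # placeholder: the page to scrape
html = urlopen(url).read()
# Only parse <span class="img"> elements and their contents
sieve = SoupStrainer(name='span', attrs={'class': 'img'})
tag_soup = BeautifulSoup(html, 'html.parser', parse_only=sieve)
for link in tag_soup.find_all('a', href=True):
    print(link['href'])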

(That's Python, using BeautifulSoup. It should work on most documents, well-formed or not.)
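If installing BeautifulSoup isn't an option, roughly the same filtering can be sketched with Python's standard-library html.parser, which is also tolerant of sloppy markup. This is only an illustration under assumptions: the class ImgSpanLinks and the function extract_links are made-up names, and the sketch treats class as a space-separated list so that class="img thumb" also counts.

```python
from html.parser import HTMLParser


class ImgSpanLinks(HTMLParser):
    """Collect href values of <a> tags nested inside <span class="img">."""

    def __init__(self):
        super().__init__()
        self.span_stack = []  # one bool per open <span>: is it an "img" span?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            self.span_stack.append("img" in classes)
        elif tag == "a" and any(self.span_stack) and attrs.get("href"):
            # We are somewhere inside at least one open "img" span.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()


def extract_links(html):
    """Return the links found inside "img" spans, in document order."""
    parser = ImgSpanLinks()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<span class="img"><a href="/pic.png">pic</a></span>'
    for href in extract_links(page):
        print(href)
```

The stack of booleans covers steps 1 and 2 from the question in a single pass: the parser tracks which spans are open, so there is no need to copy the matching spans into a separate variable first.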

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow