Question

Lately I've been working on a project in Python that involves scraping a few websites for proxies. The problem I'm running into is that when I try to scrape a certain well-known proxy site, Beautiful Soup doesn't do what I expect when I ask it to find the IPs in the table of proxies. I attempt to scrape the IP for each proxy, and I get output like this when I call Beautiful Soup's .get_text() method on the appropriate element:

...

.UbZT{display:none}
.f5fa{display:inline}
.Glj2{display:none}
.cUce{display:inline}
.zjUZ{display:none}
.GzLS{display:inline}
98120169.117.186373161218218.83839393101138154165203242 

...

Here's the element that I'm trying to parse (the td tag which contains the IP):

<td><span><style>
.lLXJ{display:none}
.qRCB{display:inline}
.qC69{display:none}
.V0zO{display:inline}
</style><span style="display: inline">190</span><span class="V0zO">.</span><span 
style="display:none">2</span><div style="display:none">20</div><span 
style="display:none">51</span><span style="display:none">56</span><div 
style="display:none">56</div><span style="display:none">61</span><span 
class="lLXJ">61</span><div style="display:none">61</div><span 
class="qC69">110</span><div 
style="display:none">110</div><span style="display:none">135</span><div 
style="display:none">135</div><span class="V0zO">221</span><span 
style="display:none">234</span><div style="display:none">234</div><span class="147">.
</span><span style="display: inline">29</span><div style="display:none">44</div><span 
style="display:none">228</span><span></span><span class="qC69">248</span>.<span 
style="display:none">7</span><span></span><span style="display:none">44</span><span 
class="qC69">44</span><span class="qC69">80</span><span></span><span 
style="display:none">85</span><span class="lLXJ">85</span><div 
style="display:none">85</div><span class="qC69">100</span><div 
style="display:none">100</div><span></span><span class="qC69">130</span><div 
style="display:none">130</div><div style="display:none">168</div>212<span 
style="display:none">230</span><span class="qC69">230</span><div 
style="display:none">230</div></span></td>  

The actual text of this element is simply the IP for the proxy.

Here's the snippet of my code:

import requests
from bs4 import BeautifulSoup as Soup

# Hide My Ass
pages = ['https://www.hidemyass.com/proxy-list']
num_found = 0

for page in pages:
    hidemyass = Soup(requests.get(page).text, 'html.parser')
    # data rows are the <tr> elements that carry a class attribute
    rows = hidemyass.find_all(lambda tag: tag.name == 'tr' and tag.has_attr('class'))
    for row in rows:
        fields = row.find_all('td')
        # get ip, port, and protocol for proxy
        ip = fields[1].get_text()            # <-- Here's the above td element
        port = fields[2].get_text()
        protocol = fields[6].get_text().lower()
        # store proxy in database (db is my own storage object, defined elsewhere)
        db.add_proxy({'ip': ip, 'port': port, 'protocol': protocol})
        num_found += 1

Is there a correct way to parse this element so that the output won't get jumbled up like this? It seems intuitive that Beautiful Soup's .get_text() method would return exactly the text that is visible on the site, but I suppose that's not true. Thanks for any help or advice.


Solution

Beautiful Soup cannot distinguish visible text from hidden text: .get_text() returns every text node in the markup, regardless of CSS. This particular site deliberately obfuscates its markup to make scraping harder. You can work out which text is visible yourself, but it is not trivial: many decoy elements are inserted and hidden either directly through an inline style attribute or indirectly through a class defined in the <style> block. In addition, some parts of the IP are wrapped in spans, while others are bare text nodes that belong to no tag of their own.
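
To make the "work out what is visible" idea concrete, here is a minimal sketch that filters out the hidden elements by hand. It uses only the standard library's html.parser rather than Beautiful Soup, so it runs without extra dependencies. The names `visible_text`, `VisibleTextParser` and the constants are hypothetical, and the sample is an abbreviated version of the td element from the question. It assumes the only hiding tricks are an inline `display:none` style and simple class rules in the `<style>` block; anything more elaborate would defeat it.

```python
import re
from html.parser import HTMLParser

# Matches CSS rules such as ".lLXJ{display:none}" and captures the class name.
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*\}", re.I)
# Matches an inline style that hides the element, e.g. style="display:none".
HIDDEN_INLINE = re.compile(r"display\s*:\s*none", re.I)
# Tags that never have a closing tag; they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not inside a hidden element."""

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = hidden_classes
        self.hidden_stack = []  # one hidden/visible flag per open element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        hidden = (
            tag == "style"  # the CSS rules themselves are never visible text
            or bool(HIDDEN_INLINE.search(attrs.get("style") or ""))
            or any(c in self.hidden_classes for c in classes)
        )
        # Anything nested inside a hidden element is hidden as well.
        inherited = bool(self.hidden_stack) and self.hidden_stack[-1]
        self.hidden_stack.append(hidden or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not (self.hidden_stack and self.hidden_stack[-1]):
            self.parts.append(data)


def visible_text(fragment):
    """Return the text of an HTML fragment with hidden elements removed."""
    hidden_classes = set(HIDDEN_RULE.findall(fragment))
    parser = VisibleTextParser(hidden_classes)
    parser.feed(fragment)
    parser.close()
    # Drop the stray whitespace the obfuscated markup leaves behind.
    return "".join("".join(parser.parts).split())


# Abbreviated sample based on the td element shown in the question.
sample = (
    '<td><span><style>\n'
    '.lLXJ{display:none}\n'
    '.V0zO{display:inline}\n'
    '</style>'
    '<span style="display: inline">190</span>'
    '<span class="V0zO">.</span>'
    '<span style="display:none">2</span>'
    '<div style="display:none">20</div>'
    '<span class="lLXJ">61</span>'
    '<span class="V0zO">221</span>'
    '<span class="147">.\n</span>'
    '<span style="display: inline">29</span>'
    '<span></span>.'
    '<div style="display:none">168</div>212'
    '</span></td>'
)
print(visible_text(sample))  # 190.221.29.212
```

In the scraper from the question, you would pass the markup of the IP cell (for example `str(fields[1])`) to `visible_text` instead of calling `.get_text()` on it. The same filtering could presumably be done inside Beautiful Soup by removing the hidden tags before calling `.get_text()`; this sketch only illustrates the logic.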

A simpler workaround is to use Selenium, which drives a real browser and returns only the rendered (visible) text of an element. For example, this code prints all the IPs in the table:

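from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 4 API: the find_elements_by_* helpers were removed
browser = webdriver.Firefox()
browser.get('https://www.hidemyass.com/proxy-list')

rows = browser.find_elements(By.XPATH, '//table[@id="listtable"]//tr')
for row in rows[1:]:  # skip the header row
    cells = row.find_elements(By.TAG_NAME, 'td')
    # .text returns only the text that is rendered as visible
    print(cells[1].text)

browser.quit()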


Hope that helps.

OTHER TIPS

I used the following code to extract IPs from the Hidemyass.com markup some time ago (note that this is Perl, and that parsing HTML with regular expressions is generally a bad approach):

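sub find_ip {

  my ($html) = @_;
  my $ip;

  # Grab the <style> block; /s lets "." match the newlines inside it.
  my ($style_section) = $html =~ m{<style>(.+?)</style>}s;

  # Collect the class names that are declared as display:none.
  my (@bad_styles) = $style_section =~ m/

    \.(\w+)\s*\{display:\s*none\}
  /isxg;

  my $bad_styles = join("|", @bad_styles);

  $html =~ s{<div .+? </div>}{}isxg;                        # every <div> in this markup is a hidden decoy
  $html =~ s{<span style="display:none">.+?</span>}{}g;     # spans hidden inline
  $html =~ s{<style>.+?</style>}{}s;                        # the style block itself (spans several lines)
  $html =~ s{^<span>|</span>$}{}g;                          # outer wrapper span
  $html =~ s{<span class="(?:$bad_styles)">.+?</span>}{}g;  # spans hidden via class
  $html =~ s{</?[^>]+>}{}g;                                 # strip the remaining tags
  $html =~ s{\s+}{}g;                                       # drop leftover whitespace

  $ip = $html;

  return $ip;
}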
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow