BeautifulSoup cannot distinguish visible text from hidden text in the HTML markup. This particular website obfuscates its markup heavily, which makes scraping the page more complex. You can try to work out which text is visible, but it is not easy: a lot of irrelevant elements are inserted, and they can be hidden either directly via an inline style attribute or via a CSS class. Some parts of each IP are wrapped in span elements, while others are bare text that does not belong to any tag.
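To give an idea of what the manual approach involves, here is a minimal sketch that filters out hidden elements by hand. It uses only the standard library's html.parser (not BeautifulSoup) so that it runs without dependencies. The SAMPLE markup, the class names x1 and x2, and the helper names are all made up for illustration and are not the site's real markup. It also assumes hiding is always done with display:none, which a real page is not obliged to do.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup imitating the obfuscation described above:
# decoy numbers hidden via a class or an inline style, real parts
# split between span elements and bare text.
SAMPLE = (
    '<td><style>.x1{display:none}.x2{display:inline}</style>'
    '<span class="x1">77</span><span style="display:none">13</span>'
    '<span class="x2">192</span>.<div style="display:none">5</div>'
    '168<span class="x1">44</span>.<span>0</span>.1</td>'
)

STYLE_BLOCK = re.compile(r'<style[^>]*>(.*?)</style>', re.I | re.S)
HIDDEN_CLASS_RULE = re.compile(r'\.([\w-]+)\s*\{[^}]*display\s*:\s*none', re.I)
HIDDEN_STYLE = re.compile(r'display\s*:\s*none', re.I)
VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'link'}


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside an element hidden with display:none."""

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = hidden_classes
        self.stack = []  # one flag per open element: hidden (itself or an ancestor)?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag, so keep them off the stack
        attrs = dict(attrs)
        classes = set((attrs.get('class') or '').split())
        hidden = (
            tag in ('style', 'script')
            or bool(HIDDEN_STYLE.search(attrs.get('style') or ''))
            or bool(classes & self.hidden_classes)
            or (self.stack[-1] if self.stack else False)
        )
        self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.parts.append(data)


def visible_text(html):
    # Pass 1: find class names that a <style> block hides.
    hidden_classes = set()
    for css in STYLE_BLOCK.findall(html):
        hidden_classes.update(HIDDEN_CLASS_RULE.findall(css))
    # Pass 2: walk the markup and keep only text outside hidden elements.
    parser = VisibleTextParser(hidden_classes)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip()


print(visible_text(SAMPLE))
```

Even this toy version needs two passes and its own tiny CSS matcher, and it ignores external stylesheets, other hiding tricks (visibility, zero size, off-screen positioning) and anything done by JavaScript. That is why letting a real browser decide what is visible is the more robust option.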
One workaround would be to use Selenium, which can return only the visible text of an element. For example, this code prints all the IPs in that table:
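    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    try:
        browser.get('https://www.hidemyass.com/proxy-list')
        # All rows of the proxy table; the first row is the header.
        rows = browser.find_elements(By.XPATH, '//table[@id="listtable"]//tr')
        for row in rows[1:]:
            cells = row.find_elements(By.TAG_NAME, 'td')
            # .text returns only the text the browser renders as visible.
            print(cells[1].text)
    finally:
        # Shut the browser down even if the scraping fails.
        browser.quit()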
Hope that helps.