How can I get certain text from a website with Python?

https://stackoverflow.com/questions/22880882

28-06-2023

Question

I am using a Python script to get certain text from a website (http://www.opensiteexplorer.org/). For example, trying this search: http://www.opensiteexplorer.org/links?site=www.google.com

I would like to get "Page Authority" and "Root Domains" and filter on them. I am using lxml.

I am using this code:

response = br.open( 'http://www.opensiteexplorer.org/links?site=' + blog)
tree = html.fromstring(response.read())
authority = int (tree.xpath('//span[@class="metrics-authority"]/text()')[1].strip())
if authority>1:
    print blog
    print 'This blog is ready to be registered'
    print authority
    f.write(blog +' '+ str(authority) +'\n')

Here I am filtering for a PA greater than 1, and I would also like to filter for Linking Root Domains greater than 5. How can I do that?

Solution

The page has two spans with the metrics-authority class: the first holds the Domain Authority and the second holds the Page Authority. You can also get Root Domains from the div with id="metrics-page-link-metrics":

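import urllib2
from lxml import html

# Fetch the results page and parse it into an lxml tree (Python 2).
tree = html.parse(urllib2.urlopen('http://www.opensiteexplorer.org/links?site=www.google.com'))

# Two spans carry this class: Domain Authority first, Page Authority second.
spans = tree.xpath('//span[@class="metrics-authority"]')
data = [item.text.strip() for item in spans]
print "Domain Authority: {0}, Page Authority: {1}".format(*data)

# Root Domains is read from the second "has-tooltip" div in the link-metrics div.
div = tree.xpath('//div[@id="metrics-page-link-metrics"]//div[@class="has-tooltip"]')[1]
print "Root Domains: {0}".format(div.text.strip())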

prints:

Domain Authority: 100, Page Authority: 97 
Root Domains: 680
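
To apply both thresholds from the question (PA greater than 1, Root Domains greater than 5), the two values can be converted to integers and checked together. The following is only a sketch, written for Python 3: the helper names parse_metric, extract_metrics and is_candidate are hypothetical, the XPath expressions are copied from the code above, and stripping commas from numbers is an assumption about how the page displays large values.

```python
# Sketch only: `tree` is assumed to be an lxml tree of the results page,
# built as in the answer above. All function names here are illustrative.

def parse_metric(text):
    """Turn metric text such as ' 1,234 ' into an int (assumes whole numbers)."""
    return int(text.strip().replace(",", ""))


def extract_metrics(tree):
    """Return (page_authority, root_domains) from a parsed results page."""
    spans = tree.xpath('//span[@class="metrics-authority"]')
    # Index 0 is Domain Authority, index 1 is Page Authority.
    page_authority = parse_metric(spans[1].text)

    divs = tree.xpath(
        '//div[@id="metrics-page-link-metrics"]//div[@class="has-tooltip"]'
    )
    # The answer reads Root Domains from the second matching div.
    root_domains = parse_metric(divs[1].text)
    return page_authority, root_domains


def is_candidate(page_authority, root_domains, min_pa=1, min_root_domains=5):
    """True when both metrics are strictly above their thresholds."""
    return page_authority > min_pa and root_domains > min_root_domains
```

With a tree parsed as above, is_candidate(*extract_metrics(tree)) could take the place of the "if authority>1:" check in the question, so a blog is written to the file only when both conditions hold.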

Hope that helps.

Licensed under: CC-BY-SA with attribution
