Question

I am doing a project in which I need to get information from web pages. I am using Python and ghost.py for it. I saw this code in the documentation:

links = gh.evaluate("""
                    var linksobj = document.querySelectorAll("a");
                    var links = [];
                    for (var i=0; i<linksobj.length; i++){
                        links.push(linksobj[i].value);
                    }
                    links;
                """)

This code is definitely not Python. Which language is it, and where can I learn how to configure it? Also, how can I extract a string from a tag? E.g., in:

title>this is title of the webpage
how can I get

this is title of the webpage

Thanks.


Solution

ghost.py is a webkit client. It allows you to load a web page and interact with its DOM and runtime.

This means that once you have everything installed and running, you can simply do this:

from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://stackoverflow.com/')
if page.http_status == 200:
    result, extra = ghost.evaluate('document.title;')
    print('The title is: {}'.format(result))

OTHER TIPS

Use requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.google.com/")
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly to avoid a warning
In [3]: soup.title.string
Out[3]: u'Google'
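
If installing third-party packages is not an option, roughly the same result can be had with the standard library alone. The following is only a minimal sketch, assuming Python 3 and a page whose HTML is already available as a string; `TitleParser`, `extract_title`, and the `sample` markup are hypothetical names made up for this illustration, not part of requests, BeautifulSoup, or ghost.py.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped <title> text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical sample markup standing in for a downloaded page.
sample = "<html><head><title>this is title of the webpage</title></head><body></body></html>"
print(extract_title(sample))
```

For real-world pages BeautifulSoup is usually the more forgiving choice, since it copes better with malformed markup.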

Edit: After looking at the answer from Padraic Cunningham, it seems that I unfortunately misunderstood your question. Anyhow, I'll leave my answer here for future reference, or maybe for downvotes. :P

If the output you receive is a string, then common string operations in Python are enough to achieve the desired output you mentioned in your question.

You receive: title>this is title of the webpage

You desire: this is title of the webpage

Assuming the output you receive is always in the same format, you can use the following string operation to get your desired output. Using the split operation:

>>> s = 'title>this is title of the webpage'
>>> p = s.split('>')
>>> p
['title', 'this is title of the webpage']
>>> p[1]
'this is title of the webpage'

Here p is a list, so you have to access the element that contains your desired output.

Or, more simply, take a substring:

>>> s = 'title>this is title of the webpage'
>>> p = s[6:]
>>> p
'this is title of the webpage'

p = s[6:] in the snippet above means you want the part of title>this is title of the webpage that starts at the 7th character (index 6) and runs to the end. In other words, you are skipping the first 6 characters, title>.

If the output you receive is not always in the same format then you may prefer using regular expressions.
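
As a rough illustration of the regular-expression route, here is a minimal sketch, assuming Python 3 and that the wanted text always follows a tag name ending in `>`. The helper name `text_after_tag` and the pattern itself are assumptions made for this example, so adjust them to whatever your real output looks like.

```python
import re


def text_after_tag(s):
    """Return the text that follows the first 'tag>' or '<tag>' in s, or None."""
    # Optional '<', a tag name, '>', then capture everything up to the next '<'.
    match = re.search(r"<?\w+>([^<]*)", s)
    return match.group(1).strip() if match else None


print(text_after_tag("title>this is title of the webpage"))
print(text_after_tag("<title>this is title of the webpage</title>"))
```

Both calls should give the same text, because the pattern tolerates a missing leading `<` and stops at a closing tag if one is present. For anything beyond simple, predictable strings, a proper HTML parser is generally more reliable than a regular expression.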

Your second question is already answered in the comments section. I hope I understood your questions correctly.

Licensed under: CC-BY-SA with attribution