Question

I am using the following snippet to get links from the Google search results for the "keyword" I give.

import mechanize
from bs4 import BeautifulSoup
import re


def googlesearch():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0')] 
    br.open('http://www.google.com/')   

    # do the query
    br.select_form(name='f')   
    br.form['q'] = 'scrapy' # query
    data = br.submit()
    soup = BeautifulSoup(data.read())
    for a in soup.find_all('a', href=True):
        print "Found the URL:", a['href']
googlesearch()

Since I am parsing the search results HTML page to get links, it's getting all the 'a' tags. But what I need is to get only the links for the results. Another thing: when you look at the output of the href attribute, it gives something like this

Found the URL: /search?q=scrapy&hl=en-IN&gbv=1&prmd=ivns&source=lnt&tbs=li:1&sa=X&ei=DT8HU9SlG8bskgWvqIHQAQ&ved=0CBgQpwUoAQ

But the actual link present in the href attribute is http://scrapy.org/

Can anyone point me to the solution for the two questions mentioned above?

Thanks in advance

Was it helpful?

Solution

Get only the links for the results

The links you're interested in are inside the h3 tags (with the r class):

<li class="g">
  <h3 class="r">
    <a href="/url?q=http://scrapy.org/&amp;sa=U&amp;ei=XdIUU8DOHo-ElAXuvIHQDQ&amp;ved=0CBwQFjAA&amp;usg=AFQjCNHVtUrLoWJ8XWAROG-a4G8npQWXfQ">
      <b>Scrapy</b> | An open source web scraping framework for Python
    </a>
  </h3>
  ..

You can find the links using a CSS selector:

soup.select('.r a')

Get the actual link

URLs are in the following format:

/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ
     ^^^^^^^^^^^^^^^^^^^^

The actual URL is in the q parameter.

To get the entire query string, use urlparse.urlparse:

>>> import urlparse
>>> url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
>>> urlparse.urlparse(url).query
'q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'

Then, use urlparse.parse_qs to parse the query string and extract the q parameter value:

>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
['http://scrapy.org/']
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q'][0]
'http://scrapy.org/'
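
Note that in Python 3 these functions live in the urllib.parse module instead of urlparse; an equivalent sketch:

from urllib.parse import urlparse, parse_qs

url = '/url?q=http://scrapy.org/&sa=U&ei=s9YUU9TZH8zTkQWps4BY&ved=0CBwQFjAA&usg=AFQjCNE-2uiVSl60B9cirnlWz2TMv8KMyQ'
print(parse_qs(urlparse(url).query)['q'][0])  # -> http://scrapy.org/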

Final result

import urlparse

for a in soup.select('.r a'):
    print urlparse.parse_qs(urlparse.urlparse(a['href']).query)['q'][0]

output:

http://scrapy.org/
http://doc.scrapy.org/en/latest/intro/tutorial.html
http://doc.scrapy.org/
http://scrapy.org/download/
http://doc.scrapy.org/en/latest/intro/overview.html
http://scrapy.org/doc/
http://scrapy.org/companies/
https://github.com/scrapy/scrapy
http://en.wikipedia.org/wiki/Scrapy
http://www.youtube.com/watch?v=1EFnX1UkXVU
https://pypi.python.org/pypi/Scrapy
http://pypix.com/python/build-website-crawler-based-upon-scrapy/
http://scrapinghub.com/scrapy-cloud

Other tips

Or you could use https://code.google.com/p/pygoogle/, which basically does the same thing.

And you can get links to results as well.
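
Usage is roughly as follows (a minimal sketch based on pygoogle's README; it is Python 2 code built on the old Google AJAX Search API, so treat it as illustrative):

from pygoogle import pygoogle

g = pygoogle('stackoverflow')  # the search query
g.pages = 1                    # number of result pages to fetch

print '*Found %s results*' % (g.get_result_count())
g.display_results()            # prints title, description and URL for each result
print g.get_urls()             # just the list of result URLs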

A snippet of output from a sample query for 'stackoverflow':

*Found 3940000 results*
[Stack Overflow]
Stack Overflow is a question and answer site for professional and enthusiast 
programmers. It's 100% free, no registration required. Take the 2-minute tour
http://stackoverflow.com/

In your code example you were extracting all <a> tags from the HTML, not only those related to organic results:

for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']

You're looking for something like this to grab links from organic results only:

# container with needed data: title, link, etc.
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'minecraft',
  'gl': 'us',
  'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)

---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to build everything from scratch, bypass blocks, or maintain the parser over time.

Code to integrate to achieve your goal:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "minecraft",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])

---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Disclaimer: I work for SerpApi.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow