I am trying to scrape the PDF links from Google Scholar search results. I have set a page counter based on the change in the URL, but after the first eight output links I keep getting the same links repeated.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests


#modifying the url as per page
urlCounter = 0
while urlCounter <=30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url,None,{"User-Agent":"Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <=9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount+1

    #printing the links
        for link in soup.find_all('div', id = finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
        for link in soup1.find_all('a'):
            print(link.get('href'))

Solution

Change the following line in your code:

finRecord = recordPart1 + str(recordCount)

To

finRecord = recordPart1 + str(recordCount+urlCounter-10)

The real problem: the div ids on the first page are gs_ggsW[0-9], but the ids on the second page are gs_ggsW[10-19], so BeautifulSoup finds no links on the second page.
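To see why the -10 is needed: by the time the inner loop runs, urlCounter has already been advanced by 10, so the offset has to be pulled back to the page that was just fetched. A small illustration of which ids the fixed formula produces per page:

# Illustration only: the div ids the fixed inner loop requests on each page.
# urlCounter has already been advanced by 10 before the inner loop runs,
# which is why the formula subtracts 10 again.
for urlCounter in (10, 20, 30, 40):
    ids = ["gs_ggsW" + str(recordCount + urlCounter - 10) for recordCount in range(10)]
    print(ids[0], "...", ids[-1])
# gs_ggsW0 ... gs_ggsW9
# gs_ggsW10 ... gs_ggsW19
# gs_ggsW20 ... gs_ggsW29
# gs_ggsW30 ... gs_ggsW39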

Python's variable scoping may confuse people coming from other languages such as Java. After the for loop below has executed, the variable link still exists, so on later pages it still refers to the last link found on the first page.

for link in soup1.find_all('a'):
    print(link.get('href'))
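A minimal, self-contained illustration of that scoping behaviour (plain strings stand in for the parsed tags):

# The loop variable outlives the loop; using it afterwards silently reuses
# the last value instead of raising an error.
for link in ["a.pdf", "b.pdf", "c.pdf"]:
    pass

print(link)  # prints "c.pdf" - still bound after the loop has finished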

Updates:

Google may not provide a PDF download link for every paper, so you can't rely on the id to match each paper's link. You can use CSS selectors to match all the links at once.

soup = BeautifulSoup(html)
urlCounter = urlCounter + 10
for link in soup.select('div.gs_ttss a'):
    print(link.get('href'))
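Folded back into the original paging loop, this could look roughly like the sketch below. It reuses the question's own URL parameters and the div.gs_ttss a selector from above; the User-Agent string is just a placeholder, and Google Scholar may still throttle or CAPTCHA repeated requests.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA; Scholar may still block requests

# Same four pages as the original urlCounter loop: start = 0, 10, 20, 30
for start in range(0, 40, 10):
    url = ("http://scholar.google.com/scholar?start=" + str(start)
           + "&q=%22entity+resolution%22&hl=en&as_sdt=0,4")
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, "html.parser")
    # grab every link inside the gs_ttss block, PDF or otherwise
    for link in soup.select("div.gs_ttss a"):
        print(link.get("href"))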

Other tips

Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.

Code example to extract the PDF links:

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "entity resolution", # search query
    "hl": "en"                # language
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for pdf_link in soup.select(".gs_or_ggsm a"):
    pdf_file_link = pdf_link["href"]
    print(pdf_file_link)


# output from the first page:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
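The snippet above only fetches the first page. Since the original question is about paging, the same request can be repeated with a start offset, the same parameter used in the question's URLs. A rough sketch (Scholar may throttle or show a CAPTCHA if pages are requested too quickly):

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

# Walk the first four result pages via the "start" offset: 0, 10, 20, 30
for start in range(0, 40, 10):
    params = {"q": "entity resolution", "hl": "en", "start": start}
    html = requests.get("https://scholar.google.com/scholar",
                        params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    for pdf_link in soup.select(".gs_or_ggsm a"):
        print(pdf_link["href"])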

Alternatively, you can achieve the same thing with the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to grab the data from structured JSON instead of figuring out how to extract it from HTML or how to bypass blocks from search engines.

Code to integrate:

from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",   # SerpApi API key
    "engine": "google_scholar",  # Google Scholar organic reuslts
    "q": "entity resolution",    # search query
    "hl": "en"                   # language
}

search = GoogleSearch(params)
results = search.get_dict()

for pdfs in results["organic_results"]:
    for link in pdfs.get("resources", []):
        pdf_link = link["link"]
        print(pdf_link)


# output:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''

If you want to scrape more data from the organic results, I have a dedicated Scrape Google Scholar with Python blog post.

Disclaimer: I work for SerpApi.
