I'd like to use Python to scrape Google Scholar search results. I found two different scripts to do that: one is gscholar.py and the other is scholar.py (can that one be used as a Python library?).

Now, I should maybe say that I'm totally new to Python, so sorry if I miss the obvious!

The problem is that when I use gscholar.py as explained in the README file, I get this result:

query() takes at least 2 arguments (1 given).

Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get

query() takes at least 2 arguments (2 given).

This puzzles me. I also tried to specify the third possible argument (outformat=4, which is the BibTeX format), but this gives me a list of function errors. A colleague advised me to import BeautifulSoup and this before running the query, but that doesn't solve the problem either. Any suggestions on how to solve it?

I found code for R (see link) as a solution but quickly got blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!


Solution

I suggest that you not use site-specific libraries for crawling particular websites, but instead use general-purpose HTML libraries that are well tested and well documented, such as BeautifulSoup.

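To access websites while identifying as a browser, you can send a custom User-Agent header with each request:

# Python 3; on Python 2 the equivalent was a urllib.FancyURLopener subclass with a custom `version`
from urllib.request import Request, urlopen

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

def openurl(url):
    # Attach the browser-like User-Agent to the request before opening it
    return urlopen(Request(url, headers={'User-Agent': USER_AGENT}))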

And then download the required url as follows:

openurl(url).read()

To retrieve Scholar results, use a URL of the form http://scholar.google.se/scholar?hl=en&q=${query}, where ${query} is your search string.
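If the query contains spaces or special characters, it needs to be URL-encoded first. Here is a minimal sketch using only the standard library; the helper name build_scholar_url is made up for illustration, and the host is simply the one from the URL above:

```python
from urllib.parse import urlencode

def build_scholar_url(query, lang="en"):
    # urlencode escapes spaces, quotes, etc. so the query is safe in a URL
    params = urlencode({"hl": lang, "q": query})
    return "http://scholar.google.se/scholar?" + params

url = build_scholar_url("deep learning")
print(url)
```

The resulting string can be passed to whatever function you use to download the page.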

To extract pieces of information from a retrieved HTML file, you could use this piece of code:

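from bs4 import SoupStrainer, BeautifulSoup
# Name the parser explicitly; parse_only restricts parsing to the one div we care about
page = BeautifulSoup(openurl(url).read(), 'html.parser', parse_only=SoupStrainer('div', id='gs_ab_md'))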

This piece of code extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
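Once you have that div, its text still has to be turned into a number. Here is a rough sketch, assuming the text looks something like "About 1,230,000 results (0.05 sec)"; the exact wording is an assumption and may differ by locale, and parse_result_count is a made-up helper name:

```python
import re

def parse_result_count(text):
    # Find the first run of digits, allowing thousands separators
    match = re.search(r"\d[\d,.]*", text)
    if match is None:
        return None
    # Drop the separators before converting to int
    return int(re.sub(r"[,.]", "", match.group(0)))

print(parse_result_count("About 1,230,000 results (0.05 sec)"))
```

With BeautifulSoup you would feed it the text of the parsed element, e.g. the result of get_text().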

Other tips

Google will block you... as it will be apparent you aren't a browser. Namely, they will detect the same request signature occurring too frequently compared with reasonable human activity.
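One way to look a little less like a script is to space requests out with an irregular pause. This is only a sketch of the idea, not a guarantee against blocking; the function name and the delay bounds are arbitrary assumptions:

```python
import random
import time

def polite_pause(min_seconds=5.0, max_seconds=15.0, sleep=time.sleep):
    # Pick a random delay so requests do not arrive at a fixed rhythm
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Example: pause between fetching consecutive result pages
# for url in urls:
#     html = fetch(url)   # fetch is whatever download function you use
#     polite_pause()
```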

You can do:


Edit 2020:

You might want to check scholarly

>>> search_query = scholarly.search_author('Marty Banks, Berkeley')
>>> print(next(search_query))
{'_filled': False,
 'affiliation': 'Professor of Vision Science, UC Berkeley',
 'citedby': 17758,
 'email': '@berkeley.edu',
 'id': 'Smr99uEAAAAJ',
 'interests': ['vision science', 'psychology', 'human factors', 'neuroscience'],
 'name': 'Martin Banks',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=Smr99uEAAAAJ'}

It looks like scraping with Python and R runs into the problem that Google Scholar treats your request as a robot query because the request lacks a user-agent. There is a similar question on StackExchange about downloading all PDFs linked from a web page, and the answer points the user to wget on Unix and the BeautifulSoup package in Python.

Curl also seems to be a more promising direction.

COPython looks correct but here's a bit of an explanation by example...

Consider f:

def f(a,b,c=1):
    pass

f expects values for a and b no matter what. You can leave c blank.

f(1,2)     #executes fine
f(a=1,b=2) #executes fine
f(1,c=1)   #TypeError: f() takes at least 2 arguments (2 given)

The fact that you are being blocked by Google is probably due to your user-agent settings in your header... I am unfamiliar with R but I can give you the general algorithm for fixing this:

  1. use a normal browser (firefox or whatever) to access the url while monitoring HTTP traffic (I like wireshark)
  2. take note of all headers sent in the appropriate http request
  3. try running your script and also note the headers
  4. spot the difference
  5. set your R script to make use of the headers you saw when examining browser traffic
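Step 4 can be done by eye, but a few lines of Python will do the comparison too. This is just an illustrative sketch; header_differences is a made-up helper and the header values are placeholders, not captured traffic:

```python
def header_differences(browser_headers, script_headers):
    # Compare case-insensitively, since HTTP header names ignore case
    browser = {k.lower(): v for k, v in browser_headers.items()}
    script = {k.lower(): v for k, v in script_headers.items()}
    diffs = {}
    for name, value in browser.items():
        if script.get(name) != value:
            # (what the browser sent, what the script sent, or None if missing)
            diffs[name] = (value, script.get(name))
    return diffs

# Placeholder values for illustration only
browser = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US"}
script = {"user-agent": "Python-urllib/3.11"}
print(header_differences(browser, script))
```

Anything that shows up in the result is a header your script either sends differently or does not send at all.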

here is the call signature of query()...

def query(searchstr, outformat, allresults=False)

thus you need to specify a searchstr AND an outformat at least, and allresults is an optional flag/argument.
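To see this in action without touching the network, here is a stand-in function with the same signature as the one quoted above. It is not the real gscholar code, it just echoes its arguments; the value 4 for outformat is the BibTeX setting mentioned in the question:

```python
def query(searchstr, outformat, allresults=False):
    # Stand-in with the same signature; it only returns what it was given
    return (searchstr, outformat, allresults)

print(query("my query", 4))                   # both required arguments given
print(query("my query", 4, allresults=True))  # optional flag added

try:
    query("my query", allresults=True)        # outformat is missing
except TypeError as err:
    print("TypeError:", err)
```

The last call fails for the same reason as the one in the question: allresults is supplied, but outformat is not.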

You may want to use Greasemonkey for this task. The advantage is that the requests come from a real browser, so Google is unlikely to detect you as a bot as long as you also keep the request frequency down. You can also watch the script working in your browser window.

You can learn to code it yourself or use a script from one of these sources.

You can use the google-search-results package to extract data from Google Scholar. It uses SerpApi, which is a paid API with a free trial.

Full example

from serpapi import GoogleSearch
import os

params = {
    "engine": "google_scholar",
    "q": "coffee",
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results\n")

for result in data['organic_results']:
    print(f"""Title: {result['title']}
Result ID: {result['result_id']}
Link: {result['link']}

Snippet: {result['snippet']}
""")

    if 'resources' in result:
        print(f"Resource: {result['resources'][0]}")

Response

{
  "organic_results": [
    {
      "position": 0,
      "title": "Phenolic compounds in coffee",
      "result_id": "re9ssrU-exUJ",
      "type": "Html",
      "link": "http://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext",
      "snippet": "Phenolic compounds are secondary metabolites generally involved in plant adaptation to environmental stress conditions. Chlorogenic acids (CGA) and related compounds are the main components of the phenolic fraction of green coffee beans, reaching levels up to …",
      "publication_info": {
        "summary": "A Farah, CM Donangelo - Brazilian journal of plant physiology, 2006 - SciELO Brasil"
      },
      "resources": [
        {
          "title": "scielo.br",
          "file_format": "HTML",
          "link": "http://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext"
        }
      ],
      "inline_links": {
        "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=re9ssrU-exUJ",
        "html_version": "https://scholar.google.comhttp://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext",
        "cited_by": {
          "total": 608,
          "link": "https://scholar.google.com/scholar?cites=1547899847035383725&as_sdt=5,44&sciodt=0,44&hl=en",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cites=1547899847035383725&engine=google_scholar&hl=en&q=Coffee"
        },
        "related_pages_link": "https://scholar.google.com/scholar?q=related:re9ssrU-exUJ:scholar.google.com/&scioq=Coffee&hl=en&as_sdt=0,44",
        "versions": {
          "total": 6,
          "link": "https://scholar.google.com/scholar?cluster=1547899847035383725&hl=en&as_sdt=0,44",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cluster=1547899847035383725&engine=google_scholar&hl=en&q=Coffee"
        },
        "cached_page_link": "https://scholar.google.comhttp://scholar.googleusercontent.com/scholar?q=cache:re9ssrU-exUJ:scholar.google.com/+Coffee&hl=en&as_sdt=0,44"
      }
    },
    {
      "position": 1,
      "title": "Functional properties of coffee and coffee by-products",
      "result_id": "9WouRiFbIK4J",
      "link": "https://www.sciencedirect.com/science/article/pii/S0963996911003449",
      "snippet": "Coffee, one of the most popular beverages, is consumed by millions of people every day. Traditionally, coffee beneficial effects have been attributed solely to its most intriguing and investigated ingredient, caffeine, but it is now known that other compounds also contribute to …",
      "publication_info": {
        "summary": "P Esquivel, VM Jiménez - Food Research International, 2012 - Elsevier",
        "authors": [
          {
            "name": "P Esquivel",
            "link": "https://scholar.google.com/citations?user=EpwJXskAAAAJ&hl=en&oi=sra"
          },
          {
            "name": "VM Jiménez",
            "link": "https://scholar.google.com/citations?user=_P0h0B8AAAAJ&hl=en&oi=sra"
          }
        ]
      },
      "resources": [
        {
          "title": "uoregon.edu",
          "file_format": "PDF",
          "link": "https://pages.uoregon.edu/chendon/coffee_literature/2012%20Food%20Res.%20Int.,%20Uses%20for%20coffee%20waste.pdf"
        }
      ],
      "inline_links": {
        "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=9WouRiFbIK4J",
        "cited_by": {
          "total": 531,
          "link": "https://scholar.google.com/scholar?cites=12547128760323697397&as_sdt=5,44&sciodt=0,44&hl=en",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cites=12547128760323697397&engine=google_scholar&hl=en&q=Coffee"
        },
        "related_pages_link": "https://scholar.google.com/scholar?q=related:9WouRiFbIK4J:scholar.google.com/&scioq=Coffee&hl=en&as_sdt=0,44",
        "versions": {
          "total": 9,
          "link": "https://scholar.google.com/scholar?cluster=12547128760323697397&hl=en&as_sdt=0,44",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cluster=12547128760323697397&engine=google_scholar&hl=en&q=Coffee"
        }
      }
    },
    {
      "position": 2,
      "title": "Coffee constituents",
      "result_id": "xY3q9qnkN54J",
      "link": "https://books.google.com/books?hl=en&lr=&id=y0qA89vCr3MC&oi=fnd&pg=PT47&dq=Coffee&ots=pyKSUohpI7&sig=8qULQFDS2RydGAkXlRyVJoph4AU",
      "snippet": "Coffee has been for decades the most commercialized food product and most widely consumed beverage in the world. Since the opening of the first coffee house in Mecca at the end of the fifteenth century, coffee consumption has greatly increased all around the world …",
      "publication_info": {
        "summary": "A Farah - Coffee: Emerging health effects and disease …, 2012 - books.google.com"
      },
      "resources": [
        {
          "title": "academia.edu",
          "file_format": "PDF",
          "link": "http://www.academia.edu/download/52419982/IFTPressBook_Coffee_PreviewChapter.pdf"
        }
      ],
      "inline_links": {
        "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=xY3q9qnkN54J",
        "cited_by": {
          "total": 255,
          "link": "https://scholar.google.com/scholar?cites=11400832400354872773&as_sdt=5,44&sciodt=0,44&hl=en",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cites=11400832400354872773&engine=google_scholar&hl=en&q=Coffee"
        },
        "related_pages_link": "https://scholar.google.com/scholar?q=related:xY3q9qnkN54J:scholar.google.com/&scioq=Coffee&hl=en&as_sdt=0,44",
        "versions": {
          "total": 7,
          "link": "https://scholar.google.com/scholar?cluster=11400832400354872773&hl=en&as_sdt=0,44",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cluster=11400832400354872773&engine=google_scholar&hl=en&q=Coffee"
        }
      }
    }
  ]
}
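Given a response shaped like the one above, nested fields such as the citation count can be pulled out defensively, since not every result carries every key. This is a small sketch; citation_counts is a made-up helper, and the second sample entry is invented to show the missing-key case:

```python
def citation_counts(data):
    # Map each result title to its "cited by" total, or None if absent
    counts = {}
    for result in data.get("organic_results", []):
        cited_by = result.get("inline_links", {}).get("cited_by", {})
        counts[result.get("title")] = cited_by.get("total")
    return counts

sample = {
    "organic_results": [
        {"title": "Phenolic compounds in coffee",
         "inline_links": {"cited_by": {"total": 608}}},
        # Hypothetical entry without citation data
        {"title": "No citations listed", "inline_links": {}},
    ]
}
print(citation_counts(sample))
```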

Output

Organic results

Title: Phenolic compounds in coffee
Result ID: re9ssrU-exUJ
Link: http://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext

Title: Functional properties of coffee and coffee by-products
Result ID: 9WouRiFbIK4J
Link: https://www.sciencedirect.com/science/article/pii/S0963996911003449

Title: Coffee constituents
Result ID: xY3q9qnkN54J
Link: https://books.google.com/books?hl=en&lr=&id=y0qA89vCr3MC&oi=fnd&pg=PT47&dq=coffee&ots=pyKSUokkMc&sig=sjDv_w50O-5_svJDJKPJ7hHJtRg

Title: All about coffee
Result ID: fGeQlvu-2_IJ
Link: https://books.google.com/books?hl=en&lr=&id=oJxpQX4ko7cC&oi=fnd&pg=PT1&dq=coffee&ots=Oih_E-45Y-&sig=KYyBOoSXwRdwOv5upyWwl0FzIq8

Title: Biotechnological potential of coffee pulp and coffee husk for bioprocesses
Result ID: Zu7aKNjvAUwJ
Link: https://www.sciencedirect.com/science/article/pii/S1369703X0000084X

Title: Biodiversity conservation in traditional coffee systems of Mexico
Result ID: pIjQPO7__AYJ
Link: https://conbio.onlinelibrary.wiley.com/doi/abs/10.1046/j.1523-1739.1999.97153.x

Title: Coffee flavor chemistry
Result ID: UwtLySK5iawJ
Link: https://books.google.com/books?hl=en&lr=&id=NQi1LYJxFvUC&oi=fnd&pg=PP13&dq=coffee&ots=dRSace3WYu&sig=5jyqtvqkL_jGDkWTLsLqksKiQUw

Title: Coffee and health: a review of recent human research
Result ID: fSVlrXX7dIUJ
Link: https://www.tandfonline.com/doi/abs/10.1080/10408390500400009

Title: M-Coffee: combining multiple sequence alignment methods with T-Coffee
Result ID: _3o-xhuGyg0J
Link: https://academic.oup.com/nar/article-abstract/34/6/1692/2401531

Title: Producing decaffeinated coffee plants
Result ID: VJySkcFsQ1EJ
Link: https://www.nature.com/articles/423823a

If you want more information, check out the SerpApi documentation or the live playground.

Disclosure: I work at SerpApi.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow