I am trying to scrape the PDF links from Google Scholar search results. I have set a page counter based on the change in the URL, but after the first eight output links I keep getting the same links repeated.

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests


#modifying the url as per page
urlCounter = 0
while urlCounter <=30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url,None,{"User-Agent":"Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <=9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount+1

    #printing the links
        for link in soup.find_all('div', id = finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
        for link in soup1.find_all('a'):
            print(link.get('href'))

Solution

Change the following line in your code:

finRecord = recordPart1 + str(recordCount)

To

finRecord = recordPart1 + str(recordCount+urlCounter-10)

The real problem: the div ids on the first page are gs_ggsW[0-9], but the ids on the second page are gs_ggsW[10-19], so BeautifulSoup finds no links on the second page.
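To see why the -10 is needed: by the time the inner loop runs, urlCounter has already been advanced by 10, so the offset has to be pulled back to the page that was just fetched. A small illustration of which ids the fixed formula produces per page:

# Illustration only: the div ids the fixed inner loop requests on each page.
# urlCounter has already been advanced by 10 before the inner loop runs,
# which is why the formula subtracts 10 again.
for urlCounter in (10, 20, 30, 40):
    ids = ["gs_ggsW" + str(recordCount + urlCounter - 10) for recordCount in range(10)]
    print(ids[0], "...", ids[-1])
# gs_ggsW0 ... gs_ggsW9
# gs_ggsW10 ... gs_ggsW19
# gs_ggsW20 ... gs_ggsW29
# gs_ggsW30 ... gs_ggsW39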

Python's variable scoping may confuse people coming from other languages such as Java. After the for loop below has executed, the variable link still exists, so on later pages it still refers to the last link found on the first page.

for link in soup1.find_all('a'):
    print(link.get('href'))
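A minimal, self-contained illustration of that scoping behaviour (plain strings stand in for the parsed tags):

# The loop variable outlives the loop; using it afterwards silently reuses
# the last value instead of raising an error.
for link in ["a.pdf", "b.pdf", "c.pdf"]:
    pass

print(link)  # prints "c.pdf" - still bound after the loop has finished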

Updates:

Google may not provide a PDF download link for every paper, so you can't rely on the id to match each paper's link. You can use CSS selectors to match all the links at once.

soup = BeautifulSoup(html)
urlCounter = urlCounter + 10
for link in soup.select('div.gs_ttss a'):
    print(link.get('href'))
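Folded back into the original paging loop, this could look roughly like the sketch below. It reuses the question's own URL parameters and the div.gs_ttss a selector from above; the User-Agent string is just a placeholder, and Google Scholar may still throttle or CAPTCHA repeated requests.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder UA; Scholar may still block requests

# Same four pages as the original urlCounter loop: start = 0, 10, 20, 30
for start in range(0, 40, 10):
    url = ("http://scholar.google.com/scholar?start=" + str(start)
           + "&q=%22entity+resolution%22&hl=en&as_sdt=0,4")
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, "html.parser")
    # grab every link inside the gs_ttss block, PDF or otherwise
    for link in soup.select("div.gs_ttss a"):
        print(link.get("href"))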

Other tips

Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.

Code example to extract the PDF links:

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "entity resolution", # search query
    "hl": "en"                # language
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for pdf_link in soup.select(".gs_or_ggsm a"):
    pdf_file_link = pdf_link["href"]
    print(pdf_file_link)


# output from the first page:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
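The snippet above only fetches the first page. Since the original question is about paging, the same request can be repeated with a start offset, the same parameter used in the question's URLs. A rough sketch (Scholar may throttle or show a CAPTCHA if pages are requested too quickly):

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

# Walk the first four result pages via the "start" offset: 0, 10, 20, 30
for start in range(0, 40, 10):
    params = {"q": "entity resolution", "hl": "en", "start": start}
    html = requests.get("https://scholar.google.com/scholar",
                        params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    for pdf_link in soup.select(".gs_or_ggsm a"):
        print(pdf_link["href"])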

Alternatively, you can achieve the same thing with the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to grab the data from structured JSON instead of figuring out how to extract it from HTML or how to bypass blocks from search engines.

Code to integrate:

from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",   # SerpApi API key
    "engine": "google_scholar",  # Google Scholar organic reuslts
    "q": "entity resolution",    # search query
    "hl": "en"                   # language
}

search = GoogleSearch(params)
results = search.get_dict()

for pdfs in results["organic_results"]:
    for link in pdfs.get("resources", []):
        pdf_link = link["link"]
        print(pdf_link)


# output:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''

If you want to scrape more data from the organic results, I have a dedicated Scrape Google Scholar with Python blog post.

Disclaimer: I work for SerpApi.
