With Google CSE, you specify the site in your CSE configuration (the search engine identified by your 'cx' parameter), not with the 'site:' operator in the query. In the 'basics' tab of your CSE you should see a section called "Sites to search".
How to check if a URL is indexed by Google using the Google Custom Search API and Python?
29-06-2023
Problem
I need to check whether some URLs are indexed by Google, using a Python script and Google Custom Search. I'd like the script to return the same results I get when I search for site:www.example.it in my browser. My code is:
import urllib2
import json
import pprint
data = urllib2.urlopen('https://www.googleapis.com/customsearch/v1?key=AIzaSyA3xNw1doOc4rjoUGc7sq1gltQvOgalHqA&cx=017576662512468239146:omuauf_lfve&q=site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1')
data=json.load(data)
print data
The output of this is:
{ u'kind': u'customsearch#search',
u'queries': { u'request': [ { u'count': 10,
u'cx': u'017576662512468239146:omuauf_lfve',
u'inputEncoding': u'utf8',
u'outputEncoding': u'utf8',
u'safe': u'off',
u'searchTerms': u'site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1',
u'title': u'Google Custom Search - site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1',
u'totalResults': u'0'}]},
u'searchInformation': { u'formattedSearchTime': u'0.55',
u'formattedTotalResults': u'0',
u'searchTime': 0.552849,
u'totalResults': u'0'},
u'url': { u'template': u'https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json',
u'type': u'application/json'}}
As you can see, there are no "items" in the response, whereas if you google for site:http://www.repubblica.it/politica/2014/04/07/news/governo_e_patto_su_italicum_brunetta_a_renzi_riforma_elettorale_entro_pasqua_o_si_dimetta-82947958/?ref=HREA-1 you get at least one result.
After various experiments, it seems that Google Custom Search doesn't work for site:website queries.
Do you know of any solution or alternative to this problem? Thanks.
Solution
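The advice is to let the CSE configuration behind 'cx' carry the site restriction, rather than a 'site:' prefix in 'q'. Below is a minimal Python 3 sketch of that idea, assuming the CSE is already configured with the relevant site under "Sites to search". The function names (`build_request_url`, `has_result_for`, `is_indexed`) are hypothetical and not part of any library. Whether querying for the bare URL returns the page is an assumption you should verify against your own engine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_url(api_key, cx, page_url):
    """Build a Custom Search request that queries for the page URL itself."""
    # No 'site:' prefix here: the sites to search come from the CSE behind `cx`.
    # urlencode also percent-encodes the ':', '/' and '?' inside page_url.
    params = {"key": api_key, "cx": cx, "q": page_url}
    return API_ENDPOINT + "?" + urlencode(params)


def has_result_for(response, page_url):
    """Return True if any returned item links to exactly page_url."""
    # The 'items' key is absent when the search returns no results.
    return any(item.get("link") == page_url
               for item in response.get("items", []))


def is_indexed(api_key, cx, page_url):
    """Call the API and report whether page_url appears in the results."""
    with urlopen(build_request_url(api_key, cx, page_url)) as resp:
        return has_result_for(json.load(resp), page_url)
```

Calling `is_indexed(my_key, my_cx, some_url)` performs one API request per URL. The exact-match comparison on "link" is a deliberately strict choice: relax it (for example, by ignoring a trailing query string) if the indexed URL may differ slightly from the one you pass in.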
Other tips
The URLs are read from an Excel file:
import time
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup

seconds = 3  # pause between requests
# Optional local proxy; pass proxies=proxies to requests.get() to use it
proxies = {
    'https': 'https://localhost:8123',
    'http': 'http://localhost:8123'
}
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
headers = {'User-Agent': user_agent}

df = pd.read_excel('url_links.xlsx')  # expects a 'links' column
for i in range(0, len(df)):
    line = df.loc[i, 'links']
    # Empty cells are read as NaN (which is truthy), so check the type too
    if isinstance(line, str) and line:
        query = {'q': 'site:' + line}
        google = "https://www.google.com/search?" + urlencode(query)
        data = requests.get(google, headers=headers)
        # Pass the raw bytes and let BeautifulSoup detect the encoding
        soup = BeautifulSoup(data.content, "html.parser")
        try:
            # First result link inside the results container (id="rso");
            # this chain depends on the markup of Google's results page
            check = soup.find(id="rso").find("div").find("div").find("div").find("div").find("div").find("a")["href"]
            print("URL is indexed")
        except (AttributeError, TypeError):
            # A missing element in the chain means no result was found
            print("URL is not indexed")
        time.sleep(float(seconds))
    else:
        print("Invalid URL")