Question

I am interacting with a search engine programmatically and I need to trick it into thinking that I am a human making queries, as opposed to a robot. This involves generating queries that an ordinary user might plausibly search for, like "ncaa football schedule" or "When was the lunar landing?" I'll be making over a thousand of these queries daily, and searching for random words out of a dictionary won't cut it, since that's not a typical search habit.

So far I have thought of a few ways to generate realistic queries:

  • Obtain a list of the top Google (or Yahoo, Bing, etc.) searches for the day.
  • Make use of Google's autocomplete feature by entering a random word from the dictionary followed by a space and scraping the recommended queries.

The latter approach sounds like it would involve a lot of reverse engineering. And with the former approach, I've been unable to find a list of more than 80 or so queries; the only sources I've found are AOL trends (50-100) and Google Trends (30).

How might I go about generating a large set of human-like search phrases?
(For any language-dependent answers: I'm programming in Python)

Solution

Although this most likely breaks Google's TOS, you can scrape the autocomplete data easily:

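import json

import requests

def autocomplete(query, depth=1, lang='en'):
    """Yield autocomplete suggestions for query, recursing depth levels deep."""
    if depth <= 0:
        return

    response = requests.get('https://clients1.google.com/complete/search', params={
        'client': 'hp',
        'hl': lang,
        'q': query
    }, timeout=10).text

    # The JSON payload is wrapped in a function call, so keep only
    # the text between the first '(' and the last ')'.
    data = response[response.index('(') + 1:response.rindex(')')]
    o = json.loads(data)

    for result in o[1]:
        # Suggestions carry <b> highlighting tags; strip them.
        suggestion = result[0].replace('<b>', '').replace('</b>', '')
        yield suggestion

        if depth > 1:
            # Feed each suggestion back in to go one level deeper.
            yield from autocomplete(suggestion, depth - 1, lang)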

autocomplete('a', depth=2) is a generator that yields the top 110 queries that start with "a" (with some duplicates). Scrape each letter to a depth of 2, and you should have a ton of legitimate queries to choose from.
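Scraping every letter and dropping the duplicates could be sketched roughly as follows. The helper names collect_queries and pick_queries are hypothetical and not part of the answer's code. The sketch assumes the suggestion source is a callable shaped like autocomplete(query, depth=...), passed in as an argument so the pooling logic can be tried without touching the network.

```python
import random
import string


def collect_queries(suggest, seeds=string.ascii_lowercase, depth=2):
    """Collect unique suggestions for every seed, in first-seen order.

    `suggest` is assumed to be a callable like autocomplete(query, depth=...)
    that yields suggestion strings.
    """
    unique = {}
    for seed in seeds:
        for suggestion in suggest(seed, depth=depth):
            cleaned = suggestion.strip()
            key = cleaned.lower()  # case/whitespace variants count as duplicates
            if key:
                unique.setdefault(key, cleaned)
    return list(unique.values())


def pick_queries(pool, count, rng=random):
    """Pick up to `count` distinct queries at random from the pool."""
    return rng.sample(pool, min(count, len(pool)))
```

With the function above, pool = collect_queries(autocomplete) would build the pool once, and pick_queries(pool, 1000) would draw a day's worth of queries from it (fewer if the pool is smaller).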

Licensed under: CC-BY-SA with attribution