ScraperWiki/Python: filtering out records when property is false

https://stackoverflow.com/questions/10437805

05-06-2021

Problem

I'm using the following code on ScraperWiki to search Twitter for a specific hashtag.
It's working great and is picking out any postcode provided in the tweet (or returning false if none is available). This is achieved with the line data['location'] = scraperwiki.geo.extract_gb_postcode(result['text']).
But I'm only interested in tweets which include postcode information (this is because they're going to be added to a Google Map at a later stage).
What would be the easiest way to do this? I'm relatively au fait with PHP, but Python's a completely new area for me. Thanks in advance for your help.
Best wishes,
Martin

import scraperwiki
import simplejson
import urllib2

QUERY = 'enter_hashtag_here'
RESULTS_PER_PAGE = '100'
NUM_PAGES = 10

for page in range(1, NUM_PAGES+1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&page=%s' \
         % (urllib2.quote(QUERY), RESULTS_PER_PAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            #print result
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['location'] = scraperwiki.geo.extract_gb_postcode(result['text'])
            data['from_user'] = result['from_user']
            data['created_at'] = result['created_at']
            print data['from_user'], data['text']
            scraperwiki.sqlite.save(["id"], data)
    except:
        print 'Oh dear, failed to scrape %s' % base_url
        break

Solution

Is this all you need? I tried it on the free ScraperWiki test page and it seems to do what you want. If you're looking for something more complicated, let me know.

import scraperwiki
import simplejson
import urllib2

QUERY = 'meetup'
RESULTS_PER_PAGE = '100'
NUM_PAGES = 10

for page in range(1, NUM_PAGES+1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&page=%s' \
         % (urllib2.quote(QUERY), RESULTS_PER_PAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            #print result
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['location'] = scraperwiki.geo.extract_gb_postcode(result['text'])
            data['from_user'] = result['from_user']
            data['created_at'] = result['created_at']
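            # Only save tweets where a postcode was found
            # (extract_gb_postcode returns a falsy value otherwise).
            if data['location']: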
                print data['location'], data['from_user']
                scraperwiki.sqlite.save(["id"], data)
    except:
        print 'Oh dear, failed to scrape %s' % base_url
        break

Outputs:

P93JX VSDC
FV36RL Bootstrappers
Ci76fP Eli_Regalado
UN56fn JasonPalmer1971
iQ3H6zR GNOTP
Qr04eB fcnewtech
sE79dW melindaveee
ud08GT MariaPanlilio
c9B8EE akibantech
ay26th Thepinkleash
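
The filtering step can be tried without ScraperWiki or the Twitter API. The following is only a minimal sketch: `extract_postcode` is a hypothetical stand-in for `scraperwiki.geo.extract_gb_postcode` (assumed, as the question describes, to return `False` when no postcode is found), its simplified regex is not the one ScraperWiki uses, and the sample records are made up.

```python
import re

# Hypothetical stand-in for scraperwiki.geo.extract_gb_postcode.
# Assumption: like the real function (per the question), it returns False
# when the text contains no postcode. The pattern is deliberately simplified.
_SIMPLE_POSTCODE = re.compile(r'[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}', re.I)


def extract_postcode(text):
    match = _SIMPLE_POSTCODE.search(text)
    return match.group(0) if match else False


def keep_located(results):
    """Return only the records whose text yields a postcode."""
    kept = []
    for result in results:
        location = extract_postcode(result['text'])
        if not location:
            # False means no postcode was found: skip this record.
            continue
        kept.append({
            'id': result['id'],
            'text': result['text'],
            'location': location,
        })
    return kept


# Made-up sample records standing in for results_json['results'].
sample = [
    {'id': 1, 'text': 'Meet us at SW1A 1AA tonight'},
    {'id': 2, 'text': 'No address in this one'},
]
located = keep_located(sample)
print(located)
```

Because `False` is falsy, a plain `if data['location']:` (or `if not location: continue`, as here) is enough; no explicit comparison with `False` is needed.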

I've refined it so it's a bit pickier than ScraperWiki's check for extracting GB postcodes, which lets through quite a few false positives. Basically, I took the accepted answer from here and added negative lookbehind/lookahead to filter out a few more. It looks like the ScraperWiki check runs the regex without the negative lookbehind/lookahead. Hope that helps a bit.

import scraperwiki
import simplejson
import urllib2
import re

QUERY = 'sw4'
RESULTS_PER_PAGE = '100'
NUM_PAGES = 10

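# GB postcode pattern; the lookbehind/lookahead reject matches that are
# embedded in a longer run of letters or digits.
postcode_match = re.compile(r'(?<![0-9A-Z])([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {0,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)(?![0-9A-Z])', re.I)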

for page in range(1, NUM_PAGES+1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&page=%s' \
         % (urllib2.quote(QUERY), RESULTS_PER_PAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            #print result
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['location'] = scraperwiki.geo.extract_gb_postcode(result['text'])
            data['from_user'] = result['from_user']
            data['created_at'] = result['created_at']
            # Require both the ScraperWiki check and the stricter regex.
            if data['location'] and postcode_match.search(data['text']):
                print data['location'], data['text']
                scraperwiki.sqlite.save(["id"], data)
    except:
        print 'Oh dear, failed to scrape %s' % base_url
        break
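
To see what the negative lookbehind/lookahead adds, the same core pattern can be compiled with and without them and run on two strings. This is a small sketch for illustration only: the names `bare`, `guarded` and `first_match` and both sample strings are made up here, and `bare` is merely assumed to approximate a check that has no lookarounds.

```python
import re

# Core pattern copied from the answer above; only the lookarounds differ.
CORE = (r'([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?'
        r' {0,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)')

# Without lookarounds: matches postcode-shaped text anywhere in a string.
bare = re.compile(CORE, re.I)

# With lookarounds: the match may not be preceded or followed by a letter
# or digit, so it has to stand on its own.
guarded = re.compile(r'(?<![0-9A-Z])' + CORE + r'(?![0-9A-Z])', re.I)


def first_match(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


standalone = 'Flat in SW4 7AB today'   # postcode as a separate token
embedded = 'abcSW47ABxyz'              # postcode-shaped run inside a longer token

for text in (standalone, embedded):
    print('%r: bare=%r guarded=%r' % (text,
                                      first_match(bare, text),
                                      first_match(guarded, text)))
```

Both patterns find `SW4 7AB` in the first string. In the second, only `bare` matches (`SW47AB`), because the run is surrounded by other letters; this is the kind of false positive the lookarounds are meant to filter out.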

License: CC-BY-SA with attribution

Not affiliated with StackOverflow