How do I deal with apostrophes in scrapy and MySQL? Spider completely ignores something that has a " ' " in the data

https://stackoverflow.com/questions/23249507

08-07-2023
|

문제

I have a web crawler using scrapy and python that crawls an university's international entry requirements. Before trying to get the crawler to automatically add results to a MySQL database it worked fine and was able to extract all the information I needed. Now that I've created a pipeline that adds the results automatically to MySQL it misses out the results that contain an apostrophe for some reason. I think it has something to do with the UTF-8 encoding.

Just to clarify this works perfectly apart from when a page contains an apostrophe it refuses to upload that piece of information to MySQL, does anyone know how to deal with it?

I'll provide you with one of my spiders and the item pipeline. Thanks.

Bristol.py

from scrapy.spider import BaseSpider
from project.items import QualificationItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'

class recursiveSpider(BaseSpider):
name = 'bristol'
allowed_domains = ['bristol.ac.uk/']
start_urls = ['http://www.bristol.ac.uk/international/countries/']

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    xpath = '//*[@id="all-countries"]/li/ul/li/a/@href'
    a_of_the_link = '//*[@id="all-countries"]/li/ul/li/a/text()'
    for text, link in zip(hxs.select(a_of_the_link).extract(), hxs.select(xpath).extract()):
        yield Request(urljoin(response.url, link),
        meta={'a_of_the_link': text},
        headers={'User-Agent': USER_AGENT},
        callback=self.parse_linkpage,
        dont_filter=True)

def parse_linkpage(self, response):
    hxs = HtmlXPathSelector(response)
    item = QualificationItem()
    xpath = """
            //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
             /following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])]
            """
    item['BristolQualification'] = hxs.select(xpath).extract()[1:]
    item['BristolCountry'] = response.meta['a_of_the_link']
    return item

pipelines.py

import sys
import MySQLdb
import MySQLdb.cursors
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request

class TestPipeline(object):

def __init__(self):
    self.conn = MySQLdb.connect(
        user='c1024403',
        passwd='Beeph3',
        db='c1024403',
        host='ephesus.cs.cf.ac.uk',
        )
    self.cursor = self.conn.cursor()

def process_item(self, item, spider):
    try:
        if 'BristolQualification' in item:
            self.cursor.execute("""INSERT INTO Bristol(BristolCountry, BristolQualification) VALUES ('{0}', '{1}')""".format(item['BristolCountry'], "".join([s.encode('utf-8') for s in item['BristolQualification']])))
        elif 'BathQualification' in item:
            self.cursor.execute("""INSERT INTO Bath(BathCountry, BathQualification) VALUES ('{0}', '{1}')""".format(item['BathCountry'], "".join([s.encode('utf-8') for s in item['BathQualification']])))
        self.conn.commit()
        return item

    except MySQLdb.Error as e:
        print "Error %d: %s" % (e.args[0], e.args[1])

items.py

from scrapy.item import Item, Field

class QualificationItem(Item):
BristolQualification = Field()
BristolCountry = Field()
BathQualification = Field()
BathCountry = Field()

해결책

Your code suffers from SQL injections.

First, look at your SQL and think what happens when an apostrophe shows up:

self.cursor.execute(
  """INSERT INTO Bristol(BristolCountry, BristolQualification) VALUES ('{0}', '{1})""".format(
      item['BristolCountry'], 
      "".join([s.encode('utf-8') for s in item['BristolQualification']])))

Second, read this.

Final fixed version:

self.cursor.execute(
  "INSERT INTO Bristol(BristolCountry, BristolQualification) VALUES (%s, %s)", (
    item['BristolCountry'], 
    "".join([s.encode('utf-8') for s in item['BristolQualification']])
  )
)

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow