lxml not working with django, scraperwiki

https://stackoverflow.com//questions/24005988

20-12-2019
Question

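I'm working on a Django app that goes through the Illinois General Assembly's website to scrape some PDFs. When deployed on my desktop it works fine until urllib2 times out. When I try to deploy it on my Bluehost server, the lxml part of the code throws an error. Any help would be appreciated.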

import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes

class Command(BaseCommand):
    def handle(self, *args, **options):
        chmbrs =['http://www.ilga.gov/house/', 'http://www.ilga.gov/senate/']
        for chmbr in chmbrs:
            site = chmbr    
            url = urllib2.urlopen(site)
            content = url.read()
            soup = BeautifulSoup(content)
            links = []
            linkStats = []
            x=0
            y=0
            table = soup.find('table', cellpadding=3)
            for a in soup.findAll('a',href=True):
                if re.findall('Bills', a['href']):
                    l = (site + a['href']+'&Primary=True')
                    links.append(str(l))
                    x+=1
                    print x
            for link in links:
                url = urllib2.urlopen(link)
                content = url.read()
                soup = BeautifulSoup(content)
                table = soup.find('table', cellpadding=3)
                for a in table.findAll('a',href=True):
                    if re.findall('BillStatus', a['href']):
                        linkStats.append(str('http://ilga.gov'+a['href']))
            for linkStat in linkStats:
                url = urllib2.urlopen(linkStat)
                content = url.read()
                soup = BeautifulSoup(content)
                for a in soup.findAll('a',href=True):
                    if re.findall('votehistory', a['href']):
                        vl = 'http://ilga.gov/legislation/'+a['href']
                        url = urllib2.urlopen(vl)
                        content = url.read()
                        soup = BeautifulSoup(content)
                        for b in soup.findAll('a',href=True):
                            if re.findall('votehistory', b['href']):
                                llink = 'http://ilga.gov'+b['href']
                                try:
                                    u = urllib2.urlopen(llink)
                                    x = scraperwiki.pdftoxml(u.read())
                                    root = lxml.etree.fromstring(x)
                                    pages = list(root)
                                    chamber = str()
                                    for page in pages:
                                        print "working_1"
                                        for el in page:
                                            print "working_2"
                                            if el.tag == 'text':
                                                if int(el.attrib['top']) == 168:
                                                    chamber = el.text
                                                if re.findall("Senate Vote", chamber):
                                                    if int(el.attrib['top']) >= 203 and int(el.attrib['top']) < 231:
                                                        title = el.text
                                                        if (re.findall('House', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('Senate', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >350 and int(el.attrib['top']) <650:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        vs = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in vs:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()    
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'                                                       
                                                elif int(el.attrib['top']) == 189:
                                                    chamber = el.text
                                                if re.findall("HOUSE ROLL CALL", chamber):
                                                    if int(el.attrib['top']) > 200 and int(el.attrib['top']) <215:
                                                        title = el.text
                                                        if (re.findall('HOUSE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "HB"+title[0]
                                                        elif (re.findall('SENATE', title)):
                                                            title = (re.findall('[0-9]+', title))
                                                            title = "SB"+title[0]
                                                    if int(el.attrib['top']) >385 and int(el.attrib['top']) <1000:
                                                        r = el.text
                                                        names = re.findall(r'[A-z-\u00F1]{3,}',r)
                                                        votes = re.findall(r'[A-Z]{1,2}\s',r)
                                                        for name in names:
                                                            legi = name
                                                            for vote in votes:
                                                                v = vote
                                                            if Votes.objects.filter(legislation=title).exists() == False:
                                                                c = Votes(legislation=title, legislator=legi, vote=v)
                                                                c.save()
                                                                print 'saved'
                                                            else:
                                                                print 'not saved'

                                except:
                                    pass

Edit 1: Here is the error traceback.

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home7/maythirt/GAB/legi/management/commands/vote.py", line 51, in handle
    root = lxml.etree.fromstring(x)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
  File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
  File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91939)
lxml.etree.XMLSyntaxError: None

Solution

As Jonathan mentioned, the problem may be scraperwiki.pdftoxml(). You can print or log the value of x to confirm.

Specifically, pdftoxml() runs the external program pdftohtml and uses temporary files to store the PDF and the XML.
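One way to see what x holds before lxml rejects it is sketched below. It is written for Python 3, whereas the question's code is Python 2, and the helper name parse_or_report is hypothetical. The fallback to xml.etree.ElementTree is only there so the sketch runs where lxml is not installed.

```python
import logging

try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

log = logging.getLogger(__name__)


def parse_or_report(xml_data):
    """Parse xml_data; on failure, log what was received and return None."""
    try:
        return etree.fromstring(xml_data)
    except etree.ParseError:
        # An empty or truncated value here points at the PDF-to-XML step,
        # not at lxml itself.
        log.error("XML parse failed; received %d bytes, starting with %r",
                  len(xml_data), xml_data[:200])
        return None
```

In the question's loop this would replace the direct call, for example root = parse_or_report(x), followed by skipping the PDF when the result is None.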

What I'd also check is:

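Is pdftohtml set up correctly on your server?
If so, does the conversion to XML work when you run it directly in a shell on the server, using a PDF that the code fails on? The command it executes is pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "input.pdf" "output.xml"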

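If there is a problem when you run the command directly, that is where your issue lies. Given the way pdftohtml is run in the scraperwiki code, there is no easy way to tell whether the command failed.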
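The two checks above can also be scripted. The following is a minimal Python 3 sketch, not the scraperwiki implementation: it runs the same command line quoted above and raises if the tool is missing or exits with a non-zero status. The function names and the executable parameter are hypothetical, and it assumes pdftohtml signals failure through its exit status, which may not cover every failure mode.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, xml_path, executable="pdftohtml"):
    """Return the argument list for the conversion command quoted above."""
    return [executable, "-xml", "-nodrm", "-zoom", "1.5", "-enc", "UTF-8",
            "-noframes", pdf_path, xml_path]


def convert_pdf(pdf_path, xml_path, executable="pdftohtml"):
    """Run the conversion and raise RuntimeError if it visibly fails."""
    # First check: is the tool installed and on PATH at all?
    if shutil.which(executable) is None:
        raise RuntimeError("%s not found on PATH" % executable)
    # Second check: does the command itself succeed on this PDF?
    result = subprocess.run(
        build_pdftohtml_command(pdf_path, xml_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        raise RuntimeError("%s exited with status %d: %s" % (
            executable, result.returncode,
            result.stderr.decode("utf-8", "replace")))
```

Calling convert_pdf("input.pdf", "output.xml") on the server with one of the failing PDFs should then either succeed or report which of the two checks failed.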

Other tips

The way to go is to add a try/except around the parsing so that you can examine the XML file on its own.

It may be that scraperwiki.pdftoxml produces an invalid XML file for some reason. That has happened to me when using another pdftoxml tool.

And please refactor the code into more functions; it will be much easier to read and maintain :).
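As one small illustration of that refactoring, the title handling that appears twice in the question's code could be pulled into a single function. This is a sketch with a hypothetical name, bill_id_from_title. It assumes title lines that mention "House" or "Senate" plus a number, and it merges the two case-sensitive checks from the question into one case-insensitive check, which is a simplification rather than identical behavior.

```python
import re


def bill_id_from_title(title):
    """Return e.g. 'HB1234' or 'SB56' for a title line, or None if unrecognised."""
    number = re.search(r"[0-9]+", title)
    if number is None:
        return None
    lowered = title.lower()
    # Same order as the question's code: check for House first, then Senate.
    if "house" in lowered:
        return "HB" + number.group()
    if "senate" in lowered:
        return "SB" + number.group()
    return None
```

Both branches of the loop could then call bill_id_from_title(el.text) instead of repeating the regular expressions, and the function can be tested without touching the network or the database.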

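Another approach would be to download all the PDFs first and then parse them. That way, you avoid hitting the website several times when something fails for some reason.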
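A minimal Python 3 sketch of the download-first idea follows. The names download_pdfs and fetch_url, the hashed file names, and the 30-second timeout are all assumptions made for illustration. Files already on disk are skipped, so a rerun after a parsing failure does not request them again.

```python
import hashlib
import os
import urllib.request


def fetch_url(url):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_pdfs(urls, dest_dir, fetch=fetch_url):
    """Save each URL under dest_dir, skipping files already present.

    Returns a dict mapping each URL to its local path.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paths = {}
    for url in urls:
        # Hash the URL to get a stable, filesystem-safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".pdf"
        path = os.path.join(dest_dir, name)
        if not os.path.exists(path):
            data = fetch(url)
            with open(path, "wb") as handle:
                handle.write(data)
        paths[url] = path
    return paths
```

The parsing step can then loop over the returned paths and read each file from disk, independently of the network.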

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow