lxml unable to retrieve a web page, fails with error "failed to load HTTP resource"

https://stackoverflow.com//questions/25007501

20-12-2019

Question

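Hello, I tried opening the link below in a browser and it works, but it cannot be opened from code. The link is actually a combination of the news site and the article path, which is read from another file, url.txt. I tried the same code on a normal website (www.google.com) and it ran fine.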

import sys
import MySQLdb
from mechanize import Browser
from bs4 import BeautifulSoup, SoupStrainer
from nltk import word_tokenize
from nltk.tokenize import *
import urllib2
import nltk, re, pprint
import mechanize #html form filling
import lxml.html

with open("url.txt","r") as f:
    first_line = f.readline()
#print first_line
url = "http://channelnewsasia.com/&s" + (first_line)
t = lxml.html.parse(url)
print t.find(".//title").text

This is the error I get.

This is the content of url.txt:

/news/asiapacific/australia-to-send-armed/1284790.html

Solution

This is because of the &s part of the URL, which is definitely not needed:

url = "http://channelnewsasia.com" + first_line

Also, it is better to join the URL parts with urljoin():

from urlparse import urljoin
import lxml.html

BASE_URL = "http://channelnewsasia.com"

with open("url.txt") as f:
    # strip() drops the trailing newline that readline() keeps
    first_line = f.readline().strip()

url = urljoin(BASE_URL, first_line)
t = lxml.html.parse(url)
print t.find(".//title").text

Prints:

Australia to send armed personnel to MH17 site - Channel NewsAsia
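
To compare the two ways of building the URL without touching the network, here is a minimal sketch. It assumes the line read from url.txt is the path shown in the question, including the trailing newline that readline() returns. The names raw_line, path, broken and fixed are illustrative only, and the import fallback is there so the snippet also runs on Python 3, where urljoin lives in urllib.parse.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the answer

BASE_URL = "http://channelnewsasia.com"

# Stand-in for f.readline(): the path from url.txt plus its newline (assumed).
raw_line = "/news/asiapacific/australia-to-send-armed/1284790.html\n"

# Remove surrounding whitespace so the newline does not end up in the URL.
path = raw_line.strip()

# The question's construction: the stray "/&s" becomes part of the path.
broken = "http://channelnewsasia.com/&s" + path

# The answer's construction: urljoin resolves the path against the base URL.
fixed = urljoin(BASE_URL, path)

print(broken)
print(fixed)
```

Printing the URL before handing it to lxml.html.parse() is a cheap way to spot this kind of stray fragment.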

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow