How do I extract some text using lxml?

https://stackoverflow.com/questions/1621410

06-07-2019

Question

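I want to extract some text from a particular website in order to build a scraper. This is the web address I want to extract the text from.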

Solution

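In general, to solve a problem like this you first download the page of interest as text (with urllib.urlopen, which is urllib.request.urlopen in Python 3, or anything else, even an external utility such as curl or wget, but not a browser, because you want to see how the page looks before any JavaScript has had a chance to run) and study it to understand its structure. In this case, after some study, you will find that the relevant parts are (snipping some irrelevant parts in head and breaking the lines up for readability):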

<body onload=nx_init();>
  <dl>
    <dt>
      <a href="http://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=&oid=091&aid=0002497340" [[snipping other attributes of this tag]]>
        JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>
    </dt>
    <dd class="txt_inline">
      EPA¿¬ÇÕ´º½º ¼¼°è <span class="bar"> |</span> 2009.10.25 (ÀÏ) ¿ÀÈÄ 7:21</dd>
    <dd class="sh_news_passage">
      Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film 'Eight <b> Times</b> Up' directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA</dd>

and so on. So, as the "subject" you want the content of the <a> tag inside the <dt>, and as the "content" you want the content of the <dd> tags that follow it (in the same <dl>).

The headers you receive include:

Content-Type: text/html; charset=ks_c_5601-1987

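so you must also find a way to interpret that encoding into Unicode. I believe the encoding is also known as 'euc_kr', and my Python installation seems to come with a codec for it, but you should check yours too.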
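One way to run that check, sketched in Python 3 with the standard library only. The helper names charset_from_content_type and decode_page are made up for illustration, and the test bytes are rebuilt from the garbled run in the snippet above on the assumption that it is EUC-KR text that was displayed as Latin-1.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_page(raw_bytes, header_value, fallback="utf-8"):
    """Decode raw page bytes using the charset named in the header."""
    charset = charset_from_content_type(header_value) or fallback
    # codecs.lookup raises LookupError if this Python has no such codec.
    codec = codecs.lookup(charset)
    return raw_bytes.decode(codec.name, errors="replace")


header = "text/html; charset=ks_c_5601-1987"
# Shows which codec Python maps the header's charset name to.
print(codecs.lookup(charset_from_content_type(header)).name)

# Assumption: the garbled run is EUC-KR bytes shown as Latin-1, so encoding
# it back to Latin-1 recovers the bytes the server actually sent.
raw = "¿¬ÇÕ´º½º".encode("latin-1")
print(decode_page(raw, header))
```

If the lookup raises LookupError on your installation, the codec is missing and you would need another way to decode the page.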

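Once you have determined all of these aspects, you try lxml.etree.parse on the URL and, as with so many other web pages, it does not parse: the page does not really present well-formed HTML (try the W3C validators on it to find out about some of the ways it is broken).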
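The failure is easy to reproduce offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption made so the example needs no third-party install; both are strict XML parsers, but their error messages differ). FRAGMENT is a shortened, hypothetical version of the page markup quoted earlier.

```python
import xml.etree.ElementTree as ET

# Shortened fragment of the page quoted earlier; the unquoted onload
# attribute alone is enough to make it ill-formed as XML.
FRAGMENT = '<body onload=nx_init();><dl><dt><a href="#">JAPAN</a></dt></dl></body>'


def strict_parse_error(markup):
    """Return the strict parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The real-world fragment is rejected ...
print(strict_parse_error(FRAGMENT))
# ... while a cleaned-up, well-formed version is accepted (prints None).
print(strict_parse_error('<body onload="nx_init();"><dl/></body>'))
```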

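Because malformed HTML is so common on the web, there exist "tolerant parsers" that try to compensate for common errors. The most popular one in Python is BeautifulSoup, and indeed lxml comes with support for it: with lxml 2.0.3 or later, you can use BeautifulSoup as the underlying parser and then proceed "just as if" the document had parsed correctly. I find it simpler to use BeautifulSoup directly, though.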
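To illustrate what "tolerant" means in practice, here is a small sketch that uses the standard library's html.parser.HTMLParser instead of BeautifulSoup (a substitution made only to keep the example free of third-party packages). The LinkTextCollector class and the SLOPPY string are invented for this example.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text inside every <a> element, however sloppy the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_link = False
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self._in_link:
            self.link_texts[-1] += data


# Ill-formed on purpose: unquoted attribute, bare "&" in the URL, unclosed tags.
SLOPPY = ('<body onload=nx_init();><dl><dt>'
          '<a href="read.nhn?mode=LSD&mid=sec"> JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>'
          '</dt><dd class="txt_inline">2009.10.25')

collector = LinkTextCollector()
collector.feed(SLOPPY)
collector.close()
print([text.strip() for text in collector.link_texts])
```

A strict XML parser would stop at the first error in SLOPPY; the tolerant parser keeps going and still reports the link text.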

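For example, here is a script that emits the first few subject/content pairs at that URL (they have changed by now; originally they were the same as the ones you gave ;-). You need a terminal that supports Unicode output (for example, I run this without problems in a Mac's Terminal.App set to utf-8). Of course, instead of the print statements you can collect the Unicode fragments some other way (for example, append them to a list and ''.join them once you have all the required pieces), encode them however you wish, and so on.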

# Python 2 with BeautifulSoup 3 (note the print statements and the import style).
from BeautifulSoup import BeautifulSoup
import urllib

def getit(pagetext, howmany=0):
    # howmany=0 (the default) means "no limit": the counter never hits zero.
    soup = BeautifulSoup(pagetext)
    dls = soup.findAll('dl')
    for adl in dls:
        thedt = adl.dt
        while thedt:
            # The subject is the text of the <a> inside this <dt>.
            thea = thedt.a
            if thea:
                print 'SUBJECT:', thea.string
            # The content is every text node in the <dd> siblings that follow.
            thedd = thedt.findNextSibling('dd')
            if thedd:
                print 'CONTENT:',
                while thedd:
                    for x in thedd.findAll(text=True):
                        print x,
                    thedd = thedd.findNextSibling('dd')
                print
            howmany -= 1
            if not howmany:
                return
            print
            thedt = thedt.findNextSibling('dt')

theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
thepage = urllib.urlopen(theurl).read()
getit(thepage, 3)

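The logic in lxml, or in "BeautifulSoup in lxml clothing", is not much different; only the spelling and capitalization of the various navigation operations change a little.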
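For comparison, here is a rough Python 3 sketch of the same idea using lxml.html, lxml's own tolerant HTML parser. This is not the answer's script: the function name subject_content_pairs and the SAMPLE markup are invented for illustration, the pairs are returned instead of printed, and, unlike the script above, the <dd> scan stops at the next <dt>.

```python
import lxml.html

# Hypothetical sample, modelled on the page fragment quoted earlier.
SAMPLE = """
<body onload=nx_init();>
<dl>
  <dt><a href="http://news.naver.com/main/read.nhn?mode=LSD&mid=sec">
    JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a></dt>
  <dd class="txt_inline">EPA <span class="bar">|</span> 2009.10.25</dd>
  <dd class="sh_news_passage">Gayet won the Best Actress Award for her role
    in the film 'Eight <b>Times</b> Up'.</dd>
</dl>
"""


def subject_content_pairs(pagetext, howmany=None):
    """Return up to `howmany` (subject, content) tuples; None means no limit."""
    root = lxml.html.fromstring(pagetext)
    pairs = []
    for dt in root.iter("dt"):
        links = dt.findall(".//a")
        if not links:
            continue
        # text_content() flattens nested tags; split/join normalizes whitespace.
        subject = " ".join(links[0].text_content().split())
        # Gather the <dd> siblings that follow this <dt>, up to the next <dt>.
        fragments = []
        for sibling in dt.itersiblings():
            if sibling.tag == "dt":
                break
            if sibling.tag == "dd":
                fragments.append(" ".join(sibling.text_content().split()))
        pairs.append((subject, " ".join(fragments)))
        if howmany is not None and len(pairs) >= howmany:
            break
    return pairs


for subject, content in subject_content_pairs(SAMPLE, howmany=3):
    print("SUBJECT:", subject)
    print("CONTENT:", content)
```

For the live page you would presumably decode the downloaded bytes with the euc_kr codec first and pass the resulting string to the function.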

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
