Extracting the introduction of a Wikipedia article with Python

https://stackoverflow.com/questions/4295029

28-09-2019

Question

I want to extract the introduction of a Wikipedia article (ignoring everything else, including tables, images and other sections). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part.

Can anyone give me a quick solution to this? I'm writing Python scripts.

thanks

Solution

I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

Use it with the re.S (re.DOTALL) flag so that . also matches newlines.
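
As a rough sketch of that approach in Python (not a definitive implementation): the function name extract_intro and the helper patterns are made up for illustration, and it assumes the page HTML still contains the <!-- bodytext --> marker, which may not hold for current Wikipedia output. The paragraph run is wrapped in an outer capturing group so the whole sequence is returned, not just the last <p> block.

```python
import re

# Marker, lazy skip, then the whole first run of <p>...</p> blocks (group 1).
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)
# Naive table remover; assumes tables are not nested.
TABLE_RE = re.compile(r"<table.*?</table>", re.S | re.I)
# Strips any remaining inline tags such as <b> or <a>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_intro(html):
    """Return the intro text, or None if the marker or paragraphs are missing."""
    html = TABLE_RE.sub("", html)      # step 1: drop tables (infoboxes etc.)
    match = INTRO_RE.search(html)      # step 2: first <p> run after the marker
    if match is None:
        return None
    return TAG_RE.sub("", match.group(1)).strip()
```

Regexes on HTML are fragile, so treat this as a quick fix only; it will miss paragraphs whose <p> tag carries attributes.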

Other tips

You may want to check out mwlib to parse the Wikipedia source.
Alternatively, use the wikidump lib.
Or screen-scrape the HTML with BeautifulSoup.
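
For the scraping route, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. The class name IntroParser and the function intro_paragraphs are hypothetical, and the rule "the intro ends at the first <h2>" is my assumption about the page layout, not something confirmed above.

```python
from html.parser import HTMLParser


class IntroParser(HTMLParser):
    """Collects <p> text outside tables until the first <h2> (assumed intro end)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished intro paragraphs
        self._buf = None        # text pieces of the currently open <p>, else None
        self._table_depth = 0   # > 0 while inside (possibly nested) tables
        self._done = False      # set once the first section heading is seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            self._done = True
        elif tag == "p" and self._table_depth == 0:
            self._buf = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._buf is not None:
            text = "".join(self._buf).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buf = None

    def handle_data(self, data):
        if not self._done and self._buf is not None:
            self._buf.append(data)


def intro_paragraphs(html):
    """Return the list of intro paragraphs found in the given HTML string."""
    parser = IntroParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Unlike the regex, this handles nested tables and <p> tags with attributes; with BeautifulSoup the same logic would be shorter, but I have not shown that here.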

Ah, there is a question already on SO on this topic:

License: CC-BY-SA with attribution

Not affiliated with StackOverflow