Question

I want to extract the introduction of a Wikipedia article, ignoring everything else (tables, images, and the other sections). I looked at the HTML source of several articles, but I don't see any special tag that wraps this part.

Can anyone suggest a quick way to do this? I'm writing Python scripts.

Thanks.

Solution

I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:

/<!-- bodytext -->.*?((?:<p>.*?<\/p>\s*)+)/

Use the dot-all option so that . also matches newlines; in Python that is the re.S (re.DOTALL) flag.
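
In Python, the approach above might look roughly like the sketch below. It assumes the page really contains the <!-- bodytext --> comment and that intro paragraphs are plain <p> tags without attributes; the names extract_intro, INTRO_RE, TABLE_RE and TAG_RE are made up for illustration. The whole run of paragraphs is captured in one outer group, because a repeated capturing group only keeps its last repetition.

```python
import re

# Assumed marker: the "<!-- bodytext -->" comment used in the regex above.
# The outer group captures the whole run of <p>...</p> blocks.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

# Non-greedy table match: good enough for flat tables, not for nested ones.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.S | re.I)

# Crude tag stripper for turning the matched HTML into plain text.
TAG_RE = re.compile(r"<[^>]+>")


def extract_intro(html):
    """Return the first run of <p> blocks after the marker as text, or None."""
    without_tables = TABLE_RE.sub("", html)
    match = INTRO_RE.search(without_tables)
    if match is None:
        return None
    text = TAG_RE.sub("", match.group(1))
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = (
        "<html><!-- bodytext -->"
        "<table><tr><td><p>infobox</p></td></tr></table>\n"
        "<p>First <b>para</b>.</p>\n<p>Second para.</p>\n"
        "<h2>History</h2><p>Later.</p></html>"
    )
    print(extract_intro(sample))
```

Regexes are fragile against real-world HTML, so treat this as a quick hack rather than a robust parser.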

Other tips

  1. You may want to check out mwlib to parse the Wikipedia source
  2. Alternatively, use the wikidump lib
  3. Scrape the HTML with BeautifulSoup
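
For the screen-scraping route (item 3), here is a rough sketch that uses the standard library's html.parser as a dependency-free stand-in for BeautifulSoup, so it is not BeautifulSoup's API. It assumes the intro is every <p> outside a table before the first <h2>; that rule and the names IntroParser and intro_paragraphs are assumptions for illustration, not something the page markup guarantees.

```python
from html.parser import HTMLParser


class IntroParser(HTMLParser):
    """Collect the text of <p> elements outside tables, up to the first <h2>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None        # text chunks of the <p> being read, or None
        self._table_depth = 0   # > 0 while inside (possibly nested) tables
        self._done = False      # set once the first <h2> is seen

    def _flush(self):
        # Store the buffered paragraph, if any, with whitespace normalized.
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            # Assumed end of the intro: the first section heading.
            self._flush()
            self._done = True
        elif tag == "p" and self._table_depth == 0:
            self._flush()
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def intro_paragraphs(html):
    """Return the intro paragraphs of an HTML page as a list of strings."""
    parser = IntroParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

With BeautifulSoup the same idea is shorter, but this version runs without installing anything.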

There are also existing questions on Stack Overflow on this topic:

  1. Parsing a Wikipedia dump
  2. How to parse/extract data from a mediawiki marked-up article via python
License: CC-BY-SA with attribution