Extracting the introduction of a Wikipedia article with Python
28-09-2019
Question
I want to extract the introduction of a Wikipedia article, ignoring everything else (tables, images, and the other sections). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part.
Can anyone give me a quick solution? I'm writing Python scripts.
Thanks.
Solution
I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:
/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/
With the re.S option (also spelled re.DOTALL in Python) to make . match newlines.
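Since the question asks for Python, here is a minimal sketch of the steps above using only the standard `re` module. It assumes the page HTML still contains the `<!-- bodytext -->` comment, which may not hold for current Wikipedia markup, and that paragraphs open with a bare `<p>` tag. The names `extract_intro`, `INTRO_RE`, `TABLE_RE` and `TAG_RE` are made up for illustration. The repetition is wrapped in an extra outer group so that every paragraph in the run is captured, not only the last one.

```python
import re

# Assumption: the article body is preceded by a "<!-- bodytext -->" comment,
# as described in the answer. Real pages may use different markup.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)
# Naive table stripper: does not handle nested tables correctly.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.S | re.I)
# Removes any remaining inline tags such as <b> or <a href=...>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_intro(html):
    """Return the text of the first run of <p> blocks after the marker, or None."""
    without_tables = TABLE_RE.sub("", html)
    match = INTRO_RE.search(without_tables)
    if match is None:
        return None
    return TAG_RE.sub("", match.group(1)).strip()


if __name__ == "__main__":
    sample = (
        "<!-- bodytext --><table><tr><td><p>infobox</p></td></tr></table>"
        "<p><b>Python</b> is a language.</p>\n<p>It is popular.</p>\n"
        "<h2>History</h2><p>Later.</p>"
    )
    print(extract_intro(sample))
```

Regexes are fragile on real HTML, so this is probably only good enough for a quick script; a parser-based approach is more robust.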
Other tips
- You may want to check mwlib to parse the Wikipedia source
- Alternatively, use the wikidump library
- Scrape the HTML with BeautifulSoup
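To give an idea of what the scraping route in the last tip might look like, here is a rough sketch. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the same idea should carry over to BeautifulSoup. It assumes the introduction is made up of the `<p>` elements that sit outside any table and before the first `<h2>` heading. The names `IntroParser` and `intro_paragraphs` are hypothetical.

```python
from html.parser import HTMLParser


class IntroParser(HTMLParser):
    """Collect the text of <p> elements outside tables, up to the first <h2>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._table_depth = 0  # greater than 0 while inside a <table>
        self._in_p = False
        self._done = False     # assumption: the first <h2> ends the intro
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            self._done = True
        elif tag == "p" and self._table_depth == 0:
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text inside inline tags such as <b> or <a> is also collected here.
        if self._in_p and not self._done:
            self._buf.append(data)


def intro_paragraphs(html):
    """Return the intro paragraphs of an HTML page as a list of strings."""
    parser = IntroParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Calling `intro_paragraphs(page_html)` on the downloaded page source returns one string per paragraph; fetching the page itself is left out of the sketch.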
Ah, there is a question already on SO on this topic: