Extracting the introduction of a Wikipedia article with Python

https://stackoverflow.com/questions/4295029

28-09-2019

Question

I want to extract the introduction of a Wikipedia article, ignoring everything else (tables, images, and the other sections). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part.

Can anyone give me a quick solution to this? I'm writing Python scripts.

Thanks.

Solution

I think you can often get the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

Use it with the S option (re.S, also called re.DOTALL, in Python) so that . matches newlines as well.
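
As a rough illustration of the steps above in Python, here is a minimal sketch using only the standard re module. The sample HTML and the function name extract_intro are made up for the example, and it assumes the page really contains the <!-- bodytext --> marker, which may not hold for every page or MediaWiki version. The pattern is adjusted slightly: the repeated part is wrapped in an outer group, because in Python a repeated group only keeps its last repetition.

```python
import re

# Pattern from the answer, with the run of paragraphs wrapped in an outer
# capturing group so the whole run is captured, not only the last <p>.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

# Hypothetical sample page, standing in for a downloaded article.
SAMPLE_HTML = """<html><body>
<!-- bodytext -->
<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
<p><b>Python</b> is a programming language.</p>
<p>It was created by Guido van Rossum.</p>
<h2>History</h2>
<p>Later section text.</p>
</body></html>"""


def extract_intro(html):
    """Return the intro paragraphs as plain-text strings (empty list if none)."""
    # Step 1: strip tables (infoboxes and so on). Nested tables are not handled.
    no_tables = re.sub(r"<table\b.*?</table>", "", html, flags=re.S | re.I)
    # Step 2: find the first run of <p>...</p> blocks after the marker.
    match = INTRO_RE.search(no_tables)
    if match is None:
        return []
    paragraphs = re.findall(r"<p>(.*?)</p>", match.group(1), flags=re.S)
    # Step 3: drop the remaining inline tags such as <b> or <a>.
    return [re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs]


for paragraph in extract_intro(SAMPLE_HTML):
    print(paragraph)
```

The pattern only matches bare <p> tags, so paragraphs that carry attributes would need a looser pattern such as <p\b[^>]*>.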

Other tips

You may want to look at mwlib to parse the Wikipedia source.
Alternatively, use the wikidump lib.
Or screen-scrape the HTML with BeautifulSoup.
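
To give an idea of the screen-scraping route, here is a small sketch that uses the standard library's html.parser in place of BeautifulSoup, so that it runs without extra installs. The class name IntroParser, the helper intro_paragraphs, and the sample HTML are invented for the example. It assumes the introduction is made of the <p> elements that sit outside any table and come before the first <h2> heading, which should be checked against the real page markup.

```python
from html.parser import HTMLParser


class IntroParser(HTMLParser):
    """Collect the text of <p> elements outside tables, up to the first <h2>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraph texts
        self._current = None    # text pieces of the open <p>, or None
        self._table_depth = 0   # greater than 0 while inside a <table>
        self._done = False      # set once the first <h2> is reached

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            self._done = True   # assumed end of the introduction
        elif tag == "p" and self._table_depth == 0:
            self._current = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        # Keep text only while inside a paragraph that is not in a table.
        if not self._done and self._current is not None and self._table_depth == 0:
            self._current.append(data)


def intro_paragraphs(html):
    parser = IntroParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample page, standing in for a downloaded article.
SAMPLE = """
<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
<p><b>Python</b> is a programming language.</p>
<p>It was created by Guido van Rossum.</p>
<h2>History</h2>
<p>Later section text.</p>
"""

for paragraph in intro_paragraphs(SAMPLE):
    print(paragraph)
```

Compared with the regex, a parser-based approach copes with attributes on tags and with nested tables.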

Ah, there is a question already on SO on this topic:

Licensed under: CC-BY-SA with attribution
