Pergunta

I want to extract the introduction part of a wikipedia article(ignoring all other stuff, including tables, images and other parts). I looked at html source of the articles, but I don't see any special tag which this part is wrapped in.

Can anyone give me a quick solution to this? I'm writing python scripts.

thanks

Foi útil?

Solução

I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the marker. That last bit would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

With the .S option to make . match newlines...

Outras dicas

  1. You may want to check mwlib to parse the wikipedia source
  2. Alternatively, use the wikidump lib
  3. HTML screen scraping through BeautifulSoup

Ah, there is a question already on SO on this topic:

  1. Parsing a Wikipedia dump
  2. How to parse/extract data from a mediawiki marked-up article via python
Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top