Question

I'm trying to parse a wiki page here, but I only want certain parts of it: the links in the main article. I'd like to parse them all. Is there an article or tutorial on how to do it? I'm assuming I'd be using BS4. Can anyone help?

Specifically speaking; the links that are under all the main headers in the page.


Solution

Well, it really depends on what you mean by "parse", but here is a full working example of how to extract all the links from the main section with BeautifulSoup:

from bs4 import BeautifulSoup
import urllib.request

def main():
    url = 'http://yugioh.wikia.com/wiki/Card_Tips%3aBlue-Eyes_White_Dragon'
    page = urllib.request.urlopen(url)
    # Pass an explicit parser so bs4 does not have to guess one
    soup = BeautifulSoup(page.read(), 'html.parser')
    # The main article body lives in the div with id "mw-content-text"
    content = soup.find('div', id='mw-content-text')
    links = content.find_all('a')
    for link in links:
        print(link.get_text())

if __name__ == "__main__":
    main()

This code should be self-explanatory, but just in case:

  • First we open the page with urllib.request.urlopen and pass its contents to BeautifulSoup.
  • Then we extract the main content div by its id. (The id mw-content-text can be found in the page's source.)
  • We proceed by extracting all the links inside the main content.
  • Finally, in a for loop we print the text of each link.
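Since you asked specifically for the links under each of the main headers: you can walk the content in document order and keep track of the most recent header. Here is a minimal sketch of that idea; the HTML snippet, header names, and card names below are made-up stand-ins for the real page structure, which you should check in the page's source.

```python
from bs4 import BeautifulSoup

# A made-up stand-in for the real page's main content div
html = """
<div id="mw-content-text">
  <h2>Support</h2>
  <ul><li><a href="/wiki/A" title="Card A">Card A</a></li></ul>
  <h2>Counters</h2>
  <ul><li><a href="/wiki/B" title="Card B">Card B</a></li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
content = soup.find('div', id='mw-content-text')

# Group link texts by the header they appear under
sections = {}
current = None
for tag in content.find_all(['h2', 'a']):
    if tag.name == 'h2':
        current = tag.get_text(strip=True)
        sections[current] = []
    elif current is not None:
        sections[current].append(tag.get_text())

print(sections)
# {'Support': ['Card A'], 'Counters': ['Card B']}
```

find_all with a list of tag names returns matches in document order, which is what makes the "remember the last header seen" trick work.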

Additional methods you might need for parsing the links:

  • link.get('href') extracts the destination URL
  • link.get('title') extracts the title attribute of the link (the tooltip text), if it has one
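To see those methods in action without fetching anything, here is a small sketch on a single hand-written anchor tag (the href and title values are made up for illustration):

```python
from bs4 import BeautifulSoup

# One hand-written anchor tag standing in for a link from the page
link = BeautifulSoup(
    '<a href="/wiki/Lord_of_D." title="Lord of D.">Lord of D.</a>',
    'html.parser'
).a

print(link.get_text())   # the visible link text
print(link.get('href'))  # the destination URL
print(link.get('title')) # the title attribute, or None if absent
```

Note that link.get(...) returns None rather than raising an error when the attribute is missing, which is handy because not every anchor on a wiki page has a title.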

And since you asked for resources: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ is the first place you should start.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow