Question

I came across the question "Get the first link in a Wikipedia article not inside parentheses", and I am trying to get the same result.

However, the method favored in that question is to parse the whole Wikipedia page in order to find the desired link.

I would prefer to use the Wikipedia API, but I have run into a major issue: I don't know how (or whether it is even possible) to order links by their appearance in the page.

The request I have so far is the following:

http://en.wikipedia.org/w/api.php?action=query&titles=United_States&prop=links&pllimit=max
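
For reference, a request like the one above could be issued from Python roughly as follows. This is only a minimal sketch: the helper names build_links_url, link_titles and fetch_links are hypothetical, the User-Agent string is a placeholder, and format=json is an added parameter that is not in the URL above. As far as I can tell from the API documentation, prop=links returns links sorted by namespace and title rather than by position in the page, which is the limitation described above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_links_url(title):
    # Same parameters as the request above, plus format=json (an assumption)
    params = {
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": "max",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def link_titles(response):
    # Collect link titles from a decoded action=query response
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles

def fetch_links(title):
    # Performs the HTTP request; the User-Agent value is a placeholder
    req = urllib.request.Request(
        build_links_url(title),
        headers={"User-Agent": "first-link-example/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        return link_titles(json.load(resp))
```

Calling fetch_links("United_States") would return the link titles in the order the API sends them, not in the order they appear in the article text.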


Solution

Well, it seems that this is not possible using the API, so I wrote a parser in Python using BeautifulSoup. Here is the implementation:

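import urllib.request
from bs4 import BeautifulSoup

template = "https://wikipedia.org"

def isValid(ref, paragraph):
   # Reject missing hrefs, in-page anchors, external links and namespaced pages
   if not ref or "#" in ref or "//" in ref or ":" in ref:
      return False
   # Keep only links to articles
   if "/wiki/" not in ref:
      return False
   if ref not in paragraph:
      return False
   # The link is treated as inside parentheses when the text before it
   # has a different number of "(" and ")"
   prefix = paragraph.split(ref, 1)[0]
   if prefix.count("(") != prefix.count(")"):
      return False
   return True

def validateTag(tag):
   # Only paragraphs and unordered lists are searched for links
   name = tag.name
   isParagraph = name == "p"
   isList = name == "ul"
   return isParagraph or isList

def getFirstLink(wikipage):
   # wikipage is a path such as "/wiki/United_States"
   req = urllib.request.Request(template + wikipage, headers={'User-Agent': "Magic Browser"})
   page = urllib.request.urlopen(req)
   data = page.read()
   soup = BeautifulSoup(data, "html.parser")
   soup = soup.find(id="mw-content-text")
   # recursive=False: only direct children of the content container are scanned
   for paragraph in soup.find_all(validateTag, recursive=False):
      for link in paragraph.find_all("a"):
         ref = link.get("href")
         if isValid(str(ref), str(paragraph)):
            return link
   # No valid link was found
   return False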
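
The parenthesis check in isValid can be tried in isolation, without any network access. The following is a minimal sketch of that heuristic on a plain string; the function names outside_parentheses and first_outside_parentheses are hypothetical and not part of the code above. It only compares the counts of "(" and ")" before the target, so it is a heuristic rather than a real parser.

```python
def outside_parentheses(text, target):
    # True when target occurs in text and the part of text before its first
    # occurrence has as many "(" as ")"
    if target not in text:
        return False
    prefix = text.split(target, 1)[0]
    return prefix.count("(") == prefix.count(")")

def first_outside_parentheses(text, candidates):
    # Return the first candidate that passes the check, or None
    for candidate in candidates:
        if outside_parentheses(text, candidate):
            return candidate
    return None

html = 'Foo (from <a href="/wiki/Latin">Latin</a>) is a <a href="/wiki/Bar">bar</a>.'
hrefs = ["/wiki/Latin", "/wiki/Bar"]
print(first_outside_parentheses(html, hrefs))
```

Here the first href is skipped because an opening parenthesis before it has not been closed yet, while the second one follows a balanced pair and is accepted.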

If you want to see more of this project, the whole source code is on GitHub: https://github.com/ChrisJamesC/wikipediaPhilosophy

Licensed under: CC-BY-SA with attribution