Download pdf file from wikipedia

https://stackoverflow.com/questions/7354156

28-10-2019
|

Question

Wikipedia provides a link (left side on Print/export) on every article to download the article as pdf. I wrote a small Haskell script which first gets the Wikipedia link and output the rendering link. When I am giving the rendering url as input, I am getting empty tags but the same url in browser provides download link.

Could someone please tell me how to solve this problem? Formated code on ideone.

import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

parseHelp :: Tag String -> Maybe String 
parseHelp ( TagOpen _ y ) = if any ( \( a , b ) -> b == "Download a PDF version of this wiki page" ) y 
                      then Just $  "http://en.wikipedia.org" ++   snd (   y !!  0 )
                   else Nothing


parse :: [ Tag String ] -> Maybe String
parse [] = Nothing 
parse ( x : xs ) 
   | isTagOpen x = case parseHelp x of 
              Just s -> Just s 
              Nothing -> parse xs
   | otherwise = parse xs


main = do 
    x <- getLine 
    tags_1 <-  fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x ) --open url
    let lst =  head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
        url =  fromJust . parse $ lst  --rendering url
    putStrLn url
    tags_2 <-  fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
    print tags_2

Solution

If you try requesting the URL through some external tool like wget, you will see that Wikipedia does not serve up the result page directly. It actually returns a 302 Moved Temporarily redirect.

When entering this URL in a browser, it will be fine, as the browser will follow the redirect automatically. simpleHTTP, however, will not. simpleHTTP is, as the name suggests, rather simple. It does not handle things like cookies, SSL or redirects.

You'll want to use the Network.Browser module instead. It offers much more control over how the requests are done. In particular, the setAllowRedirects function will make it automatically follow redirects.

Here's a quick and dirty function for downloading an URL into a String with support for redirects:

import Network.Browser

grabUrl :: String -> IO String
grabUrl url = fmap (rspBody . snd) . browse $ do
    -- Disable logging output
    setErrHandler $ const (return ())
    setOutHandler $ const (return ())

    setAllowRedirects True
    request $ getRequest url

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow