Using Yahoo Pipes to extract element from RSS feed item's linked page, turning results into an RSS feed

https://stackoverflow.com/questions/20432873

29-08-2022

Question

The German Supreme Court publishes an RSS feed of all its decisions. Unfortunately, the items in this RSS feed, rather than linking to the PDFs of the decision directly, link to a web page in which the PDF is contained in an iFrame.

The web pages all share the same structure. For example, in the web page linked from an RSS feed item, the relative link in the source code looks like this:

<iframe border='0' src='document.py?Gericht=bgh&amp;Art=en&amp;Datum=Aktuell&amp;nr=66132&amp;Frame=4&.pdf' width='744px' height='100%'>Leider kann Ihr Browser keine eingebetteten Frames darstellen. Klicken Sie <a href='document.py?Gericht=bgh&amp;Art=en&amp;Datum=Aktuell&amp;nr=66132&amp;Frame=4&.pdf'>hier</a>, um das gewünschte Dokument zu erhalten.</iframe>

The links are all relative to the folder

http://juris.bundesgerichtshof.de/cgi-bin/rechtsprechung/

I want to convert this RSS feed into an RSS feed in which each item's link is a link directly to the PDF, so in my example the RSS feed item's link should become "http://juris.bundesgerichtshof.de/cgi-bin/rechtsprechung/document.py?Gericht=bgh&Art=en&Datum=Aktuell&nr=66132&Frame=4&.pdf".

My idea is to use Yahoo Pipes to loop through all the items of the RSS feed, follow each item's link, look at the source code of the web page, extract the string between <iframe border='0' src=' and the next ', put the absolute folder path in front of the relative result, and reassign this to the item's link. My sad attempt at doing this is found here. Basically, I have no idea what to enter in the XPath module.

Solution

I have bad news for you. I'm afraid this won't be possible.

The solution in this kind of situation is to create two pipes:

A low-level pipe:
- Receive a URL Input with values like this: http://juris.bundesgerichtshof.de/cgi-bin/rechtsprechung/document.py?Gericht=bgh&Art=en&az=IX%20ZR%2044/12&nr=66132
- Use the XPath Fetch Page module to fetch the URL
- Extract the iframe's src attribute (hopefully) and return it as the result

A higher-level pipe:
- Fetch your original URL with Fetch Feed
- Loop over the feed items, in each iteration calling the low-level pipe with the URL field of the feed item and assigning the result to an attribute
- Construct the absolute URL from the newly assigned attribute
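For comparison, the same two-level structure can be sketched outside Yahoo Pipes in plain Python. This is only an illustration under assumptions: the function names (extract_pdf_url, fetch_page, rewrite_feed) are made up here, the feed is assumed to be ordinary RSS 2.0 with a link element per item, and the page encoding falls back to UTF-8 when the server does not declare one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Folder that the relative iframe links are resolved against (from the question).
BASE = "http://juris.bundesgerichtshof.de/cgi-bin/rechtsprechung/"


class IframeSrcParser(HTMLParser):
    """Remembers the src attribute of the first <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe" and self.src is None:
            # Attribute values arrive already unescaped (&amp; becomes &).
            self.src = dict(attrs).get("src")


def extract_pdf_url(page_html, base=BASE):
    """Low-level step: page source in, absolute PDF URL (or None) out."""
    parser = IframeSrcParser()
    parser.feed(page_html)
    if parser.src is None:
        return None
    return urljoin(base, parser.src)


def fetch_page(url):
    """Download a page as text. The UTF-8 fallback is an assumption."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def rewrite_feed(rss_xml, fetch=fetch_page):
    """Higher-level step: replace each item's link with the PDF URL."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.find("link")
        if link is None or not link.text:
            continue
        pdf_url = extract_pdf_url(fetch(link.text.strip()))
        if pdf_url:  # keep the original link if no iframe was found
            link.text = pdf_url
    return ET.tostring(root, encoding="unicode")
```

Calling rewrite_feed(rss_text) with the downloaded feed text would return the rewritten feed as a string. Passing a different fetch function makes it possible to try the logic on saved pages without touching the network.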

And these two pipes would probably work. Except that, unfortunately, this website rejects Yahoo Pipes: the XPath Fetch Page module gets a "403 Forbidden" error when it tries to fetch that page.

So this cannot work directly with Yahoo Pipes. An alternative, if you can set one up, is a proxy server that relays the requests so that the German website cannot tell they are coming from Yahoo Pipes.
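A relay of that kind could look roughly like the following Python sketch. Everything in it is an assumption for illustration: the ?url= query parameter, the port, the browser-like User-Agent header, and the names target_from_path, RelayHandler and serve are invented here. It is also not known whether the site blocks Yahoo Pipes by IP address or by request headers, so a relay is not guaranteed to help.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request, urlopen

# Only relay requests to this host, so the relay is not an open proxy.
ALLOWED_HOST = "juris.bundesgerichtshof.de"


def target_from_path(path):
    """Return the URL given as ?url=... if it points at the allowed host."""
    values = parse_qs(urlsplit(path).query).get("url", [])
    if not values:
        return None
    target = values[0]
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or parts.hostname != ALLOWED_HOST:
        return None
    return target


class RelayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = target_from_path(self.path)
        if target is None:
            self.send_error(400, "missing or disallowed url parameter")
            return
        # The User-Agent value is an assumption, not something the site documents.
        req = Request(target, headers={"User-Agent": "Mozilla/5.0 (relay sketch)"})
        try:
            with urlopen(req, timeout=30) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/html")
        except HTTPError as err:
            self.send_error(err.code, "upstream error")
            return
        except URLError:
            self.send_error(502, "upstream unreachable")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the relay until interrupted."""
    HTTPServer(("", port), RelayHandler).serve_forever()
```

The low-level pipe would then fetch the relay's address with the percent-encoded page URL in the url parameter, instead of fetching the court's page directly.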

Btw, this is the same reason I cannot create custom feeds based on IMDb (the Internet Movie Database): it refuses all requests coming from Yahoo Pipes.

Licensed under: CC-BY-SA with attribution
