Question

I'm looking for an aggregator for the editorial and op-ed pages of a bunch of English-language newspapers I want to follow. The objective is to generate an HTML page that is just a collection of editorial pieces from the dozen newspapers I want to follow internationally, so that I can print them off in the morning. Since this is a very narrow requirement, I couldn't find anything already available, so I'm thinking of writing one on my own.

Now, I was a programmer for about 8 years in my previous life (and have since been swayed to the "Dark Side" that is Wall Street after my MBA). I no longer know enough about programming to make a good choice of scripting language, so I'm unsure which language would be best for this. Performance is not a key issue; libraries for parsing HTML, handling text, and fetching data from live web pages matter more.

PS: I don't mind learning a new language. Previously I worked extensively with x86 ASM, C and Visual C++/MFC, almost exclusively in Win32 environments.


Solution

Use Python and the excellent lxml library for scraping HTML. It supports CSS selectors, which is a huge convenience, and it's rather fast. It handles broken HTML well too.
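As a rough illustration of this approach, here is a minimal sketch that pulls one editorial out of a page with lxml and builds a printable HTML digest. The sample markup, the `editorial` class name, and the helper names `extract_editorial` and `build_digest` are all made up for the example; each real newspaper will need its own selectors. The sketch uses XPath so that only lxml itself is required, since lxml's CSS selector support depends on the separate `cssselect` package.

```python
from html import escape  # standard library, used to escape text for the output page

import lxml.html

# Hypothetical sample page (an assumption for this sketch); real sites differ.
# The <p> and <b> tags are deliberately left unclosed to mimic sloppy markup.
SAMPLE_PAGE = """
<html><body>
  <div class="nav">Home | Opinion</div>
  <div class="editorial">
    <h1>A Sample Editorial</h1>
    <p>First paragraph.
    <p>Second paragraph with an <b>unclosed tag.
  </div>
</body></html>
"""


def extract_editorial(page_source):
    """Return (title, paragraphs) from the hypothetical 'editorial' container."""
    doc = lxml.html.fromstring(page_source)
    # Exact match on the class attribute; adjust per newspaper.
    container = doc.xpath('//div[@class="editorial"]')[0]
    title = container.xpath('.//h1')[0].text_content().strip()
    paragraphs = [p.text_content().strip() for p in container.xpath('.//p')]
    return title, paragraphs


def build_digest(articles):
    """Combine (title, paragraphs) pairs into one plain HTML page for printing."""
    parts = ["<html><body>"]
    for title, paragraphs in articles:
        parts.append("<h2>%s</h2>" % escape(title))
        for text in paragraphs:
            parts.append("<p>%s</p>" % escape(text))
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_digest([extract_editorial(SAMPLE_PAGE)]))
```

For live pages, the page source could come from the standard library's `urllib.request.urlopen(url).read()`. With the `cssselect` package installed, the container lookup could be written as `doc.cssselect("div.editorial")` instead of the XPath expression.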

Other Tips

Interpreted languages do well with code generation, so you should also consider Perl or Ruby.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow