Is there another way to do screen scraping apart from regular expressions?

https://stackoverflow.com/questions/80834

screen-scraping

09-06-2019
|

Question

I'm doing a personal, just for fun, project that is using screen scraping to give me a System Tray notification in case another line on an HTML table is added, modified or deleted.

Having done this before I thought: well let's go with the regular expression thing and that's it, but being a curious person, made me think that there could be something else out there that could have another paradigm but be as simple to use.

I know about DOM and X-Path and all the xml'ish approaches. I'm looking for something outside the box, something that can even be defined in a set of rules so you can make a plugin system to aggregate various sites.

Solution

See Options for HTML Scraping

OTHER TIPS

Here's an idea: assuming your main use case is getting a notification whenever an HTML file changes, why not use a standard diff tool and then loop through the changed lines, applying your rules?

Also, if this is a situation where you have access to the server and the files you're watching, you might be able to put everything under source control with CVS (or similar) and just watch for commits. If you want to use this approach for random sites on the web, just write a script that periodically downloads the html for the appropriate URLs and then commits it to source control and watch the diffs.

Not very practical, but outside the box.

If you can convert the source into valid XHTML/XML using something like SgmlReader or HtmlTidy then you could use XSLT. Simply create a XSL template for each site you wish to scrape.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow