Grabbing text from a webpage

https://stackoverflow.com/questions/419260

03-07-2019
|

Question

I would like to write a program that will find bus stop times and update my personal webpage accordingly.

If I were to do this manually I would

Visit www.calgarytransit.com
Enter a stop number. ie) 9510
Click the button "next bus"

The results may look like the following:

10:16p Route 154
10:46p Route 154
11:32p Route 154

Once I've grabbed the time and routes then I will update my webpage accordingly.

I have no idea where to start. I know diddly squat about web programming but can write some C and Python. What are some topics/libraries I could look into?

Solution

Beautiful Soup is a Python library designed for parsing web pages. Between it and urllib2 (urllib.request in Python 3) you should be able to figure out what you need.

OTHER TIPS

What you're asking about is called "web scraping." I'm sure if you google around you'll find some stuff, but the core notion is that you want to open a connection to the website, slurp in the HTML, parse it and identify the chunks you want.

The Python Wiki has a good lot of stuff on this.

Since you write in C, you may want to check out cURL; in particular, take a look at libcurl. It's great.

You can use the mechanize library that is available for Python http://wwwsearch.sourceforge.net/mechanize/

You can use Perl to help you complete your task.

use strict;
use LWP;

my $browser = LWP::UserAgent->new;

my $responce = $browser->get("http://google.com");
print $responce->content;

Your responce object can tell you if it suceeded as well as returning the content of the page.You can also use this same library to post to a page.

Here is some documentation. http://metacpan.org/pod/LWP::UserAgent

That site doesnt offer an API for you to be able to get the appropriate data that you need. In that case you'll need to parse the actual HTML page returned by, for example, a CURL request .

This is called Web scraping, and it even has its own Wikipedia article where you can find more information.

Also, you might find more details in this SO discussion.

As long as the layout of the web page your trying to 'scrape' doesnt regularly change, you should be able to parse the html with any modern day programming language.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow