Question

I need to take a web page and extract the address information from it. Some pages are easier than others. I'm looking for a Firefox plugin, Windows app, or VB.NET code that will help me get this done.

Ideally I would like to have a page in our admin site (ASP.NET/VB.NET) where you enter a URL and it scrapes the page and returns a DataSet that I can put in a grid.


Solution

If you know the format of the page (for instance, if they're all like that ashnha.com page) then it's fairly easy to write VB.NET code that does this:

  1. Create a System.Net.WebRequest and read the response into a string.
  2. Then create a System.Text.RegularExpressions.Regex and iterate over the collection of Matches between it and the string you just retrieved. For each match, create a new row in a DataTable (see the sketch after this list).
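
Here is a minimal sketch of those two steps. The single "Address" column is an assumption, and the regex pattern is a placeholder you would replace with one written for the actual markup of the pages you're scraping:

    Imports System.Data
    Imports System.IO
    Imports System.Net
    Imports System.Text.RegularExpressions

    Module AddressScraper
        ' Fetch the page, run the regex over it, and return one row per match.
        Function ScrapeAddresses(ByVal url As String, ByVal pattern As String) As DataTable
            ' Step 1: read the response into a string
            Dim html As String
            Dim request As WebRequest = WebRequest.Create(url)
            Using response As WebResponse = request.GetResponse()
                Using reader As New StreamReader(response.GetResponseStream())
                    html = reader.ReadToEnd()
                End Using
            End Using

            ' Step 2: iterate over the matches, adding one DataTable row per match
            Dim table As New DataTable("Addresses")
            table.Columns.Add("Address", GetType(String))

            Dim re As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
            For Each m As Match In re.Matches(html)
                table.Rows.Add(m.Value.Trim())
            Next

            Return table
        End Function
    End Module

From your admin page you could then bind the result straight to a grid, e.g. MyGrid.DataSource = ScrapeAddresses(urlTextBox.Text, pattern) followed by MyGrid.DataBind() (those control and variable names are hypothetical).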

The tough bit is writing the regex, which is a bit of a black art. See regexlib.com for plenty of tools, books, etc. about regexes.

If the HTML format isn't well-defined enough for a regex, then you're probably going to have to rely on some amount of user intervention in order to identify which bits are the addresses...

OTHER TIPS

What type of address information are you referring to?

There are a couple of Firefox plugins, Operator and Tails, that allow you to extract and view microformats from web pages.

Aza Raskin has talked about recognising when selected text is an address in his Firefox Proposal: A Better New Tab Screen. No code yet, but I mention it as there may be code in Firefox to do this in the future.

Alternatively, you could look at using the map command in Ubiquity, although you'd have to select the addresses yourself.

For general HTML screen scraping in VB.NET, check out HTML Agility Pack. Much easier than trying to Regex it (unless you happen to be a Regex ninja already!)
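
As a rough sketch of what that looks like (the XPath expression and the "address" class are assumptions; point it at whatever element actually holds the addresses on your page):

    Imports System.Data
    Imports HtmlAgilityPack

    Module AgilityPackScraper
        Function ExtractAddresses(ByVal url As String) As DataTable
            Dim table As New DataTable("Addresses")
            table.Columns.Add("Address", GetType(String))

            ' HtmlWeb downloads and parses the page in one step
            Dim web As New HtmlWeb()
            Dim doc As HtmlDocument = web.Load(url)

            ' SelectNodes returns Nothing when the XPath matches nothing
            Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//p[@class='address']")
            If nodes IsNot Nothing Then
                For Each node As HtmlNode In nodes
                    table.Rows.Add(node.InnerText.Trim())
                Next
            End If

            Return table
        End Function
    End Module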

The page you mentioned in your answer would be easy to automate, as the addresses are in a consistent format.

But allowing users to point to any page is a much harder job. The data could be in any format at all. You could write something to dump all the text, guess how it is divided, try to recognise bits like country and state names, telephone numbers, etc., and then show your results in an interface that lets the users complete missing sections, move the dividers, and identify the bits you missed or that they didn't want.
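
To give a flavour of that kind of guessing, here is a toy heuristic that labels lines of dumped text; the patterns are simplistic, US-centric assumptions and nothing more:

    Imports System.Text.RegularExpressions

    Module LineGuesser
        ' Crude guesses: a US-style phone number or a 5-digit ZIP code.
        Function GuessLineType(ByVal line As String) As String
            If Regex.IsMatch(line, "\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}") Then
                Return "phone number?"
            ElseIf Regex.IsMatch(line, "\b\d{5}(-\d{4})?\b") Then
                Return "city/state/zip line?"
            Else
                Return "unknown - ask the user"
            End If
        End Function
    End Module

The interface would then let the user confirm, correct, or discard each guess.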

It's not simple, though, and making an interface that provides a big advantage over simply cutting and pasting into validated form fields would be quite an achievement, I think. I'd be interested to know how you get on!

EDIT: Just noticed this other question that might cover quite a bit of what you want to do: Parse usable Street Address, City, State, Zip from a string

Licensed under: CC-BY-SA with attribution