Question

I have a list of every DOT # (Dept. of Trans.) in the country. I want to find the insurance effective date for each of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go", it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a macro or an Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Thanks


Solution

Can you do it? Probably. But frankly, even if you could, you'd lock up the spreadsheet while it's doing all that processing. And in the end, how would you handle an error halfway through?

I wouldn't do this in a client-facing application. This sounds more like a job for a server-side app that can do the processing and gather the information in a more controlled environment. Then your Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler, and you don't end up staring at Excel while it works its way through thousands of web pages. It was not built to do that elegantly.

What would you write the web service I'm describing in? Well, it depends on your preference. Me, I'd write it in Ruby on Rails, since it can easily handle the scraping aspect of the task and can report the data out easily as well. But really it falls back to whatever you're most comfortable coding in.

OTHER TIPS

You definitely can do this, but Excel is not the best tool for the parsing (though I have done it! And people said it wasn't possible; it can be done using asynchronous Windows API calls. Good luck with all that...)

The first question you have to ask is whether the site is dynamic or not. Is it generating results on the fly? Another question: is there a URL convention that is consistent? In other words, can you bookmark the results and get back to them in a different session without having to do anything more than maybe log on to the site?
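Why the bookmarkable-URL question matters: a GET request packs all of its parameters into the URL itself, so the URL alone reproduces the search. A minimal sketch, with the caveat that the host and parameter names below are invented for illustration (the real FMCSA form fields would have to be read out of the search page's HTML):

```python
from urllib.parse import urlencode

# Hypothetical, bookmarkable GET URL. "example.gov", "dot_number",
# and "search_type" are made-up names, not the real site's fields.
params = {"dot_number": "61222", "search_type": "carrier"}
url = "http://example.gov/carrier_search?" + urlencode(params)

print(url)  # http://example.gov/carrier_search?dot_number=61222&search_type=carrier
```

If the site only answers to a form POST, no such stable URL exists, and you're in the dynamic-page territory discussed below.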

If the site is either static or has a consistent URL query mechanism (in HTTP terms, the query is a "GET" rather than a form "POST"), you can use a parsing-oriented language like Python with a library that fetches web pages; examples abound on Google. Once you have that debugged and working reliably (also test that it reports back intelligently when it can't reach the site; temporarily break your network connection to simulate that), you can shell out to the Python script from an Excel macro.

The trick is that a vanilla Shell call in Excel does not block on the command you launch; it runs asynchronously. So, using Google again, you can find a Windows API call that lets Excel run your retrieval task synchronously (if you didn't block until it completed, the subsequent macro code expecting to parse results would find none there!). Your Python parsing code can generate a tab-delimited text file that your macro can easily load.
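The parsing half of that pipeline can be sketched in a few lines. Everything here is a hypothetical stand-in: the HTML snippet only mimics the rough shape of the report's "Active/Pending Insurance" table (in practice you'd feed in the page you actually fetched), and the regex just grabs MM/DD/YYYY date strings.

```python
import re

def extract_effective_dates(html):
    """Return every MM/DD/YYYY date string found in the markup."""
    return re.findall(r"\b\d{2}/\d{2}/\d{4}\b", html)

# Made-up stand-in for the "Active/Pending Insurance" section of a
# fetched report page; the real FMCSA markup will differ.
sample_page = """
<table>
  <tr><th>Form</th><th>Type</th><th>Effective Date</th></tr>
  <tr><td>91X</td><td>BIPD/Primary</td><td>01/15/2010</td></tr>
</table>
"""

if __name__ == "__main__":
    dot_number = "61222"  # the DOT # you searched for
    for date in extract_effective_dates(sample_page):
        # Tab-delimited output: trivial for the Excel macro to load.
        print(dot_number + "\t" + date)
```

A sturdier version would walk the table with Python's `html.parser` instead of a regex, but the tab-delimited hand-off to Excel stays the same.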

See the point of this design? MODULAR. If there is a bug in your parsing, it is much easier to track down by just looking at the delimited text file. And you are exploiting specialization: you are using a programming language designed for parsing (Python, whatever...); VBA is not really a parsing language.

What if it's not static web pages, but dynamic ones that require unique entries to be made? Then, besides doing it with bizarro Windows API calls from an Excel macro, you can write a dynamic parsing script using Greasemonkey or C#. Greasemonkey is a plugin for Firefox that lets you script website interactions in JavaScript, and it's fairly intuitive. If you took this approach, you could shell out to Firefox with the page URL, and your predefined Greasemonkey script would run against that page. Again, Greasemonkey could generate a text file of the data, which is easy to debug later. Another option I hear about is C#; I've never tried it, since it's Microsoft-specific, but I see many shops do it that way. There is also a Java parsing package called HtmlUnit, but I found it broke when trying to emulate JavaScript events on a web page. Other HTML parsers you can look at are Jerry and Cobra, and there is a newer product called Selenium. I have found Greasemonkey to be the most reliable since it drives an actual browser; with the exception of Selenium, the other products make virtual replications of browsers and unfortunately often fail to do so faithfully. Some don't even bother to run the JavaScript that might be on a page (which can often be the meat and potatoes of how a page is rendered!)

Have fun. This is the deep end of the pool, but it's the one that will keep you busy and gainfully employed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow