Question

I need to programmatically interact with a WebObjects website and extract data from the responses. The particular WebObjects site I am scraping uses component actions and stores sessions in cookies (not urls). This means that all urls look something like this:

http://example.com/WOApp/WebObjects/WOApp.woa/wo/7.0.0.0.29.1.1.1

My first questions are:

  1. Does urls like this not completely destroy local and shared caching opportunities (cachable constraint in REST)? I imaging the only effective caching with such urls is the WebObjects server itself.
  2. Isn't addressability broken as well? Each resource does have a unique endpoint, but it changes constantly. Furthermore (I think) that WebObjects also makes too old URLs invalid since they "time-out" after a period of time. I'm not sure whether this applies only to urls with sessions though.

Regarding the scraping I am not sure whether it's possible to extract any meaningful endpoints from the website. For example, with a normal website I would look through the HTML and extract the POST urls, then use them in my scraper by posting directly to them instead of going through the normal request-response cycle.

In this case I obviously cannot use any URLs extracted from the HTML since they are dynamically generated on each request, but I read something about being able to access WebObjects components directly if the security settings have not been set to disallow this (see https://developer.apple.com/legacy/library/documentation/LegacyTechnologies/WebObjects/WebObjects_3.5/PDF/WebObjectsDevGuide.pdf, p. 53 "Limitations on Direct requests"). I don't understand exactly how to do this though or if it's even possible.

If it's not possible what would be a good approach then? The only options I can think of is:

  • Using a full-blown browser client to interact with the website (e.g. WatiR or Selenium) and extract & process the HTML from their responses
  • Manually extracting the dynamic end-points by first request the page where they are on and then find the place in the HTML where they're located. Then use them afterwards as if they were "static".

I am interested in opinions on how to approach this scenario since I don't believe any of the solutions above are particularly good.

Was it helpful?

Solution

You've asked a number of questions, and I'll see if I can cover each in turn.

Does urls like this not completely destroy local and shared caching opportunities (cachable constraint in REST)? I imaging the only effective caching with such urls is the WebObjects server itself.

There is, indeed, a page cache within the WebObjects application server, and you're right to observe that these component action URLs probably thwart any other kind of caching. Additionally, even though the session ID is not present in the URL, you'd need the session ID in the cookie to re-create the same page, so having just that URL would get you a session restoration error from the application server.

Isn't addressability broken as well? Each resource does have a unique endpoint, but it changes constantly.

Well, yes, on the face of it this is true. You've given a component action URL as an example, and they're tied to the session.

Furthermore (I think) that WebObjects also makes too old URLs invalid since they "time-out" after a period of time. I'm not sure whether this applies only to urls with sessions though.

Again, all true. Component action URLs generate sessions, and sessions time out.

At this point, let me take a quick diversion. I'm assuming you're not the owner of the WebObjects application—you're talking about having to scrape a WebObjects app, and you've identified some ways in which this particular app doesn't conform to REST principles. You're completely right—a fully component-action-based WebObjects application won't be RESTful. WebObjects pre-dates REST by a few years. Having said that, there are ways in which a WebObjects application can be completely RESTful:

  • Using session-less direct actions gives a degree of REST-like behaviour, and would certainly solve the problems you identify with caching, addressability and expiry.
  • Using the ERRest framework to create a 100% RESTful application.

Of course, none of this will help you if you're just trying to scrape a legacy application.

Regarding the scraping I am not sure whether it's possible to extract any meaningful endpoints from the website. For example, with a normal website I would look through the HTML and extract the POST urls, then use them in my scraper by posting directly to them instead of going through the normal request-response cycle.

Again, if it's a fully component action-based application, you're right—all those URLs will be dynamically generated and useless to you.

In this case I obviously cannot use any URLs extracted from the HTML since they are dynamically generated on each request, but I read something about being able to access WebObjects components directly if the security settings have not been set to disallow this…

That's talking about getting a component to render directly from its template with some restrictions:

  • As you note, the application can easily prevent it from happening at all.
  • As mentioned on p.53, the user input and action-invocation phases of rendering the component are skipped, which probably means this approach would be limited to rendering a component that didn't have any dynamic content anyway. This might be of some very limited use to you, though you'd need to know the component names you were interested in, and they wouldn't normally be exposed anywhere.

I'm not sure you're going to find anything better than the types of high-level functional approaches you've already suggested above, such as automating at the browser level with Selenium. If what you need is REST-style direct addressability of resources within the application, you're not going to get that unless you can re-write the application to use direct actions or ERRest where you need them.

OTHER TIPS

A little late, but could help.

I use the Apache's mod_ext_filter (little modified) to pre/post filter the requests/responses from our WebObjects application. The filter calls PHP scripts and can read the dynamical hyperrefs and other things from the HTML pages. The scripts can also modify the HTTP requests, so we can programatically add/remove parameters from the request to implement new workflows in front of the legacy app and cleanup the requests before they will reach WebObjects. It is also possible to handle an additional database within the scripts and store some things over multiple requests.

So you can get the dynamically created links (maybe a button's name or HTML form destination) and can recognize these names within the request.

It is also possible to "remote control" such applications with little scripts like "click on the third button on the page". The only thing you need is a DOM parser to get the structure of the HTML pages and then rebuild the actions which the browser would do (i.e. create the HTTP request manually and send it as POST to the extracted form destination href). The only problem is the Javascript code, which we analyze and reprogram within PHP (i.e. enable/disable input elements, so they will not be transmitted within the requests)

There were some problems within the WebObjects Adapter Module for Apache. It still uses Content-Length within the HTTP header, which you cannot change in mod_ext_filter. If you change the HTML or the parameters within the request, the length of the content will not longer match. But it is possible to change that.

Theoretically it could also be possible to control such an closed-source legacy application from a new UI on a tablet or smartphone, which delegates the user interaction to the backend WebObjects app.

The scripts depends on the page structure, so if your WebObjects app will be changed, you have to correct some things in the scripts (i.e. third button could be now the fourth button).

It should also be possible to add a Restful interface in front of the application and query the data from the legacy app by the filter scripts.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top