Question

I'm about to take on a new project for a client: a server-side Python program that will poll a number of XML streams at regular intervals and populate a PostgreSQL database with the results. The system will basically act as an XML caching daemon. That database will then be accessed by a separate program, which will serve the data as JSON to a public-facing website.

I'm wondering if anyone has suggestions about the high-level design of the Python caching/polling program. I'm thinking of something class-based, where a number of classes (one for each XML stream to be polled) descend from a common parent class, check for new content, and grab any that is found. On the database side, I'm thinking of a table for each of the XML streams, a column for each of the XML fields, and another table to keep track of the "last access" time for each stream.
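Roughly, this is the shape I have in mind (the class names, element names, and URL are placeholders, not a working design):

    import time
    import urllib2  # Python 2.7; urllib.request under Python 3
    import xml.etree.ElementTree as ET

    class StreamPoller(object):
        """Common parent: fetch a feed, parse it, hand rows to storage."""
        url = None        # each subclass points at its own stream
        interval = 300    # seconds between polls

        def fetch(self):
            return ET.parse(urllib2.urlopen(self.url)).getroot()

        def parse(self, root):
            """Subclasses turn the XML tree into plain row dicts."""
            raise NotImplementedError

        def poll_forever(self, store):
            while True:
                for row in self.parse(self.fetch()):
                    store(type(self).__name__, row)
                time.sleep(self.interval)

    class WeatherStream(StreamPoller):
        url = "http://example.com/weather.xml"  # placeholder URL

        def parse(self, root):
            for obs in root.iter("observation"):
                yield {"station": obs.findtext("station"),
                       "temp_c": obs.findtext("temp_c")}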

For this I'm planning on using Python 2.7, psycopg2, and ElementTree for XML processing.

Does anyone have any thoughts or criticism of my design or toolchain at this point? Is parsing my XML values to store in a PostgreSQL database more work than it's worth? The front-end program that will be serving that data will be turning it into JSON rather than XML, by the way.


Solution

I'm not sure why you refer to the daemon as "caching"; it is probably better to think of it as a polling-translation or polling-adaptation daemon. I'm often surprised at how subtle naming differences can change how you think about the problem.

Your desire to get the XML data out of XML is a good one. I've yet to see any case where XML is useful for anything other than well-defined data exchange, and then only when XML is expected by one side or the other. Put another way, if I want to look for the average weight of an unladen swallow, digging through a pile of XML is about the worst way I can think of to get at that data. This points to another place where your writing is inverted compared to my thinking: the data schema is preeminent; XML is just a container. When you say "column for each of the XML fields", I think "column for each data field". Are these different enough to matter? Probably.
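To make that concrete, here is a minimal sketch of the adaptation step, flattening a feed into plain data the moment it arrives (the element and field names are hypothetical):

    import xml.etree.ElementTree as ET

    def rows_from_feed(xml_bytes):
        # Flatten one XML feed into plain dicts keyed by data fields.
        # Element and field names are hypothetical; the point is that once
        # this function returns, XML is out of the picture entirely.
        root = ET.fromstring(xml_bytes)
        for item in root.iter("swallow"):
            yield {
                "species": item.findtext("species"),
                "weight_g": float(item.findtext("weight")),
                "laden": item.findtext("laden") == "true",
            }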

You talk of classes and inheritance. People try to use inheritance far more than is probably worthwhile (which is its own diatribe), but I think you might be over-engineering the adaptation part. I can see no reason to bother constructing a Python class representation of input items when the next step is to turn them into a SQL statement and send them away. From what little you've given, instantiating a class for each row could be convenient or it could be overkill; I'd assume overkill unless you can demonstrate otherwise.
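As a sketch of the simpler path (the table and column names are hypothetical), plain functions and psycopg2's executemany cover the whole adaptation step without any per-row objects:

    import psycopg2

    def store_rows(conn, rows):
        # Write parsed feed items straight to PostgreSQL; no row classes.
        # The connection's context manager commits on success and rolls
        # back on error (psycopg2 2.5 or later).
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO swallow_observations (species, weight_g, laden) "
                "VALUES (%(species)s, %(weight_g)s, %(laden)s)",
                rows,
            )

    conn = psycopg2.connect("dbname=feeds")  # placeholder connection string
    # rows_from_feed is the flattening sketch above; feed.xml stands in
    # for a fetched response.
    store_rows(conn, rows_from_feed(open("feed.xml", "rb").read()))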

Your tool choice is obviously correct, with one exception: unless there is a reason you've not given for preferring Python 2, use Python 3. Your future self will thank you in five years.

OTHER TIPS

I think using PostgreSQL makes sense, perhaps more than most databases. One of the great luxuries of PostgreSQL is that it has a stored-procedure language, PL/Python [http://www.postgresql.org/docs/9.3/static/plpython-data.html], and also good JSON in/out support [http://www.postgresql.org/docs/9.3/static/functions-json.html].
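For example, row_to_json (available since PostgreSQL 9.2) lets the server hand each row back as a JSON document; a sketch against a hypothetical table:

    import psycopg2

    conn = psycopg2.connect("dbname=feeds")  # placeholder connection string
    with conn, conn.cursor() as cur:
        # row_to_json turns each row of the subquery into a JSON object,
        # leaving the JSON-serving front end with almost nothing to do.
        cur.execute("""
            SELECT row_to_json(t)
            FROM (SELECT species, weight_g, laden
                  FROM swallow_observations
                  ORDER BY fetched_at DESC
                  LIMIT 10) AS t
        """)
        documents = [doc for (doc,) in cur.fetchall()]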

I don't want to comment on your specific implementation, but if you're comfortable with Python, you can use stored procedures to store your data efficiently and have the database server give it back in JSON to any client. Though, as mentioned, once the data is in JSON, it's probably best to keep it that way (as opposed to XML).

Licensed under: CC-BY-SA with attribution