Question

We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.

I would like to summarize the list of entries in a report showing how many times each one appears. In SQL I would normally run a SELECT DISTINCT on the column, then loop over each distinct value and run SELECT COUNT(x) WHERE value = valueOfEntry.

While the count portion is easy, the distinct portion is the problem. The closest solution I could find is MapReduce, and most of the samples are in Python. There is this blog entry, which is very helpful but somewhat outdated since it predates the reduce portion. There is also the video here and a few more resources I was able to find.

However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.

This seems like it should be trivial, yet it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug in to AppEngine without all the friction?

I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.


Solution

There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, the Java version shares most of its architecture with the Python version. That document also points to a complete Java sample MapReduce app that reads from the datastore.

To write the results, you specify an Output class. To write them to a new datastore entity, you would need to create your own Output class. You could also use the blobstore (see BlobFileOutput.java).
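To make the shape of such a job concrete, here is a minimal plain-Java sketch of the distinct-count logic that the map and reduce stages would implement. It deliberately does not use the App Engine MapReduce library: the class `DistinctCountSketch` and its methods are hypothetical, and the in-memory list stands in for entities read from the datastore.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a map/reduce job; NOT the App Engine MapReduce API.
class DistinctCountSketch {

    // Map stage: emit a (value, 1) pair for every entry seen.
    static List<Map.Entry<String, Long>> map(List<String> entries) {
        List<Map.Entry<String, Long>> emitted = new ArrayList<>();
        for (String entry : entries) {
            emitted.add(new SimpleEntry<>(entry, 1L));
        }
        return emitted;
    }

    // Reduce stage: sum the emitted counts for each distinct key.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> emitted) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> pair : emitted) {
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Assumed sample data; a real job would read these from the datastore.
        List<String> entries =
                Arrays.asList("apple", "pear", "apple", "plum", "pear", "apple");
        System.out.println(reduce(map(entries)));
    }
}
```

The keys of the resulting map are the distinct values and the map's values are their counts, which is the summary table the question asks for. In a real job, the custom Output class would be the place where each (value, count) pair gets persisted.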

Another alternative is that whenever you write one of your entities, you also write or update a corresponding entry in an EntityDistinct data model.
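A rough sketch of that write-time bookkeeping might look like the following. This is only an illustration under stated assumptions: the in-memory map stands in for the EntityDistinct kind, and the class and method names are hypothetical. A real implementation would update a datastore entity instead, most likely inside a transaction.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for an EntityDistinct data model (hypothetical names).
class EntityDistinctSketch {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // Call from the same code path that writes the main entity.
    void recordWrite(String value) {
        counts.merge(value, 1L, Long::sum);
    }

    // Call when an entity is deleted so the summary stays accurate.
    // Returning null from the remapping function removes the key.
    void recordDelete(String value) {
        counts.computeIfPresent(value, (key, n) -> n > 1 ? n - 1 : null);
    }

    // The report is then a plain read: distinct values with their counts.
    Map<String, Long> report() {
        return new TreeMap<>(counts);
    }
}
```

With this approach the report never has to scan the main table, at the cost of an extra write for every entity write.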

If you plan on running complex reports and you can anticipate all your needs now, I would suggest looking again at BigQuery. BigQuery is really powerful and works well on very large datasets. You can look at http://code.google.com/p/log2bq/, a Python project that loads logs into BigQuery using MapReduce. Alternatively, you could have a cron job that periodically fetches all new entities and moves them into BigQuery.
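For the cron-job variant, the core of the job is picking out only the entities created since the previous run. The sketch below shows that selection with a simple timestamp watermark. It is an assumption-laden illustration: `Row`, `createdMillis`, and `nextBatch` are hypothetical names, the list stands in for a datastore query, and the actual BigQuery load step is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "fetch only new entities" step of a cron job.
class IncrementalExportSketch {

    // Stand-in for a datastore entity with a creation timestamp (assumed field).
    static class Row {
        final String value;
        final long createdMillis;

        Row(String value, long createdMillis) {
            this.value = value;
            this.createdMillis = createdMillis;
        }
    }

    // Watermark: creation time of the newest row exported so far.
    private long lastExportedMillis = 0L;

    // Returns rows created after the previous run and advances the watermark.
    List<Row> nextBatch(List<Row> allRows) {
        List<Row> batch = new ArrayList<>();
        long newest = lastExportedMillis;
        for (Row row : allRows) {
            if (row.createdMillis > lastExportedMillis) {
                batch.add(row);
                newest = Math.max(newest, row.createdMillis);
            }
        }
        lastExportedMillis = newest;
        return batch;
    }
}
```

One caveat with this design: a row that arrives later with the same timestamp as the watermark would be skipped, so a real job would need either a stricter ordering key or some overlap with de-duplication.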

As for the friction, remember that this is a NoSQL database: it has some advantages, but some things are inherently different from SQL. You can always use Google Cloud SQL if your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.

Licensed under: CC-BY-SA with attribution