Question

What's the most efficient way to query for the existence of an item using SimpleDB? For example, "is there a user with this post code"?

Solution

There really aren't that many alternatives. If the attribute you're looking for isn't the item name, your only option is a select. There are two potential approaches. The first uses count(*):

select count(*) from DomainX where AttributeY='ZValue'

The second uses itemName(), which has the added benefit of giving you the item's name if you'd like to retrieve the item after checking for existence (although you'd probably just select * in that case):

select itemName() from DomainX where AttributeY='ZValue'

Additionally, there's the option of adding a limit to either query:

select ..... limit 1

Fortunately, Amazon provides a hint as to which alternative is the most expensive through the BoxUsage value returned by each SimpleDB operation. I wrote a small script to run each of the 4 alternatives 25 times (accounting for library warmup) and compare the timing and BoxUsage for each. The domain used contained about 4500 items with 4 attributes each.
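A minimal sketch of what such a comparison script could look like, in Python. It is not the original script: run_query is a hypothetical callable that sends one select expression to SimpleDB and returns the BoxUsage reported for that request, and the single unmeasured warm-up call is only one way to read "accounting for library warmup". The query builder does no escaping, so it only suits simple names and values like the ones above.

```python
import time


def build_variants(domain, attribute, value):
    """Return the four select expressions being compared, keyed by label."""
    where = "from {} where {}='{}'".format(domain, attribute, value)
    variants = {}
    for output in ("count(*)", "itemName()"):
        base = "select {} {}".format(output, where)
        variants[output + " without limit"] = base
        variants[output + " with limit 1"] = base + " limit 1"
    return variants


def benchmark(run_query, variants, runs=25, warmup=1):
    """Average wall-clock time and BoxUsage per variant.

    run_query is a hypothetical callable (an assumption, not a library API):
    it takes a select expression, executes it, and returns that request's
    BoxUsage as a float.
    """
    results = {}
    for label, expression in variants.items():
        for _ in range(warmup):
            run_query(expression)  # unmeasured call so the client library can warm up
        total_time = 0.0
        total_box_usage = 0.0
        for _ in range(runs):
            started = time.perf_counter()
            box_usage = run_query(expression)
            total_time += time.perf_counter() - started
            total_box_usage += box_usage
        results[label] = (total_time / runs, total_box_usage / runs)
    return results
```

Calling benchmark(run_query, build_variants("DomainX", "AttributeY", "ZValue")) would give one (average seconds, average BoxUsage) pair per variant, which is the shape of the tables that follow.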

My first pass used a where clause with a single predicate that matched several items (11 items):

Type of query                 | Avg time (s) | Avg BoxUsage
------------------------------------------------------------
count(*) without limit        | 0.092        | 0.0000229400
count(*) with limit 1         | 0.092        | 0.0000228616
itemName() without limit      | 0.092        | 0.0000140880
itemName() with limit 1       | 0.090        | 0.0000140080

My second pass used a where clause with a single predicate that matched only one item:

Type of query                 | Avg time (s) | Avg BoxUsage
------------------------------------------------------------
count(*) without limit        | 0.090        | 0.0000140080
count(*) with limit 1         | 0.091        | 0.0000140080
itemName() without limit      | 0.090        | 0.0000140080
itemName() with limit 1       | 0.093        | 0.0000140080

The differences in average time aren't statistically significant, and the timings are probably not all that reliable anyway, since I ran the tests over my home DSL connection. Testing from an EC2 instance would have been more suitable.

The BoxUsage, however, is interesting. It suggests that itemName() is a better fit than count(*), at least when the query matches more than one item. You'd still have to consider whether your query could match a lot of items (hundreds), in which case the itemName() variant carries the overhead of transferring them, even though it would be cheaper BoxUsage-wise. That limit doesn't greatly affect the BoxUsage isn't all that surprising, since it's meant for paging: you can continue retrieving more results by repeating the query with the NextToken returned in the response.
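To illustrate the paging behaviour, here is a rough Python sketch that walks every page of a select by feeding the continuation token from one response into the next request (the Select API calls it NextToken). It assumes a client shaped like boto3's SimpleDB client, that is, an object whose select method accepts SelectExpression and NextToken keyword arguments and returns a dict with optional Items and NextToken keys. iter_item_names is a made-up helper name.

```python
def iter_item_names(client, expression):
    """Yield item names from every page of a select expression.

    client is assumed to behave like boto3.client("sdb"); any object with a
    compatible select(**kwargs) method will do.
    """
    kwargs = {"SelectExpression": expression}
    while True:
        response = client.select(**kwargs)
        for item in response.get("Items", []):
            yield item["Name"]
        token = response.get("NextToken")
        if not token:
            return  # no continuation token: this was the last page
        kwargs["NextToken"] = token  # ask for the next page of the same query
```

With "limit 1" each page holds at most one item, so an existence check only needs the first page that contains anything.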

If I were to implement a generic Exists() operation on top of SimpleDB, I'd probably go with:

select itemName() from X where Y='Z' limit 1
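As a sketch only, such an Exists() helper might look like this in Python. The function names are invented for illustration, and the client is assumed to expose a boto3-style select(SelectExpression=..., NextToken=...) call that returns a dict with optional Items and NextToken keys. Names are quoted with backticks and values with single quotes, each escaped by doubling, following SimpleDB's select quoting rules. The loop covers the case where a response comes back without items but with a continuation token.

```python
def quote_name(name):
    """Quote a domain or attribute name with backticks, doubling any inside."""
    return "`" + name.replace("`", "``") + "`"


def quote_value(value):
    """Quote an attribute value with single quotes, doubling any inside."""
    return "'" + value.replace("'", "''") + "'"


def build_exists_query(domain, attribute, value):
    """Build the 'select itemName() ... limit 1' expression."""
    return "select itemName() from {} where {} = {} limit 1".format(
        quote_name(domain), quote_name(attribute), quote_value(value)
    )


def exists(client, domain, attribute, value):
    """Return True if at least one item in domain has attribute == value.

    client is assumed to behave like boto3.client("sdb").
    """
    kwargs = {"SelectExpression": build_exists_query(domain, attribute, value)}
    while True:
        response = client.select(**kwargs)
        if response.get("Items"):
            return True
        token = response.get("NextToken")
        if not token:
            return False
        kwargs["NextToken"] = token  # partial page: keep looking
```

For the question's example, exists(client, "users", "post_code", "12345") would answer "is there a user with this post code", where the domain and attribute names are placeholders.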

If performance or cost is a concern, you should run these tests yourself in your own environment.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow