Using key-value databases as a set with persistent indices

Question 1

Sounds like you want to do what OpenLDAP does with its Equality index. Perhaps this is the same as the OrientDB example, I didn't read it.

The main table is indexed by a monotonically increasing integer key (called the entryID), and stores the data value. The equality index is indexed by a hash of the value, and stores a list of entryIDs that match the hash. Since the hash might have collisions, just the existence of an entry in the equality index doesn't prove uniqueness or duplication. You still need to check the actual values.

A faster/simpler approach, if you're using MDB, BDB, or some other database that supports duplicate keys, is to just keep one table, using the hash as the key. In both MDB and BDB there is a GET_BOTH request which matches both the key and the data to perform a fetch. If it succeeds then you know for certain that the value already exists. Otherwise, it allows you to save whatever data values and not worry whether or not there are hash collisions.

A caveat here, in MDB using duplicate keys, the size of the values is limited to less than one half of a disk page.

Question 2

Unless I'm missing something here - typically your hash algorithm is consistent and will provide the same key for the same data. Thus you should only need to look up the key to see if it already exists, or handle the (likely duplicate key) error the DB gives back to you.

afaik Key/Value DBs can and will enforce a unique Value constraint for you i.e. you will get an error if you try and save a value that already exists.

Question 3

How big are your value strings?

I would just store them in a key and let the database do all the work.

Typical LevelDB style, which applies to most KV stores, would be to use a pair of keys, prefixed to indicate type

eg:

Key = 'i' + ID 
Value = valueString

Key = 'v' + valueString
Value = ID

In a system that needs to allow for multiple identical valueStrings you would move the ID into the tail of the second key

Key = 'v' + valueString + ID
Value = empty