Difference between local and global indexes in DynamoDB

Question 1

Local Secondary Indexes still rely on the original Hash Key. When you define a table with hash+range, think of the LSIs as hash+range1, hash+range2 .. hash+range6: you get up to 5 more range attributes to query on. Also, there is only one provisioned throughput, shared by the table and its LSIs.

Global Secondary Indexes define a new paradigm: different hash/range keys per index.
This breaks the original usage of one hash key per table. It is also why, when defining a GSI, you are required to set a provisioned throughput per index and pay for it.

More detailed information about the differences can be found in the GSI announcement

Question 2

Here is the formal definition from the documentation:

Global secondary index — an index with a hash and range key that can be different from those on the table. A global secondary index is considered "global" because queries on the index can span all of the data in a table, across all partitions.

Local secondary index — an index that has the same hash key as the table, but a different range key. A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a table partition that has the same hash key.

However, the differences go way beyond the possibilities in terms of key definitions. Find below some important factors that will directly impact the cost and effort for maintaining the indexes:

Throughput:

Local Secondary Indexes consume throughput from the table. When you query records via the local index, the operation consumes read capacity units from the table. When you perform a write operation (create, update, delete) on a table that has a local index, there will be two write operations: one for the table and another for the index. Both operations consume write capacity units from the table.

Global Secondary Indexes have their own provisioned throughput. When you query the index, the operation consumes read capacity from the index. When you perform a write operation (create, update, delete) on a table that has a global index, there will be two write operations: one for the table and another for the index*.

*When defining the provisioned throughput for the Global Secondary Index, make sure you pay special attention to the following requirements:

In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.
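As a rough illustration of the rule quoted above, here is a toy model in plain Java (it does not use the AWS SDK). The class `WriteCapacityModel`, its methods, and the per-write costs are all hypothetical. It assumes every write touches every GSI and ignores burst capacity, item sizes, and on-demand mode.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of write-capacity accounting; a sketch, not the AWS SDK. */
class WriteCapacityModel {
    /** Write capacity units (WCU) still available on the table. */
    private int tableWcu;
    /** WCU still available on each GSI, keyed by (hypothetical) index name. */
    private final Map<String, Integer> gsiWcu = new LinkedHashMap<>();

    WriteCapacityModel(int tableWcu) {
        this.tableWcu = tableWcu;
    }

    void addGsi(String name, int wcu) {
        gsiWcu.put(name, wcu);
    }

    /**
     * Attempts a write costing tableCost WCU on the table (LSI updates are
     * assumed to be included here, since LSIs draw on the table's capacity)
     * and gsiCost WCU on every GSI. Returns false and consumes nothing if
     * the table or any GSI lacks capacity, modelling a throttled write.
     */
    boolean tryWrite(int tableCost, int gsiCost) {
        if (tableWcu < tableCost) {
            return false;
        }
        for (int remaining : gsiWcu.values()) {
            if (remaining < gsiCost) {
                return false;
            }
        }
        tableWcu -= tableCost;
        gsiWcu.replaceAll((name, remaining) -> remaining - gsiCost);
        return true;
    }

    int tableRemaining() {
        return tableWcu;
    }

    int gsiRemaining(String name) {
        return gsiWcu.get(name);
    }
}
```

In this model, a table with plenty of capacity still rejects writes once a single under-provisioned GSI runs out, which is the situation the documentation warns about.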

Management:

Local Secondary Indexes can only be created when you are creating the table. There is no way to add a Local Secondary Index to an existing table, and once you create the index you cannot delete it.

Global Secondary Indexes can be created when you create the table or added to an existing table. Deleting an existing Global Secondary Index is also allowed.

Read Consistency:

Local Secondary Indexes support eventual or strong consistency, whereas Global Secondary Indexes only support eventual consistency.

Projection:

Local Secondary Indexes allow retrieving attributes that are not projected to the index (although with additional cost: performance and consumed capacity units). With Global Secondary Index you can only retrieve the attributes projected to the index.

Special Consideration about the Uniqueness of the Keys Defined to Secondary Indexes:

In a Local Secondary Index, the range key value DOES NOT need to be unique for a given hash key value. The same applies to Global Secondary Indexes: the key values (Hash and Range) DO NOT need to be unique.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html

Question 3

These are the possible searches by index:

By Hash
By Hash + Range
By Hash + Local Index
By Global index
By Global index + Range Index

Hash and Range indexes of a table: These are the usual indexes, available since earlier versions of the AWS SDK.

Global and Local indexes: These are 'additional' indexes created on a table, in addition to the existing hash and range indexes of the table. A global index is similar to a hash. Its range index behaves similarly to the range index used with the hash of the table. In the entity model in your code, the getter must be annotated in this way:

For global indexes:

@DynamoDBIndexHashKey(globalSecondaryIndexName = INDEX_GLOBAL_RANGE_US_TS)
@DynamoDBAttribute(attributeName = PROPERTY_USER)
public String getUser() {
    return user;
}

For the range index associated with the global index:

@DynamoDBIndexRangeKey(globalSecondaryIndexName = INDEX_GLOBAL_RANGE_US_TS)
@DynamoDBAttribute(attributeName = PROPERTY_TIMESTAMP)
public String getTimestamp() {
    return timestamp;
}

Besides, if you read a table through a Global index, it must be an eventually consistent read (not a strongly consistent read):

queryExpression.setConsistentRead(false);

Question 4

One way to put it is this:

LSI - allows you to perform a query on a single Hash-Key while using multiple different attributes to "filter" or restrict the query.

GSI - allows you to perform queries on a Hash-Key (and optional Range-Key) different from the table's own, but costs extra in throughput as a result.

A more extensive breakdown of the table types and how they work, below:

Hash Only

As you probably already know, a Hash-Key by itself must be unique, as writing to a Hash-Key that already exists will overwrite the existing data.

Hash+Range

A Hash-Key + Range-Key allows you to have multiple Hash Keys that are the same, as long as they have a different range key. In this case, if you write to a Hash-Key that already exists, but use a Range-Key that is not already used by that Hash-Key, it makes a new item, whereas if an item with the same Hash+Range combination already exists, it overwrites the matching item.

Another way to think of this is like a file with a format. You can have a file with the same name (hash) as another, in the same folder (table), as long as their format (range) is different. Likewise, you can have multiple files of the same format as long as their name is different.
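The overwrite rule can be sketched with an in-memory map in plain Java. This is only a model of the behaviour described above, not DynamoDB itself; the class `FileTable` and its methods are hypothetical names that follow the file analogy (fileName as hash, fileFormat as range).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of a hash+range table; names are hypothetical. */
class FileTable {
    /** Items keyed by the composite [hash, range] pair. */
    private final Map<List<String>, String> items = new HashMap<>();

    /**
     * Writes an item. Returns the previous payload if an item with the same
     * hash+range combination was overwritten, or null if a new item was made.
     */
    String put(String fileName, String fileFormat, String payload) {
        return items.put(Arrays.asList(fileName, fileFormat), payload);
    }

    int size() {
        return items.size();
    }
}
```

Writing ("report", "pdf") and then ("report", "docx") yields two items, while writing ("report", "pdf") a second time replaces the first one.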

LSI

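An LSI is basically the same as a Hash-Key + Range-Key, and follows the same rules when creating items. Note that the LSI attribute is optional on each item: an item that omits it is still stored in the table, but it does not appear in that index. If the attribute is present, it must match the type declared for the index key.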

To say an LSI is "Range-Key 2" is not entirely correct as you cannot have (using my file and format analogy from earlier) a file named: file.format.lsi and file.format.lsi2. You can, however, have file.format.lsi and file.format2.lsi or file.format.lsi and file2.format.lsi.

Basically, an LSI is just a "Filter-key", not an actual Range-Key; your base Hash and Range value combination must still be unique while the LSI values do not have to be unique, at all. An easier way to look at it may be to think of the LSI as data within the files. You could write code that finds all the files with the name "PROJECT101", regardless of their fileFormat, then reads the data inside to determine what should be included in the query and what is omitted. This is basically how LSI works (just without the extra overhead of opening the file to read its contents).
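To make the "find all files named PROJECT101, then narrow by another attribute" idea concrete, here is a small in-memory sketch in plain Java. It is not the DynamoDB API: the class `LsiSketch`, the `sizeKb` attribute, and the method names are hypothetical, and the sketch does not enforce uniqueness of the base hash+range pair.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory sketch of an LSI-style query; all names are hypothetical. */
class LsiSketch {
    static final class Item {
        final String fileName;   // table hash key
        final String fileFormat; // table range key
        final int sizeKb;        // attribute used as the LSI range key

        Item(String fileName, String fileFormat, int sizeKb) {
            this.fileName = fileName;
            this.fileFormat = fileFormat;
            this.sizeKb = sizeKb;
        }
    }

    private final List<Item> items = new ArrayList<>();

    /** Stores an item; key uniqueness is not enforced in this sketch. */
    void put(Item item) {
        items.add(item);
    }

    /**
     * Models a query through the LSI: restricted to a single hash key,
     * narrowed by a condition on the LSI attribute, and returned in
     * ascending order of that attribute.
     */
    List<Item> queryBySize(String fileName, int minSizeKb) {
        List<Item> result = new ArrayList<>();
        for (Item item : items) {
            if (item.fileName.equals(fileName) && item.sizeKb >= minSizeKb) {
                result.add(item);
            }
        }
        result.sort(Comparator.comparingInt((Item it) -> it.sizeKb));
        return result;
    }
}
```

Two items under the same hash key may share the same `sizeKb` value here, mirroring the point that LSI values do not have to be unique.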

GSI

For GSI, you're essentially creating another table for each GSI, but without the hassle of maintaining multiple separate tables that mirror data between them; this is why they cost more throughput.

So for a GSI, you could specify fileName as your base Hash-Key, and fileFormat as your base Range-Key. You can then specify a GSI that has a Hash-Key of fileName2 and a Range-Key of fileFormat2. You can then query on either fileName or fileName2 if you like, unlike LSI where you can only query on fileName.

The main advantages are that you only have to maintain one table instead of two, and any time you write to the table, DynamoDB automatically propagates the change to the GSI(s) (you never write to a GSI directly), so you can't "forget" to update the other table(s) like you can with a multi-table setup. Also, there's no chance of a lost connection after updating one and before updating the other, like there is with the multi-table setup.

Additionally, a GSI can "overlap" the base Hash/Range combination. So if you wanted to make a table with fileName and fileFormat as your base Hash/Range and filePriority and fileName as your GSI, you can.

Lastly, a GSI Hash+Range combination does not have to be unique, while the base Hash+Range combination does have to be unique. This is something that is not possible with a dual/multi table setup, but is with GSI. The base Hash+Range values are required on every write; the GSI key attributes are optional, and an item that lacks them is simply left out of that index.

Question 5

Another way to explain: an LSI helps you run additional queries on items with the same Hash Key. A GSI helps you run similar queries on items "across the table", which is what makes it so useful.

Say you have a user profile table: unique-id, name, email. If you need to make the table queryable on name or email, the only way is to make them GSIs (an LSI won't help).
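The user profile case can be sketched as two maps in plain Java: one for the base table keyed on the unique id, and one standing in for a GSI keyed on email. This is only an in-memory model, not the AWS SDK; `UserProfileTable`, `User`, and the method names are hypothetical, and the index here is updated synchronously, whereas a real GSI is eventually consistent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of a table with a GSI on email; names are hypothetical. */
class UserProfileTable {
    static final class User {
        final String id;    // table hash key, must be unique
        final String name;
        final String email; // key of the modelled GSI, need not be unique

        User(String id, String name, String email) {
            this.id = id;
            this.name = name;
            this.email = email;
        }
    }

    /** Base table: exactly one item per id. */
    private final Map<String, User> byId = new HashMap<>();
    /** Stand-in for the GSI: several items may share one email. */
    private final Map<String, List<User>> byEmail = new HashMap<>();

    /** Writes go to the base table only; the index is maintained for us. */
    void put(User user) {
        User old = byId.put(user.id, user);
        if (old != null && old.email != null) {
            byEmail.get(old.email).remove(old); // drop the stale index entry
        }
        if (user.email != null) { // items without the attribute are not indexed
            byEmail.computeIfAbsent(user.email, k -> new ArrayList<>()).add(user);
        }
    }

    User getById(String id) {
        return byId.get(id);
    }

    /** Lookup across the whole table by email, which an LSI could not offer. */
    List<User> queryByEmail(String email) {
        return byEmail.getOrDefault(email, Collections.emptyList());
    }
}
```

A second index map keyed on name would play the role of a second GSI in the same way.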

Question 6

This documentation gives a pretty good explanation:

https://aws.amazon.com/blogs/aws/now-available-global-secondary-indexes-for-amazon-dynamodb/

I could not comment on this question, but which is better in terms of write and read performance:

(a local index with table read and write throughput of 100) or (a global index with read/write throughput of 50 along with the table's read/write throughput of 50)?

I do not need a separate partition key for my use case, so a local index should be sufficient for the required functionality.

Question 7

GSIs can't be used for strongly consistent reads.

LSIs can be used for strongly consistent reads, but they limit each item collection (all items that share a partition key value, including their LSI entries) to 10 GB. Also, LSIs can only be created at table creation.