DynamoDB errors: handle with care! (pt 1)

Retries, idempotency, exactly-once processing and your data integrity.

It's a story as old as time: how can a remote database client be sure that a mutation was successfully committed? Okay, the "old as time" part might be a little grandiose - but it's true that any time you talk to a database over a network, you're introducing some complexity. You're likely communicating over a TCP session, and right there you've mixed in a couple of befuddled Generals. What if you missed a message? Should you try again? In this three-part blog, I'm going to first introduce the universal problems, then explain what may be new to some when learning to work with DynamoDB, and finally share some strategies and tips for handling the unknowns when data integrity is at stake.

"Great database in the sky, PLEASE make this change"

When you make that request to write a change to the database, you are of course hoping for the happy case - but sometimes things go wrong. Packets can be lost, sessions can drop, network links can flap, nodes and network devices can reboot or fail, your application process can crash. So, what happened?! Let's take stock of the possibilities and what they mean.

  1. Database says "YES - YOU ARE LUCKY (THIS TIME) - THE DATA DEITIES HAVE SMILED UPON YOU TODAY". Joy! Move on with your application flow, knowing the database has got your back.

  2. Database says "NOPE - TODAY IS NOT YOUR DAY - YOU LOSE!". Well, that seems pretty clear. Better get up off the ground, dust yourself off, and decide what to do about it. You can retry: "Hey - I said make this change and I meant it!" or perhaps "I beg of you - take mercy on me!". Or, you can just forget it and move on - maybe it's not that important anymore - bail out and apologetically take your application user back a little in their experience.

  3. Database says "[AWKWARD SILENCE]". General, your messenger has not returned. How long should you wait? Eventually, you have to give up. Maybe assume it didn't happen and try again? But what if it did happen and you'd be repeating a transaction that you shouldn't? What if this results in a customer order being processed twice? Yikes.
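To make the three cases concrete, here is a minimal Python sketch of how a client might classify what came back from a single write attempt. The names Outcome, RejectedError and classify are made up for illustration and are not part of any SDK; the only assumption is that a lost or late response surfaces as Python's built-in TimeoutError.

```python
import enum


class Outcome(enum.Enum):
    COMMITTED = "committed"  # case 1: the database said yes
    REJECTED = "rejected"    # case 2: the database said no
    UNKNOWN = "unknown"      # case 3: awkward silence


class RejectedError(Exception):
    """Hypothetical error: the database explicitly refused the write."""


def classify(send_write):
    """Call send_write() once and map the result to one of the three cases."""
    try:
        send_write()
    except RejectedError:
        return Outcome.REJECTED
    except TimeoutError:
        # The write may or may not have been applied. We cannot tell from here.
        return Outcome.UNKNOWN
    return Outcome.COMMITTED
```

Only the UNKNOWN case is truly hard: the caller has to decide whether a retry is safe without knowing if the first attempt landed.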

Photo by Guillermo Velarde on Unsplash

Idempotent operations

This is a subtle topic, and it causes plenty of confusion. I know because it took me years to get it squared away in my head (thanks for your patience, Somu) - and there might yet be more for me to learn. But I'm going to try to explain what I know as best I can, in a way that I hope will be simple for you to apply in your own thinking.

Idempotence is a property of an operation: applying it multiple times leaves things in the same state as applying it once. Reading a record is idempotent - you just get a view of the record and no change is made. But what about write operations?

Deleting a record is also idempotent - no matter how many times you do it, the result is the same - the record is gone! How about upserting a uniquely keyed record (writing regardless of whether there's already a record for that key)? Yep, if you were to do it multiple times the effect would be the same each time - an idempotent operation. Okay, now let's look a bit closer at updating a record. If my request is to "update the record, adding 5 to the counter value X", and I make that call multiple times - is the effect the same? No, because the result might go from 22, to 27, to 32. Not idempotent. How can I make it idempotent? I could say instead "update the record, adding 5 to the counter value X, but only if the current value of X is 22". Now the operation is idempotent: once the first attempt succeeds X is 27, so any repeat fails the condition and changes nothing. How about if I ask the database to "insert this item, and assign a new unique key for it"? You got it - not idempotent, because the end result is in fact different - with retries, you could wind up with multiple entries of the same data and that might have unintended consequences. A similar non-idempotent effect can be seen if your database offers a function for appending members to a list (where members are not required to be unique) - if you retry this operation, you get more and more repeat entries in the list!
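Here is a small Python sketch of the counter example, using a plain dict as a stand-in for the database. The function names are invented for illustration, and no real database API is implied.

```python
def add_to_counter(table, key, amount):
    """Unconditional increment: every call changes the stored value."""
    table[key] += amount


def add_to_counter_if(table, key, amount, expected):
    """Conditional increment: only applies if the current value matches."""
    if table[key] != expected:
        return False  # condition failed, nothing changed
    table[key] += amount
    return True


# Unconditional: a retry applies the change a second time.
counters = {"X": 22}
add_to_counter(counters, "X", 5)
add_to_counter(counters, "X", 5)
unconditional_result = counters["X"]  # 22 -> 27 -> 32

# Conditional: the retry fails the check and leaves the value alone.
counters = {"X": 22}
first = add_to_counter_if(counters, "X", 5, expected=22)
retry = add_to_counter_if(counters, "X", 5, expected=22)
conditional_result = counters["X"]  # 22 -> 27, and it stays at 27
```

Notice that the retry of the conditional version reports a failure even though the intended change is already in place. The stored state is the same, but the caller gets a different answer.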

Why does idempotency matter? Using idempotent operations helps to simplify things and give repeatable results. If you use idempotent operations, it's easier to build in reliability functions such as retries, replaying of batches, and resuming of workflows. Great, now let's look at a commonly conflated requirement that is somewhat linked to this idempotence property.

Workflow ordering and exactly-once processing

Idempotent operations sure sound nice, don't they? But let me present a situation that they don't completely address.

  1. User A is doing some shopping on your retail site. Working with their shopping cart, they add Item X. The database happily inserts the record, with a unique key for Item X in the User A cart. That's idempotent - yay! But the application code does not receive confirmation - there is something flaky going on. It waits a while to see if the database response is just slow...

  2. User A reloads and sees Item X in their cart: the read shows it is present. They change their mind and decide to remove it. So they delete, and again the database happily complies - this time the response is received. Another idempotent operation! The item no longer appears in the cart. User A feels assured that things are working as intended.

  3. Now, the original database request to add Item X to the cart times out, and is retried. Item X is back in the cart and User A is very confused - starts to wonder if your company can be trusted.
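The three steps can be replayed in a few lines of Python. This is only a sketch: a dict plays the part of the cart table, and put_item and delete_item are hypothetical helpers rather than calls from any real client library.

```python
cart = {}  # keyed by (user, item); stands in for the cart table


def put_item(user, item):
    """Unconditional upsert: idempotent on its own."""
    cart[(user, item)] = {"item": item}


def delete_item(user, item):
    """Delete if present: also idempotent on its own."""
    cart.pop((user, item), None)


put_item("A", "X")  # step 1: committed, but the response is lost
seen_after_reload = ("A", "X") in cart

delete_item("A", "X")  # step 2: User A removes the item
seen_after_delete = ("A", "X") in cart

put_item("A", "X")  # step 3: the late retry of step 1 arrives
seen_after_retry = ("A", "X") in cart  # the item is back
```

Each operation is idempotent, yet the sequence still ends in a state the user never asked for.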

This is a pretty simple scenario, but I think we can agree it demonstrates some undesirable behaviors. What can be done to address it? We could add some conditions to the insert in step 1, perhaps. How about if we say "only add Item X if there isn't already an Item X record present"? That condition does not provide a fix: by the time the retry arrives, the delete in step 2 has already removed the record, so the condition passes and Item X comes back anyway. What if we change things as follows?

  1. User A wants to put 3 of Item X in their cart. The application first checks how many are there, finds 0, and submits a request to the database to increase the number of Item X in the cart by 3 - but only if the existing number is still 0. No response received from the database and the application continues waiting, but 3 of Item X are in fact added to the cart.

  2. User A reloads their view of the cart and can see there are 3 of Item X in it. They change their mind and decide to remove all 3. The application asks the database to remove 3 of Item X, but only if the existing count is still 3. This succeeds and it matches User A's intent. This is an optimistic locking pattern.

  3. Now the original database request to add 3 of Item X (if the current count is 0) is retried, and succeeds. But once again, the result is that the cart contents are not what User A reasonably expects.
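The same replay with a condition on the count, again as a rough Python sketch over a dict (add_if_count_is is a made-up helper, not a real API):

```python
def add_if_count_is(cart, item, amount, expected):
    """Change the item count by amount, but only if it currently equals expected."""
    if cart.get(item, 0) != expected:
        return False  # condition failed, nothing changed
    cart[item] = cart.get(item, 0) + amount
    return True


cart = {}
step1 = add_if_count_is(cart, "X", 3, expected=0)   # committed, response lost
step2 = add_if_count_is(cart, "X", -3, expected=3)  # User A removes all 3
step3 = add_if_count_is(cart, "X", 3, expected=0)   # late retry of step 1
final_count = cart["X"]
```

The condition in step 3 passes because the count really is 0 again. A value-based check cannot distinguish "never applied" from "applied and then undone".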

Idempotent operations are not everything. Sometimes we need more to keep our data true in representing our process intentions. Will multi-version concurrency control (MVCC) help us in this situation? Let's see.

  1. User A wants to put 3 of Item X in their cart. The application first checks the existing record and finds that there is an item which has version 1, and the count of Item X is presently 0. A request is submitted to the database to increase the count by 3 and bump the version to 2 - but only if the version number is still 1. The change is committed, but the confirming response never makes it to the application and it keeps waiting.

  2. User A gives up and reloads, seeing 3 of Item X in their cart. They change their mind and want to remove those 3 of Item X from their cart. They submit a request to do so, and the application knows the current version of the record is 2. So it asks the database to reduce the count of Item X by 3, but only if the version of the record is still 2 - oh, and bump the version to 3.

  3. Now the original request to add 3 of Item X (if the current version number is 1) is retried. And it fails because the version number has changed. What does this mean? The application still cannot tell whether the user's intention was applied or not. Should it retry by going to get the latest version and adding 3?
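Here is the version-check variant as a Python sketch. It assumes the record starts at version 1 and that every successful write bumps the version by one; update_count and VersionConflict are illustrative names, not part of any library.

```python
class VersionConflict(Exception):
    """Hypothetical error: the record changed since the caller last read it."""


def update_count(record, delta, expected_version):
    """Apply delta and bump the version, but only if the version still matches."""
    if record["version"] != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {record['version']}"
        )
    record["count"] += delta
    record["version"] += 1


record = {"count": 0, "version": 1}

# Step 1: committed, but the response is lost.
update_count(record, +3, expected_version=1)

# Step 2: User A reloads, so the application sees the new version.
update_count(record, -3, expected_version=record["version"])

# Step 3: the late retry of step 1 still carries the old version.
try:
    update_count(record, +3, expected_version=1)
    retry_outcome = "applied"
except VersionConflict:
    retry_outcome = "conflict"
```

This time the stale retry is stopped, which is an improvement. But all the retrying code learns is "conflict": it has no way to know that its own first attempt was the write that moved the version on.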

MVCC is also not a complete solution - it leaves some unknowns that might be very important for the user experience, or for crucial data correctness (imagine if these were banking transactions or stock trades).

Photo by Maël BALLAND on Unsplash

To enforce order, there must be a monotonically increasing timestamp or version applied to each change intent - or an ordered queue to work from - one step at a time. And to ensure exactly-once processing semantics, you must request your changes along with the unique identifier for that intent. For example: User A wants to transfer $27 to User B - I'll assign ID UA-UB-f6812f37-4c3d-4f59-95f5-b068e2f73733 to this intent. On successful processing, the unique identifier is stored for future reference. When applying any change, it is made dependent on the unique identifier not already being present in the store. If the identifier is already present, then a prior attempt succeeded - retries can be discontinued knowing that the intent has been satisfied.
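A minimal Python sketch of the intent ID idea follows, with an in-memory dict standing in for the store. The transfer function and the store layout are assumptions made for illustration. In a real database, checking for the ID, applying the change and recording the ID would all have to happen as one atomic operation; single-threaded in-memory code gets that for free.

```python
def transfer(store, intent_id, source, target, amount):
    """Apply a transfer at most once, keyed by its intent ID."""
    if intent_id in store["applied_intents"]:
        # A prior attempt already succeeded, so there is nothing left to do.
        return "already-applied"
    store["balances"][source] -= amount
    store["balances"][target] += amount
    store["applied_intents"].add(intent_id)  # remember the intent for later retries
    return "applied"


store = {"balances": {"A": 100, "B": 0}, "applied_intents": set()}
intent = "UA-UB-f6812f37-4c3d-4f59-95f5-b068e2f73733"

first = transfer(store, intent, "A", "B", 27)  # response may be lost
retry = transfer(store, intent, "A", "B", 27)  # safe to repeat
```

Unlike the version check, the retry now gets a definite answer: the ID is present, so the intent has been satisfied and the money has moved exactly once.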

Tune in next time...

So, we're beginning to see that getting all of this right is quite complex - and there has been a lot to absorb already. I'll close out this first part of our exploration of the topic for now. Next time, we'll talk about a behavior of DynamoDB that sometimes surprises developers. And I'll share some tips and techniques for adding idempotency, controlling order, and ensuring exactly-once processing with DynamoDB. Follow me to part 2.