How to avoid double submissions to Amazon SES?

Question

According to the SES API Reference for SendRawEmail, the only parameters you provide as part of the request are a list of recipients, the email body, and your source address. Unfortunately, it seems very clear that if you get a timeout rather than a response from SES, there is no way to know if that particular email was sent. I know that's very upsetting. I hate it when I'm in that situation, too.

You do, however, have some options when it comes to figuring out the most practical solution to this dilemma. You could make a blanket decision to never retry and assume that an unsent message is better than a duplicate message. You could also make a blanket decision that duplicate emails are totally acceptable. However, my preferred and recommended approach is the less academically satisfying, yet pragmatic, middle ground. Let me explain.

When you're integrating with a new service and you find an edge case that you don't know how to handle, but that you don't expect to happen very often, the best thing to do is collect more info and handle things manually in the meantime. Rome wasn't built in a day, and your cloud service isn't going to work perfectly the first day you turn it on. So when you get a timeout, don't retry automatically yet. Log it and save away whatever you might need to resend that email later.
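To make the "log it and save it" idea concrete, here is a minimal sketch. Everything named in it is my own assumption for illustration: `send_or_park`, the JSON-lines file of pending messages, and the `send` callable, which stands in for whatever wrapper you already have around SendRawEmail. Nothing here comes from the SES SDK itself.

```python
import base64
import json
import logging
import time
import uuid
from typing import Callable, Optional, Sequence, Tuple, Type

log = logging.getLogger("ses_outbox")

# Hypothetical: `send` is your own wrapper around SendRawEmail. It takes
# (source, recipients, raw_message) and returns the SES message id.
SendFn = Callable[[str, Sequence[str], bytes], str]


def send_or_park(
    send: SendFn,
    source: str,
    recipients: Sequence[str],
    raw_message: bytes,
    pending_path: str,
    # Assumption: your wrapper surfaces timeouts as one of these types
    # (with boto3 you would likely pass botocore's timeout exceptions).
    timeout_errors: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> Optional[str]:
    """Try once. On a timeout, do NOT retry: log it and save the request."""
    try:
        return send(source, recipients, raw_message)
    except timeout_errors as exc:
        # Keep everything needed to resend (or investigate) by hand later.
        record = {
            "id": str(uuid.uuid4()),
            "timed_out_at": time.time(),
            "source": source,
            "recipients": list(recipients),
            "raw_message_b64": base64.b64encode(raw_message).decode("ascii"),
            "error": repr(exc),
        }
        with open(pending_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        log.warning("SES send timed out; parked as %s", record["id"])
        return None
```

A return value of `None` means "unknown, go look at the pending file", not "failed". Counting the lines in that file against your total send volume gives you the timeout rate you need for the next step.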

Now, imagine you're all done coding and doing integration testing and you have turned on the service in production. The first day, you try to send 100,000 emails. If you get 1,000 timeouts (that's 1%), something really weird is going on and you know you need to investigate your network! What if, instead, you get 0 timeouts on the first day, the same on the second day, and then by the seventh day, out of 700,000 attempts for the week, there was only 1 timeout? If appropriate, you can try calling that 1 customer and saying "Hi, sorry to bother you, but we're really committed to reliability and we had a technical problem. I wanted to make sure you got the email receipt for [XYZ]." If they say no, they never got it, then it might make sense to go back and change the code so that when there's a timeout, you just retry after waiting a few seconds, since you now know a timeout probably means the email wasn't sent and the retry is probably going to work. Same idea for anything in between.
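If the evidence does point you toward "just retry after a few seconds", the change is small. This is only a sketch under the same assumptions as before: `send` is a hypothetical wrapper around your SendRawEmail call, and `send_with_retry`, the 5-second delay, and the single retry are illustrative choices of mine, not anything SES prescribes.

```python
import time
from typing import Callable, Sequence, Tuple, Type


def send_with_retry(
    send: Callable[[str, Sequence[str], bytes], str],
    source: str,
    recipients: Sequence[str],
    raw_message: bytes,
    retries: int = 1,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    timeout_errors: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> str:
    """Send, and on a timeout wait a few seconds and try again.

    Only sensible once your own data suggests that a timeout usually
    means the email was NOT sent; otherwise this creates duplicates.
    """
    attempt = 0
    while True:
        try:
            return send(source, recipients, raw_message)
        except timeout_errors:
            if attempt >= retries:
                raise  # out of retries: let the caller log and park it
            attempt += 1
            sleep(delay_seconds)
```

I'd keep the retry count low. If the retry times out as well, you're back in "something weird is going on" territory, and that's a case for a human again.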

The point here is that you will be applying your human intelligence to the mystery. I've found it's often easier to NOT try to outsmart the unknown. Just set yourself up to be able to handle whatever happens, and figure out what the smart thing to do is once you know what the problem really looks like.

(You may enjoy this blog post -- written by someone else -- about "not handling edge cases".)