Question

We had a terrible experience yesterday when trying to swap our staging <--> production roles.

Here is our setup:

We have a worker role picking up messages from the queue. These messages are processed on the role (Table Storage inserts, db selects, etc.). This can take maybe 1-3 seconds per queue message, depending on how many Table Storage inserts it needs to make. It deletes the message when everything is finished.
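The loop described above can be sketched as follows. This is a minimal, illustrative simulation in Python (the real role is .NET on Azure; the `InMemoryQueue` class and all names here are hypothetical stand-ins, not the Azure SDK), showing the get -> process -> delete-last pattern and the visibility timeout that hides a message while it is being processed:

```python
import time

class InMemoryQueue:
    """Toy queue mimicking Azure Queue semantics: getting a message hides it
    for a visibility timeout; deleting it removes it for good. If the message
    is never deleted, it becomes visible again and is redelivered."""
    def __init__(self):
        self._messages = []

    def put(self, payload):
        self._messages.append({"payload": payload,
                               "invisible_until": 0.0,
                               "dequeue_count": 0})

    def get(self, visibility_timeout=30.0):
        now = time.monotonic()
        for msg in self._messages:
            if msg["invisible_until"] <= now:
                msg["invisible_until"] = now + visibility_timeout
                msg["dequeue_count"] += 1
                return msg
        return None  # nothing currently visible

    def delete(self, msg):
        self._messages.remove(msg)

def process(msg, table):
    # The Table Storage inserts / db selects would go here (1-3 s per message).
    table.append(msg["payload"])

queue = InMemoryQueue()
table = []
queue.put("message-1")

msg = queue.get()
process(msg, table)
queue.delete(msg)  # delete only once everything has finished
```

The key property this models: the delete happens last, so a crash mid-processing redelivers the message rather than losing it, at the cost of possible duplicate processing.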

Problem when swapping:

When our staging deployment went online, our production worker role started throwing errors.

When the role tried to process a queue message, it gave a constant stream of 'EntityAlreadyExists' errors. Because of these errors the queue messages weren't being deleted, so they were put back in the queue, picked up for processing again, and so on...

When we looked inside these queue messages and analysed what would happen with them, we saw they had actually been processed but not deleted.

Deleting these faulty messages didn't end the problem: new queue messages weren't being processed either, and no Table Storage records were added, which seemed very strange.

After deleting both the staging and production deployments and publishing to production again, everything started to work just fine.

Possible problem(s)?

We have little to no idea what actually happened.

  • Maybe both roles picked up the same messages, and one did the insert while the other errored?
  • ...???

Possible solution(s)?

We have some ideas on how to solve this 'problem'.

  • Build a poison-message failover system? When the dequeue count gets over X, we just delete that queue message or move it into a separate 'poison' queue.
  • Catch the EntityAlreadyExists error and just delete that queue message or put it in a separate queue.
  • ...????

Multiple roles

I suppose we will have the same problem when running multiple role instances?

Many thanks.

EDIT 24/02/2012 - Extra information

  • We actually use GetMessage().
  • Every item in the queue is unique and generates unique entities in Table Storage. A little more information about the process: a user posts something that has to be distributed to certain other users. The message generated from that post gets a unique id (a GUID). This message is posted to the queue, picked up by the worker role, and distributed over several other tables (PartitionKey -> UserId, RowKey -> a timestamp in ticks plus the unique message id). So in a normal situation there is almost no chance the same message is inserted twice.
  • The invisibility timeout COULD be a logical explanation, because some messages are distributed to maybe 10-20 tables. This means 10-20 inserts without the batch option. Can you set or extend this invisibility timeout?
  • Not deleting the queue message because of an exception COULD be an explanation as well, because we haven't implemented any poison-message failover YET ;).
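On the invisibility-timeout point above: a simple way to reason about it is to size the timeout from the worst-case processing time plus a safety margin, instead of relying on the client default. A hedged sketch in Python (the function name, the 30 s default, and the sample numbers are illustrative assumptions, not the Azure API; in the real client the timeout is passed when getting the message):

```python
def choose_visibility_timeout(n_inserts, seconds_per_insert=1.0, margin=2.0):
    """Pick a visibility timeout that comfortably covers the worst case.

    If processing a message can take n_inserts * seconds_per_insert and the
    requested timeout is shorter, the message becomes visible again while it
    is still being processed; another instance picks it up, and the second
    set of inserts fails with EntityAlreadyExists -- exactly the symptom
    described above. Never go below an assumed 30 s baseline.
    """
    return max(30.0, n_inserts * seconds_per_insert * margin)

# 20 unbatched table inserts at ~3 s each would blow past a 30 s timeout:
chosen = choose_visibility_timeout(20, seconds_per_insert=3.0)  # 120.0
```

The margin also buys time for transient storage slowdowns; if processing can run arbitrarily long, periodically extending the timeout on the in-flight message is the more robust pattern.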

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow