I am not much of a storyteller, but a very simple task on my current project has made me one. This is a story about how a simple task turned notorious, how a bad decision was made, and the struggle to make that bad decision work.
Let me tell you a story about a STORY that
- …took me 4 days to finish.
- …had 6 PRs rejected before it went to master.
- …damaged my I AM A GENIUS attitude.
- …made me miss a Holi celebration, a friends' meet-up and also a DATE (though there was no time limit attached to it).
Before I tell you what it is, let me give you some context on what I was working on.
The task was to implement an “Account Reset” feature (from a multi-user account to a single-user account), which involved deleting a bunch of resources.
- Account (Company): has many Users in the pro account type and one User in the default type.
- User: has many multimedia resources (images, audio files, video files, PDFs) whose files are stored on Amazon S3 using CarrierWave.
On reset of an account, we need to delete all the users and their associated data.
Providing the reset UI, designing the API and the controller action, and deciding on callbacks took me just a few hours. Deleting resources is just 1/10 of the story. If the files were not stored on S3, it would have been as simple as
user.resources.destroy_all in an
But it didn’t happen. Here is why.
1, TIME OUT: An account can have anywhere from a minimum of 6 users to any number of users, with 10 on average. Each user can have up to 50 resources, which means about 500 resources to delete for an average account. Deleting a resource should also delete the associated file on S3, so the request used to take insanely long, and we generally terminate any request that takes more than 15 seconds.
Solution: Simple! Move the deletion of resources out of the transaction with the help of delayed jobs.
OK fine! Did it solve my problem? NO.
2, ROLL BACK: Using a delayed job for deletion definitely moves the S3 file deletion out of the transaction. But what happens if the transaction rolls back? Let's say the deletion of a resource fails for a valid reason and triggers a rollback. Since you have already enqueued a delayed job, it is not easy to revert it, right?
Solution: Test whether the user has been deleted before deleting the user's resources. If the user still exists in the DB, don't delete the resources.
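The guard described above can be sketched in plain Ruby. This is only an illustration, not the project's actual worker: USERS and BUCKET are hypothetical in-memory stand-ins for the users table and the S3 bucket, and FileCleanupJob is a made-up name.

```ruby
# Hypothetical in-memory stand-ins for the users table and the S3 bucket.
USERS = { 1 => "alice", 2 => "bob" }
BUCKET = { "u1/a.png" => :data, "u2/b.pdf" => :data }

# Hypothetical background job: removes the files only if the user row is gone.
class FileCleanupJob
  def perform(user_id, file_keys)
    # If the destroy was rolled back, the user is still in the "DB": do nothing.
    return :skipped if USERS.key?(user_id)

    file_keys.each { |key| BUCKET.delete(key) }
    :deleted
  end
end

job = FileCleanupJob.new

# User 1 was destroyed and the transaction committed.
USERS.delete(1)
result_committed = job.perform(1, ["u1/a.png"])

# User 2's destroy was rolled back, so the row still exists.
result_rolled_back = job.perform(2, ["u2/b.pdf"])
```

The guard protects the files of a user whose deletion was rolled back, but it only works if the job runs after the transaction has finished, which is exactly what goes wrong next.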
OK fine! Did it solve my problem? NOO.
3, Job was picked up even before the transaction was committed: Yes! This can happen, right?
We create the delayed job in the after_destroy callback. The delayed job could be picked up for execution even before the destroy transaction is committed. So when the delayed job checks that the resource has been removed before running the S3 file delete, the resource may still exist.
Retries may help, but they are not a reliable workaround, and it is a bad idea to raise an error without reason.
Solution: Pass a delay param to the Sidekiq job, so that it is picked up for execution after some time.
OK fine! Did it solve my problem? NOOO.
4, We don't know exactly how much time the transaction takes to complete. If it is less than the delay passed to the Sidekiq job, then the UI keeps showing the resources under the account, which is wrong (there should not be any resources under the account once the reset is done). If the transaction takes longer than the delay I passed to the Sidekiq job, then the delayed job fails to delete the resources.
Solution: Init the sidekiq job on
OK fine! Did it solve my problem? NOOOO.
5, When we initiate the Sidekiq job in the after_commit callback, it is not easy to roll back the transaction if the delete fails. It is also not a good idea to use after_commit in this situation, because if the resource deletion fails, the transaction should be rolled back (note: we are resetting the account, not deleting it, so there should be no resources left under the reset account).
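To make the trade-off of an after-commit hook concrete, here is a minimal plain-Ruby sketch. FakeTransaction is a hypothetical stand-in for a database transaction, not ActiveRecord itself: hooks registered during the block run only if the block finishes without raising, and by the time they run nothing can be rolled back any more.

```ruby
# Hypothetical stand-in for a DB transaction with after-commit hooks.
class FakeTransaction
  def initialize
    @after_commit = []
  end

  # Register work to run once the transaction has committed.
  def after_commit(&block)
    @after_commit << block
  end

  def run
    begin
      yield self
    rescue StandardError
      # Rollback: the registered hooks are simply dropped.
      return :rolled_back
    end
    # Commit has happened; a failure in a hook can no longer undo it.
    @after_commit.each(&:call)
    :committed
  end
end

enqueued = []

# Destroy succeeds: the cleanup job is enqueued only after the commit.
ok = FakeTransaction.new.run do |tx|
  tx.after_commit { enqueued << :cleanup_user_1 }
end

# Destroy fails: the transaction rolls back and nothing is enqueued.
failed = FakeTransaction.new.run do |tx|
  tx.after_commit { enqueued << :cleanup_user_2 }
  raise "destroy failed"
end
```

This removes the race with the commit and the stray job after a rollback, but it also shows the catch from point 5: once the hook fires, a failed file deletion has no transaction left to roll back.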
Then I heard a voice from an invisible source saying, “Oy, cool down Manohar, what are you trying to do? You are wasting time fixing issues caused by a bad solution. Sit down and relax, maybe have a beer, and then think.”
Well, I didn't have a beer, but I did take a break.
I trusted the voice from the invisible source and ditched the async worker idea and its problems. I concentrated on the problem statement and started from scratch.
So what should I do??
After sitting idle for a few hours, again a sound from the invisible source: “Soft Deletable”.
Yes. Soft delete the resources (make them look deleted while they still exist in the DB), and add a scheduled task that collects all the soft-deleted resources and destroys the entries and their S3 files.
Did Soft Deletable solve all my problems? Yessss!! :)
- Time out: resources are not deleted inside the transaction.
- Rollback: the transaction can be easily rolled back, as soft delete just updates a DB record.
I had a couple of options.
Awesome!!!!! That does the job.
Oh wait, wait. Soft deletable did solve the problem, but I could not push it to production. By the time I finished the task and verified it on staging, it was Friday 6:30 PM (the time at which the code editor goes to the background). And here came a senior colleague with a warning: no releases on Friday.
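Here is a rough plain-Ruby sketch of the soft-delete idea, under stated assumptions: Resource, ResourceStore and FakeBucket are hypothetical in-memory stand-ins (no Rails, no real S3), and deleted_at plays the role of a soft-delete column. The reset only stamps deleted_at; a separate sweep removes the file first and the record second, so a failed file delete just leaves the record for the next sweep.

```ruby
# Hypothetical resource record; deleted_at is nil until it is soft-deleted.
Resource = Struct.new(:id, :file_key, :deleted_at)

# Hypothetical S3 stand-in that can be told to fail for some keys.
class FakeBucket
  attr_reader :keys

  def initialize(keys, failing: [])
    @keys = keys
    @failing = failing
  end

  def delete_file(key)
    raise "storage unavailable for #{key}" if @failing.include?(key)
    @keys.delete(key)
  end

  def heal!
    @failing = []
  end
end

class ResourceStore
  def initialize(resources, bucket)
    @resources = resources
    @bucket = bucket
  end

  # What the UI may show: only records that are not soft-deleted.
  def visible
    @resources.select { |r| r.deleted_at.nil? }
  end

  # IDs of records still present, visible or not.
  def stored_ids
    @resources.map(&:id)
  end

  # Account reset: a cheap in-DB update, no external call, easy to roll back.
  def soft_delete_all(now: Time.now)
    @resources.each { |r| r.deleted_at ||= now }
  end

  # Scheduled sweep: delete the file, then drop the record.
  # If the file delete raises, the record stays and is retried next time.
  def purge
    @resources.reject! do |r|
      next false if r.deleted_at.nil?
      begin
        @bucket.delete_file(r.file_key)
        true
      rescue StandardError
        false
      end
    end
  end
end

bucket = FakeBucket.new(["a.png", "b.pdf"], failing: ["b.pdf"])
store = ResourceStore.new(
  [Resource.new(1, "a.png", nil), Resource.new(2, "b.pdf", nil)],
  bucket
)

store.soft_delete_all
visible_after_reset = store.visible.size

store.purge                       # first sweep: b.pdf cannot be deleted yet
ids_after_first_sweep = store.stored_ids.dup

bucket.heal!
store.purge                       # second sweep: storage is reachable again
ids_after_second_sweep = store.stored_ids.dup
```

From the user's point of view the resources vanish at reset time, while the slow and failure-prone file deletion happens later and can be retried as often as needed.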
Finally, Lessons learnt:
Technical: The business may require functionality to appear immediate. For example, as soon as an account is reset, multi-user features on that account must be disabled. However, that does not mean the implementation has to be immediate as well. Trying to make it so, especially when external systems are involved, may lead to unnecessary workarounds and a poor user experience. Simple patterns such as “Soft Delete” give the user the perception of immediacy while keeping the system resilient to unavoidable failures or delays in reaching external systems, by abstracting away the details of deletion.
Philosophical: When something is broken, don't waste your time fixing it; it will never be the same. Try to walk away, but make sure you don't trip on it.
Personal (too many): Don't miss the date. Friday is not doomsday; there is always a Monday next to it. You don't have to behave like you are super dedicated to finishing things right away; sometimes you just need to “give it some time”.
Image Source: http://giphy.com/, http://londons365.tumblr.com/