Anatomy of a bad bug fix: Part 1

I recently had a horrendous bug-fixing experience with a web app I am working on. I learned so much from it that I wanted to share the play-by-play in Part 1, and the lessons I learned in Part 2 [coming soon].

Day 1 – Afternoon: The Problem is Discovered

The UI of my web app kicks off a task on a Celery queue and then polls for a ‘SUCCESS’ status before letting the user proceed. The heavy lifting happens in the background on a worker process, and users see a nice spinner until it finishes.
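To make the setup concrete, here is a minimal sketch of that kick-off-and-poll pattern as a pair of Django views. The names (long_task, start_task, task_status) are hypothetical, not the app’s actual code.

```python
# views.py -- a sketch of the kick-off-and-poll pattern (hypothetical names)
from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import long_task            # the Celery task doing the heavy lifting


def start_task(request):
    result = long_task.delay()           # enqueue the work for the worker process
    return JsonResponse({'task_id': result.id})


def task_status(request, task_id):
    # The UI polls this endpoint and shows the spinner until it sees 'SUCCESS'.
    return JsonResponse({'status': AsyncResult(task_id).status})
```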

Some of my alpha users were seeing the UI hang forever. I started thinking of obvious problems:

  • The task is not getting started

  • The task is getting started but it is not completing

  • The task result is not being sent back properly to the UI

I tailed the logs and noticed that the task was starting, and it was completing too. So I decided to repro the issue from home that night.

Day 1 – Night: I try to repro the problem

When I got home, I tried to repro the problem but, lo and behold, I couldn’t. This was frustrating – I had just seen it a few hours earlier! I started asking myself a few other questions:

  • Have I set up Celery the ‘right’ way?

  • Am I monitoring Celery correctly?

Just a little bit of background: Celery passes _messages_ to worker nodes. These messages can be _stored_ (e.g. in a database or key-value store) or _transient_ (passed directly using RabbitMQ or some other AMQP broker). Using the database as the message store is the simplest approach (all you have to do is add a few tables), and you don’t have to add a new component to your stack. I was using django-kombu for just this purpose, so I started researching whether this was the right choice.
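For reference, the database-as-broker setup amounts to little more than a settings change. The sketch below reflects the typical django-kombu configuration of that era; the exact setting names vary across Celery and kombu versions, so treat it as illustrative only.

```python
# settings.py -- sketch of the database-as-broker setup with django-kombu
# (setting names vary across Celery/kombu versions; treat this as illustrative)
INSTALLED_APPS = (
    # ... the usual apps ...
    'djkombu',            # adds the tables that hold the queued messages
)

BROKER_URL = 'django://'  # tell Celery to pass its messages through the database
```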

Sure enough, I found an opinion suggesting that this could be a problem: https://gondor.io/support/celery/

“Using the database is an effective way to get Celery running on a shared instance without having to pay more for a queue. We wouldn't recommend running with the database backend in a production site, but in some cases it may be fine.”

Notice that there is absolutely nothing wrong with this opinion. It is, in fact, solid advice. Even the creator of django-kombu, @asksol, echoes it.

I started thinking that maybe the problem was with django-kombu even though there was no evidence for this (especially since my app is in alpha, and hence has no load).

I went to sleep.

Day 2 – Night: Hey, let’s add a new component to our stack!

By this time, I was home and desperate to fix the bug. It was showing up for a lot of people, and blocking alpha testing. I talked to Olga about it, and she wondered if something in my friend’s environment might have been causing the problem. I brushed this off.

I decided to use Heroku’s Redis add-on as the message store. I went with the free 5 MB option, since I was storing tiny task results anyway. This took a couple of hours to get going, but it was mostly painless. Soon I had Celery working with Redis as the message store, and deployed the app happily to Heroku.
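For the curious, switching the broker is again mostly a settings change. The sketch below assumes the add-on exposes its connection string through an environment variable; the REDIS_URL name is a placeholder, since the actual variable depends on which Redis add-on you use.

```python
# settings.py -- sketch of pointing Celery at a Heroku Redis add-on
# (the REDIS_URL variable name is an assumption; it depends on the add-on)
import os

BROKER_URL = os.environ.get('REDIS_URL', 'redis://localhost:6379/0')
CELERY_RESULT_BACKEND = BROKER_URL    # task results live in the same Redis instance
```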

Notice that I had done a few bad things:

  • Added a new component (one might argue, a whole layer) to the stack.

  • Introduced a new system which I knew nothing about. Redis has cryptic syntax and a crappy command-line utility. So, in a pinch, I would have to spend a lot of time learning how to debug Redis issues.

  • Brushed off advice (Olga’s hunch about my friend’s environment) that ran contrary to my current thinking – always a bad move.

  • Worst of all, I was still at home, testing from an environment where I had never actually reproduced the bug.

Day 4 – Night: Time to show off my mad dev skills

A couple of days later, we had a friend over for dinner and I demoed the app. During the demo, it crashed with a 500 Internal Server Error. Oh no! This was a completely different bug. The stack trace pointed to a problem querying Redis, and Heroku’s management UI showed that my app had used up all 5 MB of its allocated Redis memory. What the hell was happening?

At this point, my lack of familiarity with Redis hit me. I spent a good hour just figuring out how to connect to the Redis store and dump its data. Near midnight, I came across an issue on GitHub that described exactly what I was running into: some Celery message queues do not get cleared up on the Redis backend.

Crap.
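If you ever find yourself in the same spot, a few lines of redis-py are enough to see what is filling the store. The snippet below is a sketch, and assumes the add-on’s connection string is in a REDIS_URL environment variable.

```python
# sketch: inspecting the Redis broker to see what is eating the 5 MB
# (assumes redis-py is installed; REDIS_URL is a placeholder for the add-on's URL)
import os
import redis

r = redis.from_url(os.environ['REDIS_URL'])

print(r.info()['used_memory_human'])        # how much memory is actually in use
for key in r.keys('celery-task-meta-*'):    # per-task result keys tend to pile up
    print(key, r.ttl(key))                  # a TTL of -1 means the key never expires
```

In hindsight, tightening CELERY_TASK_RESULT_EXPIRES so stale result keys get evicted sooner would probably have bought me some breathing room too, though that is a mitigation rather than a fix for the lingering queues described in the issue.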

I decided to just bump the Redis memory to 20 MB. I would have to pay for it, but at least my app would keep running for a while. I also posted a desperate question on Stack Overflow.

Day 5 – Morning: Let’s try to repro that bug again

My friend decided to try the app again, and ran into the exact same issue: the UI hanging. I verified that Redis was not out of memory. I verified that the task was running and completing.

I was about to pull my hair out. Fortunately, I decided to forget about this problem for a few hours.

Day 5 – Evening: The Dog Walker’s Epiphany

I was walking my dogs when I realized that something wasn’t quite right. Dear reader, you may already have guessed it, but you are probably a better programmer than me.

Maybe, just maybe, something was getting cached somewhere. I quickly prototyped a fix that night, adding a random nonce to the AJAX request.

Voilà! The problem was gone. It turned out the polling request was being cached by a proxy my friend was sitting behind.
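My actual fix was purely client-side: a throwaway nonce parameter on each poll, so the proxy never sees the same URL twice. A complementary server-side guard, sketched below against the hypothetical Django status view from earlier, is to mark the polling endpoint as uncacheable.

```python
# sketch: a server-side complement to the client-side nonce (view name hypothetical)
from celery.result import AsyncResult
from django.http import JsonResponse
from django.views.decorators.cache import never_cache


@never_cache    # adds Cache-Control headers telling browsers and proxies not to cache
def task_status(request, task_id):
    return JsonResponse({'status': AsyncResult(task_id).status})
```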

Coming soon – lessons learned!