The Bug That Got Away

22 Dec 2019, CPOL, 7 min read
Bug stories: Exposition, action, climax, resolution, or epic failures.

One thing that I've always loved hearing about from fellow engineers or reading about on technical blogs is bugs. Nasty ones. Ones that keep you up at night and those that will wake you from a dead sleep. These are the ones that great stories are built upon, because like many great stories, they have all of the pieces:

  • Exposition - Ah crap! There's a bug in here somewhere.
  • Rising Action - Let's dig into this and see how widespread it is and how we'll mitigate it.
  • Climax - The "Eureka!" moment when you've narrowed down the exact cause of the bug.
  • Falling Action - Implementing a fix, verifying it fixes the issue.
  • Resolution - Merging the fix into source control, knowing the bug will be gone (forever)!

There's an extreme satisfaction to be found in a good bug. The exploration, the thrill of the chase, and finally catching that bug red-handed and putting an end to it with extreme prejudice.

Unfortunately, not all tales have happy endings; sometimes the bug gets away.

The Exposition

This particular tale begins as most bug stories do - with a legacy software system. There isn't really anything special here: an older, cobbled-together front-end, an enterprise-grade database, etc. If you've seen one, you've seen them all.

At any rate, just prior to an upcoming major release, I get a ping from a colleague to look at something. One of the records in the database is corrupted with some really bizarre encoding patterns. There doesn't appear to be any rhyme or reason behind them; it's just screwy and inconsistent with just about every other area of this application:

Record A: Look everything is nice & shiny!.  
Record B: Look everything is nice & shiny!  
Record C: Look everything is nice & shiny!  
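
Records like these can look identical on a rendered page while differing in what is actually stored. As a minimal sketch for anyone chasing something similar, and assuming the corruption is some form of repeated HTML-entity encoding (an assumption; the article never identifies the cause), the hypothetical helper `encoding_depth` below counts how many rounds of decoding a value needs before it stops changing. The sample strings are invented for illustration and are not the article's data.

```python
import html


def encoding_depth(text: str, max_rounds: int = 10) -> int:
    """Count how many times `text` can be HTML-unescaped before it stops changing.

    Hypothetical helper: 0 means the value carries no entity encoding,
    1 means it was encoded once, 2 means it was encoded twice, and so on.
    """
    depth = 0
    current = text
    for _ in range(max_rounds):
        decoded = html.unescape(current)
        if decoded == current:
            break  # nothing left to decode
        current = decoded
        depth += 1
    return depth


# Invented sample values (assumption: entity-style corruption).
samples = {
    "clean": "nice & shiny!",
    "encoded once": "nice &amp; shiny!",
    "encoded twice": "nice &amp;amp; shiny!",
}

for label, value in samples.items():
    # repr() shows the raw stored characters rather than a rendered form.
    print(f"{label}: depth={encoding_depth(value)} raw={value!r}")
```

Printing the `repr` of each value alongside its depth makes the differences visible even when a browser or grid view would render all three the same way.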

So, upon seeing this, I said what any good developer would: "Oh, this should be a pretty simple fix."

The Rising Action

Software engineering is full of bugs.

There are countless systems, big and small, that are just riddled with the things. As an engineer I know this very well, as I've contributed my fair share of them. I've been a software engineer for over ten years and I've always considered myself to be thorough, especially when it comes to tracking down a bug: the research, the deep diving, and finally, the fix.

As with any bug, one of the first steps to fixing it is being able to reproduce it. I spoke with our QA team and they weren't immediately able to reproduce it, but mentioned they would look into it further. Hours pass and I receive another message, something to the effect of:

QA Person: Rion, I just spun up a fresh new environment and I can reproduce the issue!

At this point, I'm excited. I had been fighting with this for over a day and I'm about to dive down the bug fixing rabbit hole on the way to take care of this guy. I log into the new environment, and sure enough, QA was right! I can reproduce it! I should have this thing knocked out in a matter of minutes and my day is saved!

Or so I thought. Roughly two hours, almost to the minute, after I was first able to reproduce the issue, it stops occurring. I was literally in the middle of demonstrating the issue to a colleague and minutes later, it's completely vanished. How could this be? Nothing in the environment changed: no machine or web server restarts, no configuration changes, nothing. The bug, after just a matter of hours, seems to have resolved itself.

Skipping to the Last Page

Normally, as part of the rising action in a story, things build and build until they reach a point. At this point in my story, I should have figured out the root cause by now. The bug apparently was reproducible for a short while, but not long enough to determine the exact cause (lots of moving parts in this machine). So, I start adventuring to try to find a path to climb up that much higher on debugging mountain. I was pulling everything out of my bag of tricks including:

  • Examining IIS Logs - I checked through IIS logs in multiple environments: the production environments where the issue had occurred, the short-term reproducible QA environment, and my local environment.
  • Examining Event Viewer Logs - Maybe there was some type of exception that was causing the web server to restart and that magically fixed the issue. Surely, there would be something there.
  • Profiling Environments - In times when the issue was reproducible, I took advantage of the SQL Server Profiler and had logs of the exact calls that were being executed against the database.
  • Decompiling Production Code - As a Hail Mary, I attempted to decompile code from the production environment to ensure that the deployed code matched what I expected and that no calls outside expectation were being made.

Nothing helped. Every single new avenue I'd venture down would only further my confusion and leave me wondering what the heck could be causing the issue. After putting all of the pieces together, you could basically describe the issue as follows:

How could making two sets of calls, all traveling through the same endpoints, passing along the same data, and executing the same queries against the same exact stored procedures, result in different data (one being corrupted and the other not)?
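
One way to make that question concrete is to capture the value at each stage of a good run and a bad run and find the first stage where the bytes diverge. The sketch below is only an illustration of that idea, not what was done here: the stage names (`request_body`, `sql_parameter`, `stored_value`), the captured values, and both helper functions are hypothetical.

```python
from typing import Dict, Optional, Tuple


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One value is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def first_divergent_stage(
    good: Dict[str, bytes], bad: Dict[str, bytes]
) -> Optional[Tuple[str, int]]:
    """Walk the stages in order and report the first one whose bytes differ."""
    for stage, good_value in good.items():  # dicts preserve insertion order
        offset = first_difference(good_value, bad.get(stage, b""))
        if offset is not None:
            return stage, offset
    return None


# Hypothetical captures from a working run and a corrupting run.
good_run = {
    "request_body": b"nice & shiny!",
    "sql_parameter": b"nice & shiny!",
    "stored_value": b"nice & shiny!",
}
bad_run = {
    "request_body": b"nice & shiny!",
    "sql_parameter": b"nice & shiny!",
    "stored_value": b"nice &amp; shiny!",
}

print(first_divergent_stage(good_run, bad_run))
```

In this invented example the first two stages match and only the stored value differs, which is the shape of the puzzle described above: identical inputs all the way down, different data at the end.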

For the first time in years, I felt defeated by a bug. I started grasping at straws, looking for race conditions, outside forces that might be affecting the code, network throttling issues. Nothing.

The Bug Won

Many days and nights had passed. This bug was waking me up at night. I was dreaming about potential causes, only to run to my computer, try them out, and eventually realize they didn't work. Like every good engineer, I had a workaround in mind for this issue just minutes after encountering it, but I was determined to not have to end up there.

I had seen the issue locally, even for a fleeting moment, in several QA environments (again fleeting), and within several production environments. I had tried everything that I could think of, consulting countless peers to brainstorm the cause, but all that resulted in was spreading the bewilderment throughout the team.

This seemingly trivial bug had eluded every form of capture/resolution that I could think of. It left in its wake nothing but bewilderment, not only for me, but seemingly for everyone that I tried demonstrating the issue to. Eventually, much like a doctor, I had to call it.

After over a week of my life, days and nights, being spent pursuing this bug: it won. There wouldn't be a climax, there wouldn't be a happy ending, there wouldn't be a nice warm, fuzzy feeling of accomplishment; there'd be a few lines of hacky code to fix it.

I felt just like our friend Charlie Brown, and this bug had ripped the football away just before I'd ever get a chance to kick it.

It Happens

The reason that I wrote this, or that it's worth writing about, really has nothing to do with the bug itself. It has to do with me, and maybe even you. I've always considered myself great at solving problems, and thorough. I'll dig deep, keep digging, exploring, and won't stop until I can crack the problem. Until, in this case, I couldn't.

Being an engineer is typically about solving problems, but more importantly, it's about being practical. I could have easily spent several more days (and nights) trying to solve this problem and figure out exactly why it was happening, but honestly, the fix for it took no longer than 5 minutes to implement. This was about being able to admit defeat. Much like there's nothing wrong with admitting "I don't know", there's nothing wrong with knowing when to suck up your pride and move on.

If you ask me today, I still don't know what caused this issue. I'll probably never know, and that's alright. I'll let this one get away and tell its friends about me. I know I'll certainly make sure to tell mine about it.

This article was originally posted at http://rion.io/2019/12/21/the-bug-that-got-away

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States
An experienced Software Developer and Graphic Designer with an extensive knowledge of object-oriented programming, software architecture, design methodologies and database design principles. Specializing in Microsoft Technologies and focused on leveraging a strong technical background and a creative skill-set to create meaningful and successful applications.

Well versed in all aspects of the software development life-cycle and passionate about embracing emerging development technologies and standards, building intuitive interfaces and providing clean, maintainable solutions for even the most complex of problems.

Comments and Discussions

 
Question: Here's a bug that blew my mind
rrotstein, 23-Dec-19 9:07
When I first started out programming, I took a class in which we had to write a simulator for a simple, hypothetical machine, which would run a small program written in a pseudo-assembly language. The program was to calculate and output prime numbers between 1 and 20, then halt.

On my very first test, I was astounded to see that, instead of generating the prime numbers between 1 and 20, it generated prime numbers in descending order from 97 down to 1. Impossible! Crazy! Such a thing can NOT happen!

Then I found an error in the source code: I had used a lot of long variable names, with many underscore characters in them. But I found one place where I had entered a '-' character instead of a '_'.

Nonetheless, the compiler accepted it. I changed that character back to a '_', and then the program worked as expected.

But how could this happen? The compiler probably only looked at the first n characters of the variable names. It probably accepted the remaining trailing characters and generated a corresponding variable. It might have interpreted that '-' character as a minus sign, which then - somehow - caused the generated code to start at 100 and then work its way down. But I have no idea how this could actually have happened.

General: My vote of 5
Arkitec, 23-Dec-19 8:12
