Jayesh Lalwani

In 2000, the group responsible for the JPEG file format, the Joint Photographic Experts Group, decided to come up with a new version of the JPEG format. It was called JPEG2000. It had some really cool ideas. One of them was that it supported streaming of images. One JP2 image contained multiple resolutions of the same image, and the lower resolutions were kept up front. So, while you were downloading the image, you would get a lower resolution immediately. The advantage was that on low-speed connections, the browser could show a low-res image pretty quickly. Also, low-res devices could simply stop downloading the image once they had the resolution they wanted.

At the time, the hope was that the JP2 standard would make its way into browsers (spoiler: it still hasn’t). We wanted to use it to build mapping applications. We had encoded aerial photos into the JP2 format, and we had a server that returned SVG maps. Since there was no browser support, I built an ActiveX control that would stream the JP2 image and overlay it with the SVG maps. It was very cool. Our resolution was like 10 times better than Google Maps (at the time).

I was using a library called Kakadu. It was an open-source library that could stream and parse JP2 images. Kakadu performed really, really well. It was fast! Except that once in a while it would get stuck. Randomly. No pattern. So they pulled me in, and I suspected a problem arising from thread contention. So I got to debugging it. At the time, I was young, and I understood multi-threading pretty well, but I hadn’t really fixed a thread-contention problem before. I was excited.

So, I dug into their code and started debugging it. The first thing I figured out was that when I debug, the problem goes away. Crap! Essentially, the debugger itself acts as a synchronization mechanism, and it changes the timing of how the instructions in the threads execute.

So, I started adding logs. The next thing I figured out was that when I add logs, the problem goes away too! Crap again! Since the logging goes to the file system, the file system acts as a synchronization mechanism and throws the timing off.

So, I can’t debug, and I can’t log. It’s a fairly complicated piece of engineering that I hadn’t written. So, before I could solve the real problem, I needed to figure out how to troubleshoot in a multi-threaded environment. Crap!

So, I started thinking: it’s the synchronization by things outside the code that throws the timing off, right? So, as long as I kept the code inside the external synchronization very tight, I might prevent the timing from going off. So, I started minimizing my log statements. Eventually, I figured out that if I put in one-character logs, I was fine, as long as I didn’t put in too many of them.

So, the first thing I had to figure out was whether the code takes a different path when the problem occurs versus when it doesn’t. Remember, I could only put in single-character logs, and not too many of them. So, I started reading through the code without trying to understand it. Whenever I reached a decision point, I would put 2 logs in the 2 branches. One branch logged “\”, the other branch logged “/”. When I saw a loop, I logged “|” inside the loop. The resulting log would read as

  \|||//|||||||||\/\/\\\

I would run the app and note down these strings of characters when the code ran fine, then note down the characters when it didn’t. Next I compared the strings of characters to find the deviation. Finally, I would trace back through the code to find the spot at which the log message deviated.
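The technique can be sketched like this (a toy Python illustration; the function and its branches are invented for the example, and in the real case each character went to a log with as little machinery as possible):

```python
# Single-character "branch trace" logging, as a toy. Each decision point
# logs one character; comparing traces from a good run and a bad run
# shows where the two executions diverge.
trace = []

def log(ch):
    # One-character log: cheap enough (in the real case, one byte to the
    # log) to barely perturb thread timing.
    trace.append(ch)

def process(values):
    # Invented function, just to show the instrumentation pattern.
    total = 0
    for v in values:
        log("|")         # loop iteration
        if v % 2 == 0:
            log("/")     # one branch
            total += v
        else:
            log("\\")    # the other branch
            total -= v
    return total

process([2, 3, 4])
print("".join(trace))  # prints |/|\|/
```

Diffing the printed strings from a run that worked and a run that hung points at the first decision where the two executions took different branches.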

Since I couldn’t put in too many logs, I had to be judicious. Luckily, the Kakadu code was structured very, very nicely. I wanted to kiss the Kakadu developers (even though they caused the bug). All their code was layered very nicely. They had high-level functions that called lower-level functions, which called even-lower-level functions. So, I picked the topmost layer and put my magic single-character logs there. When I found the deviation, I would study the code to figure out why the deviation happened. Usually, it was because a lower-level routine behaved differently. So, I had to remove all my logs, and then add similar logs in the lower-level routine. I did this layer by layer till I found the bug.

This entire process took about three weeks. It was a one-character fix. There was a busy-wait loop in the rendering thread that waited for data to be loaded by a producer thread. It checked a counter using < where it should have used <=. Usually, the counter would jump from counter < expected straight to counter > expected, and everything worked fine. In the rare case where counter == expected was hit exactly, the rendering thread would prematurely parse the data, get an exception, and cleanly exit. That stopped all rendering.
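Assuming the wait condition was “spin while counter < expected”, the shape of the bug can be shown in a few lines (a Python stand-in; the actual code was C++ inside Kakadu):

```python
def done_waiting_buggy(counter, expected):
    # The original check: the busy-wait spun while counter < expected,
    # so it fell through as soon as counter == expected, one step early.
    return not (counter < expected)

def done_waiting_fixed(counter, expected):
    # The one-character fix: spin while counter <= expected, so the
    # renderer proceeds only once the producer has gone past `expected`.
    return not (counter <= expected)

# Common case: the counter jumps right past the boundary; both agree.
print(done_waiting_buggy(7, 5), done_waiting_fixed(7, 5))  # True True
# Rare case: the counter lands exactly on the boundary.
print(done_waiting_buggy(5, 5), done_waiting_fixed(5, 5))  # True False
```

Only the boundary case differs, which is exactly why the failure was so rare and so timing-dependent.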

Three weeks, one character. I should get a T-shirt that says that.

Fixing this bug really showed me the value of building your code in layers.


Lakis Karmirantzos


I was supporting our multi-platform program (LabVIEW) on Windows 3.1.

The second most requested feature (hundreds of reports from users) was to fix the print functions to work on all Windows printers. The problem was that the string font would change size arbitrarily and become either too small or too large.
Our code worked fine on MacOS, SunOS, HP-UX, Windows printers that supported PostScript, and high-end Windows printers.
But the average user did not use these fancy, expensive printers. No, they were using the cheap $50 printers that you bought during sales at Best Buy or Circuit City.
It looked like 50% of these cheap printers worked fine, but the other 50% failed. Of course, Microsoft Word and Excel worked fine on all these printers.
But it was impossible to figure out the problem. The same code that drew the data on screen drew the data on the printer. Same code, same picture; some printers worked, some didn’t.
I was baffled. I asked around. I went to Microsoft (we had some special business support). Nothing helped.

Five years (yes, not a typo, 5 years) later, I was still trying to debug the issue when, in an act of desperation, I decided to calculate my own bounding box of a string one character at a time and add them together, instead of relying on the built-in function that takes the whole string and returns its bounding box (based, of course, on font size).
And that is when I discovered that the built-in function in Windows is actually implemented by the device driver. And the device drivers for cheap printers were buggy as hell. They could calculate the size of one character at a time, but give the driver a whole string and the answer was a random bounding box. Basically, cheap printers had cheap printer drivers with multiple bugs.
Switching to my calculation instead of the printer driver’s calculation fixed the problem. My code had been right from the beginning.
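The workaround can be sketched like this; the character widths here are made up for illustration, standing in for the driver's per-character text-extent call (which even the cheap drivers got right):

```python
# Made-up per-character widths, a stand-in for querying the driver for
# one character's extent at a time.
CHAR_WIDTHS = {"i": 3, "l": 3, "m": 9}

def char_width(ch):
    return CHAR_WIDTHS.get(ch, 6)

def string_width(s):
    # The workaround: sum single-character extents yourself instead of
    # asking the driver for the whole string's bounding box at once.
    return sum(char_width(ch) for ch in s)

print(string_width("mill"))  # 9 + 3 + 3 + 3 = 18
```

Summing per-character extents ignores kerning and inter-character adjustments, but a consistent answer beats an occasionally random one.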

Just because an API exists, it does not mean that it does the right thing under all circumstances ;-)

Rafael Sarres de Almeida


What's the hardest bug you've debugged?

It was not a hard bug for a developer, but it was for me. Let me lay out the background.

I was a network administrator at the Superior Court of Justice in Brazil (STJ) from 2000 to 2008 responsible for supporting and performance monitoring of Tomcat application servers, in addition to managing the entire local and remote data network. Our developers had full access to the production servers, they could deploy a new application or update an existing one anytime. Yes, I know, big mistake, but it was long ago, before the IT governance era on STJ. IT was really improvised, our “servers” were overglorified desktops spread across two rows of stands.

I was no Java developer. I had developed in C++ and Pascal a lot during my university years and kept creating shell and Perl scripts to support my role at STJ. One of my greatest skills was application troubleshooting at the network level using a packet sniffer, Sniffer Pro at first and, later, the almighty Wireshark. When I was idle at work, I liked to mirror some busy server port and spend hours understanding the cascade of Ethernet frames.

It was like being Cypher staring at the Nebuchadnezzar’s Matrix console:

“You get used to it, though. Your brain does the translating. I don't even see the code. All I see is blonde, brunette, redhead.”

Unfortunately I didn’t see the beautiful women that Cypher saw, but I knew the traffic baseline like no one else. It was before this cryptography-everywhere era. HTTPS was an unnecessary burden to our deskt… err, servers. Bottom line, I could see everything: requests, responses, database queries; I could measure response times and correlate them.

Modesty aside, I was good. I was so good that other IT departments lined up at my door for a troubleshooting analysis. DB, authentication, file servers, I debugged them all. They came with the problems: my application is slow, it fails at random times, and, most commonly:

“- I tested it thoroughly at my local development environment (application, files and database at a local desktop) and it worked fine, now that I moved to production, my performance is crap. Your network sucks!”

9 out of 10 times I could pinpoint the bottleneck: chattiness, huge file accesses, or even some leftover code that queried non-existent servers until a timeout. After some months they learned to stop blaming my top-notch data network and just came asking for help, which I gladly provided every time.

STJ had even given me a bonus for teaching a network troubleshooting course for other network administrators, where I taught every trick I had up my sleeve: packet filtering, reordering, time anchoring, TCP stream following, windowing, data reconstruction, HTTP request-database query correlation… Everything. Unfortunately, no other employee reached my troubleshooting skill level, maybe because I am a lousy teacher, or maybe there was not enough engagement on their part, or both.

Network was my kingdom and I was the king.

Enough blowing my own trumpet. Now, the bug: it was just another day at work when suddenly my Tomcat server crashes. First response: restart the service. The server runs fine again.

A couple of days later, another inexplicable crash in the middle of the afternoon. My boss complains. I ask the java developer team if they had made any change on the applications recently. They say the code is months old and no change had been made.

Some days later, another crash. I had enough. Our users are complaining, my boss is blaming my servers and the java developers swear that they did not change anything.

I enable every Java management extension I can find and start logging. Now I am just waiting for the next crash, which, of course, happens some time later.

I go through the logs, and what I find is a sudden surge in the memory usage of the Java process: it spiked to the maximum heap size and crashed the server, jumping from a healthy 50% to 100% in a matter of minutes. No clue which application or class misbehaved.

I ask the developers again: negative responses and some frowns at me. OK, maybe it is my server, let’s see…

With no management extensions left, I decided to call in the big guns on this bug:

RELEASE THE KRAKEN!

  • Wireshark laptop deploy: Check!
  • Server port mirror: Check!
  • Promiscuous mode: Check!
  • Spurious traffic filters: Check!
  • Circular buffers: Check!

Now that Wireshark is online, I just wait for the incident to happen again, and sure enough it does. I think: “Let’s get down to business, shall we…”

I look at the Java management extension logs and pinpoint the approximate time the server started misbehaving, based on the memory usage logs. I deep-dive into the packets around that time, and to my dismay, there are zillions of packets. I saw blondes, brunettes, redheads, dwarfs, zebras, giants, E.T.s… But not a single defective request. The server was at a very busy time; the capture was simply too crowded. It was a needle in a haystack.

Think, McFly, think! How is a misbehaving request different from a well-behaved one?

OF COURSE!!! There is no spoon!

There is no RESPONSE from a defective application!!!

I crack my knuckles and start coding some filters. First, only HTTP requests. Then, I add the responses. I separate the individual streams, fiddle with the time counting and ordering, and VOILA! I locate the only HTTP request with absolutely no response. Just a TCP ACK and radio silence!! And it occurs just minutes before the crash, right at the start of the memory usage climb.
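The pairing logic, reduced to a toy sketch (this is not Wireshark's filter language, just the idea of diffing request streams against response streams; the stream ids and events are invented):

```python
# Toy capture: each HTTP event is (tcp_stream_id, kind). A real capture
# would come from a sniffer, with one TCP stream per connection.
events = [
    (101, "request"), (101, "response"),
    (102, "request"),                     # never answered: the suspect
    (103, "request"), (103, "response"),
]

requests = {sid for sid, kind in events if kind == "request"}
responses = {sid for sid, kind in events if kind == "response"}

# A request whose stream never carries a response is the misbehaving one.
unanswered = requests - responses
print(sorted(unanswered))  # [102]
```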

Now I remove some filters and analyse what packets appear around this problem. Sure enough, there is a database query immediately after the HTTP request is received by the server. And another. And another. The request caused an infinite loop of repeated database queries until the server crashed. I extract the offending SQL query text.

Time to locate the faulty code. I do a grep on my server files, searching for the SQL query I have just discovered. I find the Java file containing the query, and of course, it is inside a loop.

I march to the Java developers’ office, kick the door in and, without warning, yell only the name of the defective Java file. A guy in the corner of the office stares at me with big startled eyes and, in an almost inaudible squeak, moans:

“- This change shouldn’t have caused a problem…”

Long story short: developers were kicked out of the production servers. They now had to open a formal request to update applications outside of business hours. We started a long-overdue IT governance project.

Epilogue

Two months after I left STJ in 2008, I receive a call from the new network administrator asking for help debugging an application misbehavior. I instruct him to mirror the correct server port, capture some data during the issue and send me the file.

Sure enough, I found the problem.

I recently heard that there are still tales around STJ about a legendary guy that used to solve all IT software problems armed only with Wireshark.

These were really fun times!


Edit: As this answer got quite a view count and was even read by one of the STJ developers who witnessed my adventure (thanks for dropping me a line in the comment section, Flavio Borges Botelho), I dove into my digital archives for the actual bug report I wrote after kicking the door in. And I found not only the almost-12-year-old report, but even the actual capture file, with the offending HTTP request’s packet number (63352) conveniently written in its file name.

Nicely done, young Rafael McFly. :)

So I think it would be entertaining for my nerd readers to see the actual NMIS graphs, Wireshark capture and the demoniac Java code snippet. Enjoy.

Clue 1: Abnormal data traffic increase minutes before each crash (I had two on this day):

Clue 2: JVMStat Old Generation memory pool usage during the same time frame:

Clue 3: An HTTP request without an answer, buried deep in a 74-second capture file with 100,000 packets, aka a needle in a haystack:

Clue 4: A barrage of identical SQL queries just after the never answered HTTP request:

A grep later: VOILA

And remember, kids…

Never trust a database query to stop a loop!
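The moral can be sketched as a toy (Python stand-ins here; the story's actual code was Java with a SQL query inside the loop):

```python
def fetch_until_rows_buggy(query):
    # The antipattern: the only exit condition is the query returning
    # rows. If the data never shows up, this re-queries forever, piling
    # up work until something (in the story, the JVM heap) gives out.
    while True:
        rows = query()
        if rows:
            return rows

def fetch_until_rows_bounded(query, max_attempts=5):
    # Defensive version: the loop is bounded no matter what the DB says.
    for _ in range(max_attempts):
        rows = query()
        if rows:
            return rows
    raise TimeoutError("no rows after %d attempts" % max_attempts)
```

The bounded version fails loudly instead of silently eating the server, which turns an outage into a log entry.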

Dave Baggett


It's kind of painful to re-live this one. As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware.

This is my hardware bug story.

Among other things, I wrote the memory card (load/save) code for Crash Bandicoot. For a swaggering game coder, this is like a walk in the park; I expected it would take a few days. I ended up debugging that code for six weeks. I did other stuff during that time, but I kept coming back to this bug -- a few hours every few days. It was agonizing.

The symptom was that you'd go to save your progress and it would access the memory card, and almost all the time, it worked normally... But every once in a while the write or read would time out... for no obvious reason. A short write would often corrupt the memory card. The player would go to save, and not only would we not save, we'd wipe their memory card. D'Oh.

After a while, our producer at Sony, Connie Booth, began to panic. We obviously couldn't ship the game with that bug, and after six weeks I still had no clue what the problem was. Via Connie we put the word out to other PlayStation 1 developers -- had anybody seen anything like this? Nope. Absolutely nobody had any problems with the memory card system.

About the only thing you can do when you run out of ideas debugging is divide and conquer: keep removing more and more of the errant program's code until you're left with something relatively small that still exhibits the problem. You keep carving parts away until the only stuff left is where the bug is.

The challenge with this in the context of, say, a video game is that it's very hard to remove pieces. How do you still run the game if you remove the code that simulates gravity in the game? Or renders the characters?

What you have to do is replace entire modules with stubs that pretend to do the real thing, but actually do something completely trivial that can't be buggy. You have to write new scaffolding code just to keep things working at all. It is a slow, painful process.

Long story short: I did this. I kept removing more and more hunks of code until I ended up, pretty much, with nothing but the startup code -- just the code that set up the system to run the game, initialized the rendering hardware, etc. Of course, I couldn't put up the load/save menu at that point because I'd stubbed out all the graphics code. But I could pretend the user used the (invisible) load/save screen and asked to save, then write to the card.

I ultimately ended up with a pretty small amount of code that exhibited the problem -- but still randomly! Most of the time, it would work, but every once in a while, it would fail. Almost all of the actual Crash Bandicoot code had been removed, but it still happened. This was really baffling: the code that remained wasn't really doing anything.

At some moment -- it was probably 3 am -- a thought entered my mind. Reading and writing (I/O) involves precise timing. Whether you're dealing with a hard drive, a compact flash card, a Bluetooth transmitter -- whatever -- the low-level code that reads and writes has to do so according to a clock.

The clock lets the hardware device -- which isn't directly connected to the CPU -- stay in sync with the code the CPU is running. The clock determines the baud rate -- the rate at which data is sent from one side to the other. If the timing gets messed up, the hardware or the software -- or both -- get confused. This is really, really bad, and usually results in data corruption.

What if something in our setup code was messing up the timing somehow? I looked again at the code in the test program for timing-related stuff, and noticed that we set the programmable timer on the PlayStation 1 to 1 kHz (1000 ticks/second). This is relatively fast; it was running at something like 100 Hz in its default state when the PlayStation 1 started up. Most games, therefore, would have this timer running at 100 Hz.

Andy, the lead (and only other) developer on the game, set the timer to 1 kHz so that the motion calculations in Crash Bandicoot would be more accurate. Andy likes overkill, and if we were going to simulate gravity, we ought to do it as high-precision as possible!

But what if increasing this timer somehow interfered with the overall timing of the program, and therefore with the clock used to set the baud rate for the memory card?

I commented the timer code out. I couldn't make the error happen again. But this didn't mean it was fixed; the problem only happened randomly. What if I was just getting lucky?

As more days went on, I kept playing with my test program. The bug never happened again. I went back to the full Crash Bandicoot code base, and modified the load/save code to reset the programmable timer to its default setting (100 Hz) before accessing the memory card, then put it back to 1 kHz afterwards. We never saw the read/write problems again.
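The workaround amounts to saving and restoring a piece of hardware state around the card I/O. Here is a toy sketch of that pattern; the timer functions and the card write are made-up stand-ins, not the real PlayStation 1 API:

```python
DEFAULT_HZ = 100   # the console's default programmable-timer rate
timer_hz = 1000    # the game ran the timer at 1 kHz for physics

def set_timer_hz(hz):
    # Stand-in for the real hardware call.
    global timer_hz
    timer_hz = hz

def write_card(data):
    # Toy card write: records the timer rate in effect during the "I/O".
    return ("written", timer_hz)

def save_game(data):
    # The workaround: drop to the default rate for the card access,
    # then restore the game's setting afterwards.
    previous = timer_hz
    set_timer_hz(DEFAULT_HZ)
    try:
        return write_card(data)
    finally:
        set_timer_hz(previous)

print(save_game("slot1"))  # ('written', 100)
print(timer_hz)            # 1000
```

The save/restore bracket keeps the workaround local to the load/save code, so the rest of the game never notices the timer changing.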

But why?

I returned repeatedly to the test program, trying to detect some pattern to the errors that occurred when the timer was set to 1 kHz. Eventually, I noticed that the errors happened when someone was playing with the PlayStation 1 controller. Since I would rarely do this myself -- why would I play with the controller when testing the load/save code? -- I hadn't noticed it. But one day one of the artists was waiting for me to finish testing -- I'm sure I was cursing at the time -- and he was nervously fiddling with the controller. It failed. "Wait, what? Hey, do that again!"

Once I had the insight that the two things were correlated, it was easy to reproduce: start writing to memory card, wiggle controller, corrupt memory card. Sure looked like a hardware bug to me.

I went back to Connie and told her what I'd found. She relayed this to one of the hardware engineers who had designed the PlayStation 1. "Impossible," she was told. "This cannot be a hardware problem." I told her to ask if I could speak with him.

He called me and, in his broken English and my (extremely) broken Japanese, we argued. I finally said, "just let me send you a 30-line test program that makes it happen when you wiggle the controller." He relented. This would be a waste of time, he assured me, and he was extremely busy with a new project, but he would oblige because we were a very important developer for Sony. I cleaned up my little test program and sent it over.

The next evening (we were in LA and he was in Tokyo, so it was evening for me when he came in the next day) he called me and sheepishly apologized. It was a hardware problem.

I've never been totally clear on what the exact problem was, but my impression from what I heard back from Sony HQ was that setting the programmable timer to a sufficiently high clock rate would interfere with things on the motherboard near the timer crystal. One of these things was the baud rate controller for the memory card, which also set the baud rate for the controllers. I'm not a hardware guy, so I'm pretty fuzzy on the details.

But the gist of it was that crosstalk between individual parts on the motherboard, and the combination of sending data over both the controller port and the memory card port while running the timer at 1 kHz would cause bits to get dropped... and the data lost... and the card corrupted.

This is the only time in my entire programming life that I've debugged a problem caused by quantum mechanics.



Footnotes for posterity:

A few people have pointed out that this bug really wasn't a product of quantum mechanical effects, any more than any other bug is. Of course I was being hyperbolic mentioning quantum mechanics. But this bug did feel different to me, in that the behavior was -- at least at the level of the source code -- non-deterministic.

Some people have said I should have taken more electronics classes. That is absolutely true; I consider myself a "full stack" programmer, but my stack really only goes down to hand-writing assembly code, not to playing with transistors. Perhaps some day I will learn more about the "bare metal"...

Finally, a few have questioned whether a better development methodology would have prevented this kind of bug in the first place. I don't think so, but it's possible. I use test-driven development for some coding tasks these days, but it's doubtful we could have usefully applied these techniques given the constraints of the systems and tools we were using.


Profile photo for Jock McTavish

I get motion sick. Always have. Joined the Navy to kill or cure it. Did both. Learned to use Dramamine/Gravol to fend off the nausea. Basically I start the day before, and during the actual flight take a few more.

So anyway, I’m running an avionics shop in Calgary and trying to get new customers. My best ploy is to ask to get a chance to fix some difficult avionics snag they have in their fleet. So that’s what I did to Time Air back in the 80’s. Time Air had a few older Convair 580’s with first gen Sperry flight control systems.

And sure enough, they had such a problem. They had a 580 whose pilot’s flight director would topple once in a while, and it had done so ever since they acquired the plane; no one could fix it. They had sent every single component in the system out for testing, with nothing discovered.

So I jumped at the chance and said that’s exactly the sort of snag I loved, and when could I test-fly the aircraft, for that had to be where the problem was? Well, an engine change was due a week later, so they scheduled me to troubleshoot on that test flight. And that gave me a week to memorize the system.

I arrived on the appointed day and prepared for my testing. I had flashlights and oscilloscopes and meters and service manuals and blueprints ready. I even took the lid off the vertical gyro so I could see the relation between the actual gyro and the pilot’s flight director.

But first the pilots had to flight-test the engine. Well, when you take an airliner with no freight or passengers and very little fuel, you basically have a lively, overpowered aircraft with much the same power-to-weight ratio as a fighter! And that’s what the pilots thought they were flying, having a fine time with fast climbs and sharp turns!

Oops - the bottom fell out. I got so sick that I had to write notes to people because I couldn’t speak, and crawl on my hands and knees because I couldn’t stand. And my thinking became ponderous, slow and heavy.

But I found the bug.

All aircraft gyro systems have two rates of aligning the gyro with the earth: fast and slow. Fast erection is called for during startup, as the gyro mass is spooling up to full speed for maximum inertial stability, and it is fast erection that ensures the gyro is ready to use. Slow erection takes over after startup and continues while the equipment is in use. It is this gentler gravity correction that maintains the correct attitude relative to the earth while the aircraft bounces around the sky.

While the pilots were re-enacting the Battle of Britain, I saw the fault develop. It was the gyro itself toppling and the instrumentation was faithfully displaying the fault. There were no slow erect commands getting to the gyro. As long as the flights were gentle and coordinated the gyro would stay true to the earth, but if not, the precessional forces would not be corrected and the gyro would topple.

So we had the evidence we needed and returned to base. There I found there was no wire connecting the control box to the gyro. Then, comparing the aircraft blueprints with the Sperry drawings, I found the actual bug. There was an error on the Sperry system wiring drawing - the needed slow-erect wire was not drawn. And Allegheny Airlines, the first owner, installing the system in the 70’s, had followed the Sperry plan (manufactured in the 60’s). A 30-year-old blooper. But because Allegheny’s hangar burned down in the 80’s, the maintenance records were lost.

What a humungously horrifying and yet gladifyingly glorious experience!

Profile photo for Carl Henshaw

When I was in grad school, I was the lone programmer in a lab full of engineers, so I got to write basically all the code for the new robot we were building.

I had decided to use a new single board computer that used a compact PCI bus, which at the time was a brand-new standard. It was very expensive - $25,000 - which was a whole lot of money for a university lab, but the computer had specs that we just couldn't beat with other existing single-board computers at that time.

There were no available Compact PCI motor controller boards, so we had to use a motor controller board that was built for a different bus standard, and then convert from the Compact PCI bus to the other board's bus using a bridge chip. The particular motor controller board we chose was based on an 8-bit motor controller IC, the LM629. This chip uses memory-mapped 8-bit registers, and in order to communicate with it you have to write and read the registers in a very specific order. If you do anything in the wrong order, or you try to write to a read-only register, or vice versa, the chip generates an error.

I was a really good C programmer at that time, so I was able to crank out the code in two days. But it didn't work. Whenever we tried to communicate with the chip, it threw an error. I went over the code with a fine-toothed comb, and I was absolutely certain it was all correct. I had no idea what was wrong. I was looking pretty bad to my advisor; I was the C stud, and I couldn't even write this simple device driver. And worse, I had recommended that we use this particular computer system, which cost $25,000, far more expensive than any other SBC we had ever bought, and now I couldn't make the thing work.

Finally, after banging my head against it for a week and making no progress, we threw up our hands and asked the motor controller board vendor if we could bring our system to their facility and get their help debugging it.

We arrived at the vendor and set up. Their programmer checked my code, and he couldn't find anything wrong with it either. After two days the owner took pity on us and asked his best engineer, a digital logic expert, to help us. He carted in a $20,000 digital logic analyzer and hooked it up and had me run my code. What he discovered was that when I had issued an eight-bit read, the chip saw a 16-bit read, which it wasn't expecting, so it threw an error, because the high-order byte was getting read from a write-only register. But the code was clearly issuing an 8-bit read. So where was the 16-bit read coming from?

It turned out the bridge chip had a bug. When it saw an incoming 8-bit read request on one bus, it translated it into a 16-bit read on the other, then threw away the most significant byte. We called the manufacturer, and were told "that's known, documented behavior - it's clearly spelled out in the manual." And when we checked, sure enough, it was - it was mentioned on page 48 in the third footnote, in 8-point type.

The solution we eventually came up with was to cut all of the memory address lines on the motor controller board and shift them right by one, and then take the lowest bit line and connect it to the highest line on the chip. That way, access requests to any odd memory address would map into unmapped register space, so the chip wouldn't see them. Worked like a charm, as long as you remembered to only use even memory addresses. But I still feel sorry for the grad students who had that robot after we graduated. There was no way they ever figured out what we had done.


Profile photo for Stan Hanks

The One That Wasn't There...

Mid-80s, I'm working as a consultant for a medical devices company working on a new generation of Positron emission tomography scanners. Hush-hush, race to beat the big guys to market with a new technique for which the paper hasn't even been refereed yet.

I'm the real-time UNIX guy, doing an embedded system. I've done this a zillion times before for other types of control systems, this is my third medical device.

I think about it, doodle, prototype a little, then in a three-day sprint code the whole thing. Type "make install" and watch it compile, build, construct a download package, download to the device, and reboot.

And it worked. Perfectly.

I. Could. Not. Believe. It.

You NEVER write code that actually runs the first time; it's just a stub to kick off the debugging. Everyone knows that.

So I ripped it apart, stuck in debugging statements, hooked up logic analyzers out the wazoo, and spent literally a month on it.

IT HAS TO BE IN THERE SOMEWHERE

I knew it, the project manager knew it. We even brought in a colleague to provide a second set of eyes.

Nope. After 6 weeks, we declared defeat. Or victory, as you prefer. There was no bug. It really did work, perfectly, from the first time.

And no, I've never replicated that since.

Profile photo for Andrew Daviel

Here’s a hardware one; not particularly hard but it sticks in my mind.

I’d designed a data acquisition module for a physics experiment. The production run was not that large so we assembled them in-house, using people not robots.

One of the modules failed commissioning tests, so I was debugging it - sending commands via the instrument bus and tracing signals with a logic analyzer and oscilloscope. The initial inspection showed nothing obviously wrong - no missing parts, no solder bridges. I found a missing clock signal - there was supposed to be a pulse on a particular circuit trace to clock data into a register, but there wasn’t. The signals going to the chip that generated the pulse were OK, but it wasn’t there on the output. It looked like a short-circuit, so I checked with a meter. No, high-impedance as it ought to be.

After some head scratching and looking at the circuit board some more, I finally spotted a decoupling capacitor in the wrong place.

Not my circuit board, but a capacitor like C7, though rather smaller.

In digital circuits, a large instantaneous current can flow when a register changes from 0x0000 to 0xFFFF - all zeros to all ones. So we use low-inductance power and ground planes, and capacitors placed close to the chips to supply the required charge within a few nanoseconds. Electricity travels about a foot per nanosecond, and waiting for current to arrive from the power supply would take too long. This particular capacitor was in the wrong place: between a through-hole connection to a ground plane (as it ought to be) and a signal via - one of the small holes that connect traces on one side of the board to the other. On my board, the vias were bigger than the ones in the picture, and the small-diameter leads on the capacitor fit. The capacitor had a sufficiently large value to absorb the narrow clock pulse. It was effectively open to DC, but a short circuit to AC.


Profile photo for Fredrik Zettergren

This one didn't take long to figure out, but it was one of my weirdest debugging experiences of all time.

So, while studying electrical engineering at school, we had a class where you were supposed to do a project related to embedded systems. Two classmates and I really enjoyed the course and decided to build an autonomous RC helicopter. We attached an MCU to a helicopter to control the servos with some input from an accelerometer. The project was a little too ambitious for the 3 month class, especially since we were all new to embedded systems. But it was very fun, and we worked hard, so things were moving along fine.

One late evening we sort of had all the different parts of the system working and were ready to mount it all on the helicopter to start doing trial flights. The only problem was that once we started the system, the servos went bananas every now and then. We went over all the code several times, removed more and more pieces from the system but still couldn't get rid of this behavior.

After a long night of debugging where fewer and fewer things were making sense, we didn't really know what to do. One of the team members got so tired of everything that he leaned back, put his shoes up on the table, and closed his eyes for a while. Suddenly, the bug didn't appear any more. Tired and out of ideas, we started joking around about his shoes maybe being a magical cure for the bug. He played along and started taking his shoes up and down from the table. The freaky thing was that the bug actually wouldn't appear when his shoes were on the table, but did appear when they weren't. After a while we actually started considering that there could be a correlation. Half laughing, half crying from exhaustion, we did 10-20 runs with feet on the table / feet off the table, and the bug happened exclusively when his feet were off the table. I think this was probably one of my most confusing moments in life, at least related to technology.

That's when it hit us. Common ground! It turns out we had forgotten to connect ground between two parts of the system, which led to communication between them being extremely unstable and sensitive to pretty much anything. When my teammate put his feet on the table, he connected ground between the two parts of the system, through his shoe and the table where the other part of the system was located. Even though this connection was probably extremely weak, it was enough to make the communication a little more stable. As soon as we realized what had happened, we connected the missing wire and everything ran perfectly (well, at least with regard to that problem).

I guess we were lucky to stumble on the solution by accident relatively quickly so that it got to be more of a fun than a painful memory.

Profile photo for Quora User

A couple years ago, there was a crash in the Flash Player that was reported to us by both Mozilla and Microsoft. None of us could reproduce the crash; we knew where the crash was from a crash log, but it made no sense. In fact, there were several crash logs pointing to different lines of code that were due to the same bug (as we later realized).

Finally, an awesome Quality Engineer on our team was able to hunt down a machine on which it did crash, and was able to come up with fairly reliable repro steps. It turned out that it only occurred when using slow hard drives.

The crash would occur during the Flash Player's destruction sequence (like when you navigate to another web page in some cases), when a video was being deallocated. The video file stream wouldn't clear out in time, and exposed a thread synchronization issue.

The thing that made this bug so hard was that it was so hard to find a system to reproduce in-house, and the fact that there was some nasty multithreading going on around where the crash was.

I fixed that crash, and it turned out to be quite a common one. It probably prevented tens of millions of crashes -- in a time when browsers would crash along with the plugin.

We felt like heroes :-)

Profile photo for Santosh Lakshman M

Not my experience, but reproduced from here:

The case of the 500-mile email [ http://www.ibiblio.org/harris/500milemail.html ]



From trey@sage.org Fri Nov 29 18:00:49 2002
Date: Sun, 24 Nov 2002 21:03:02 -0500 (EST)
From: Trey Harris <trey@sage.org>
To: sage-members@sage.org
Subject: The case of the 500-mile email (was RE: [SAGE] Favorite impossible task?)

Here's a problem that *sounded* impossible... I almost regret posting the
story to a wide audience, because it makes a great tale over drinks at a
conference. :-) The story is slightly altered in order to protect the
guilty, elide over irrelevant and boring details, and generally make the
whole thing more entertaining.

I was working in a job running the campus email system some years ago when
I got a call from the chairman of the statistics department.

"We're having a problem sending email out of the department."

"What's the problem?" I asked.

"We can't send mail more than 500 miles," the chairman explained.

I choked on my latte. "Come again?"

"We can't send mail farther than 500 miles from here," he repeated. "A
little bit more, actually. Call it 520 miles. But no farther."

"Um... Email really doesn't work that way, generally," I said, trying to
keep panic out of my voice. One doesn't display panic when speaking to a
department chairman, even of a relatively impoverished department like
statistics. "What makes you think you can't send mail more than 500
miles?"

"It's not what I *think*," the chairman replied testily. "You see, when
we first noticed this happening, a few days ago--"

"You waited a few DAYS?" I interrupted, a tremor tinging my voice. "And
you couldn't send email this whole time?"

"We could send email. Just not more than--"

"--500 miles, yes," I finished for him, "I got that. But why didn't you
call earlier?"

"Well, we hadn't collected enough data to be sure of what was going on
until just now." Right. This is the chairman of *statistics*. "Anyway, I
asked one of the geostatisticians to look into it--"

"Geostatisticians..."

"--yes, and she's produced a map showing the radius within which we can
send email to be slightly more than 500 miles. There are a number of
destinations within that radius that we can't reach, either, or reach
sporadically, but we can never email farther than this radius."

"I see," I said, and put my head in my hands. "When did this start? A
few days ago, you said, but did anything change in your systems at that
time?"

"Well, the consultant came in and patched our server and rebooted it.
But I called him, and he said he didn't touch the mail system."

"Okay, let me take a look, and I'll call you back," I said, scarcely
believing that I was playing along. It wasn't April Fool's Day. I tried
to remember if someone owed me a practical joke.

I logged into their department's server, and sent a few test mails. This
was in the Research Triangle of North Carolina, and a test mail to my own
account was delivered without a hitch. Ditto for one sent to Richmond,
and Atlanta, and Washington. Another to Princeton (400 miles) worked.

But then I tried to send an email to Memphis (600 miles). It failed.
Boston, failed. Detroit, failed. I got out my address book and started
trying to narrow this down. New York (420 miles) worked, but Providence
(580 miles) failed.

I was beginning to wonder if I had lost my sanity. I tried emailing a
friend who lived in North Carolina, but whose ISP was in Seattle.
Thankfully, it failed. If the problem had had to do with the geography of
the human recipient and not his mail server, I think I would have broken
down in tears.

Having established that--unbelievably--the problem as reported was true,
and repeatable, I took a look at the sendmail.cf file. It looked fairly
normal. In fact, it looked familiar.

I diffed it against the sendmail.cf in my home directory. It hadn't been
altered--it was a sendmail.cf I had written. And I was fairly certain I
hadn't enabled the "FAIL_MAIL_OVER_500_MILES" option. At a loss, I
telnetted into the SMTP port. The server happily responded with a SunOS
sendmail banner.

Wait a minute...

Profile photo for Vivek Ponnaiyan

While working at the largest backend telecommunication equipment company in the world, on my checkin, 10 different types of backbone routers stopped communicating to their control servers.

That is, on 10 different platforms, there was no way to transfer the new code to the router. (It was actually more than 10 platforms, probably like 25-40, but since I can't remember I put in 10. Yes, 25-40 - all one code base in C ... #IFDEF hell :) )

So the whole company's build failed on my checkin. And I get a "nice" email from the CTO. Of course the bug wasn't in my code. :) ...

What had happened was that on my checkin, the size of the image on all these platforms had become an exact multiple of 1024 bytes. And I discovered that when the size of the image was an exact multiple of 1024 bytes, a bug in the FTP code would make the transfer hang, because it would drop the second-to-last packet. Insidious! If I remember correctly, it was one of those "off by 1" bugs.
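The mechanism of this class of bug can be sketched as follows. This is a hypothetical reconstruction in Java, not the actual router code (which was C, and reportedly dropped the second-to-last packet rather than the last one): a chunking loop whose exclusive bound silently loses a chunk exactly when the payload size is a multiple of the block size.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkBug {
    static final int CHUNK = 1024;

    // Hypothetical sketch of an off-by-one chunking bug: the loop bound
    // "off + CHUNK < size" should be "<=". For sizes that are NOT exact
    // multiples of CHUNK, the partial tail hides the error; for exact
    // multiples, the final full chunk is never emitted and the receiver
    // waits forever for bytes that never arrive.
    public static List<Integer> buggyChunks(int size) {
        List<Integer> out = new ArrayList<>();
        for (int off = 0; off + CHUNK < size; off += CHUNK) {
            out.add(CHUNK);
        }
        int tail = size % CHUNK;
        if (tail != 0) {
            out.add(tail); // partial tail chunk masks the bug above
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(buggyChunks(4097)); // 4 full chunks + 1-byte tail: all 4097 bytes sent
        System.out.println(buggyChunks(4096)); // only 3 full chunks: 1024 bytes silently lost
    }
}
```

The nasty property, as in the story, is that the bug is invisible for almost every file size; only a build that happens to land exactly on a 1024-byte boundary triggers it.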

I still can't believe I found that bug as I had never ventured near that code ever before, and it was a huge code base.

After finding the bug I got a congratulatory letter from the CTO and two sleepless nights :). The engineer assigned to fix the bug called me up, and with the magic of xkibitz I coded the fix in his repository, which he then checked in and tested.

Yes, these were the days of single-threaded embedded systems.

Profile photo for Ryan Mack

It's not my hardest bug but it's one of the most fun to tell about. I was working on Midnight Club: Los Angeles for PS3 and 360, past first, second, maybe even third submission (Rockstar had a habit of leaning pretty hard on the console first-party QA teams, probably one of the reasons why Microsoft now charges a ton of money for multiple submissions).

One particular race our designers are reporting their tires get kicked up a bit too high into the air when they drive over a single curb in a hairpin turn, causing them to go a bit wide and occasionally slam into a building. They can reproduce it 80% of the time, but only by running the first few minutes of the race, and the effect is pretty subtle.

They're playing the release build we submitted, so obviously I ask them to try to reproduce it in a debug build where we log the hell out of everything. In fact I owned much of the logging system because we used the system to detect divergences in the deterministic race replay logic and I had been responsible for instant replays.

No dice. 20 races later we still haven't seen it happen in a debug build. Maybe we imagined it. Fire up release again, and it's pretty clear the car is handling a bit different over this particular curb. Only one curb. In the entire freaking city of Los Angeles we modeled.

Needless to say, when we're past third submission and we have to ship for the holidays or we're all doomed, we don't want to waste any time. I start bisecting possible changes between our release and debug configurations. Eventually we get it down to "logging enabled" = "car drives fine," "logging disabled" = "tires bounce off the curb too much." Well shit.

We enabled the logging function calls in release builds but left the functions themselves as empty stubs. Never did figure it out. Game shipped. Onto the next project. I figure it was uninitialized stack variables but we'll never know.

Profile photo for Mourya Venkat
  • Error Pattern recognition.

Well, if there is a bug in code we’ve written, we can usually fix it by spending some extra time. But what if the bug is in the code editor? That’s going to freak you out.

I recently faced this issue with VSCode and spent almost 2 days debugging it. Below is the problem statement. Read it till the end; I guarantee it will thrill you.

Problem Statement:

When our Go microservice started in 2016, there was no official client library for Apache Kafka in Go, so we planned to use Sarama, an open-source Kafka client for Go. But recently Confluent, the company behind Kafka, released an official Go client called confluent-kafka-go.

Here comes the problem. The Confluent Kafka client internally imports functions written in C. So for every Go file in the package, the underlying operations are ultimately implemented in C.

If you look at the consumer code in that package, you will see an import called ‘C’.

The problem with this package is that you can’t actually import it from anywhere; the Go toolchain handles it specially at compile time, and even the C code written in the comments above it is interpreted by the compiler.

But poor VSCode doesn’t know this, and as usual, if a package is not found in GOPATH, it stops giving auto-suggestions and recommendations.

This became a hell of a task for me, because I was not able to see the underlying structs and their properties, without which I couldn’t write my Kafka client to produce or consume.

I started my debugging and read tons of references in the VSCode GitHub issues, but none of them had an answer. After 2 days, though, I found a pattern in the way VSCode works.

So this is what I did.

I went into each Go file of the confluent-kafka-go package and moved the non-importable ‘C’ package to the top of the import list and the importable packages to the bottom.

And kaboooom, it started working and showing me all the recommendations.

The pattern with VSCode is that if the last imported package is not found, it won’t show you any suggestions. But if that same package is at the top and all the packages below it are importable, then it works as it should.

The result is that I had to make the same change in all of those files for the auto-recommendations to work again. Hope this helps anyone who is setting up the Confluent Kafka client with VSCode.

Profile photo for Riki Fridrich

A typo in a variable named "smallIllustration".

Profile photo for Forest Kirst

Teaching “smart” people to read the manual. It is the biggest bug in most systems.

Yep, they have PhDs, MSs, government positions, funding, and all sorts of fancy credentials, but won’t read the manual.

Why won’t the IMU in my fancy machine stabilize? It’s broken. We need a new $15,000 IMU. Because the manual says to turn it on and let it run for 5 minutes parked on the ground so it can stabilize. It can’t stabilize in flight when it’s moving with the airplane.

This GPS is supposed to give 50-foot accuracy; why is it off by 500 feet? Because the satellite system can’t give 50-foot accuracy in this area without a correction signal, and there is no correction signal. That’s in the FAA manual on GPS.

Why is my laser gas chromatograph giving false readings? I calibrate it before every use. Because the manual says you are supposed to calibrate it with distilled water, not that stupid store-bought flavoured water, which has different chemicals in every bottle and flavour.

This fancy school attendance program can’t print out daily class attendance between periods. We need to get a whole new system to catch students ditching one class. Try entering ‘xvxvxvxv’ like the manual says on page 36. See how easy that was!

Read the damn manuals. I solved all these problems and more for the offending specialists. By reading the manuals.

Profile photo for Elad Raz

For me, the hardest debug challenges are those where most of the effort is focused on writing the correct debugging facilities. Here is my story...

The year was 2005 and a customer had asked me to debug a machine that was running Windows 2000 (SP4). The machine was crashing, displaying a Blue-screen (back then, Blue-screens were common) and wasn’t creating any dump files. (The original post included a Flickr snapshot, taken from fogindex, showing similar symptoms.)

When debugging, it’s always a good idea to understand the history of the bug as well as the current scenario, so I asked the customer to give me more details. That’s where things started to get complicated. It turns out that the bug was presenting itself ONLY when the machine was tested in the field - and when I say field, I mean it literally: a field. The crash was occurring only after the PC had been driven through rough terrain, sometimes a day after and sometimes a week after. To further complicate the situation, when the customer added 4GB of memory, things worked much better - the crash appeared only every couple of weeks. When I asked what the heck they were running, I received the following reply: a 2GB Visual Basic application(!!).

The fact that the bug was rare made it more difficult to catch and frustrating. For this reason, I generally advise users not to attempt their own work-around without first understanding the underlying cause of the problem.

Ok, now it was time to get to work. So what did I know so far? There was a standard PC, running an unmodified Windows 2000 kernel with a 2GB Visual Basic application, that was crashing every 2-4 weeks while driving in a field. Well, it wasn’t exactly a good starting point: debugging a kernel panic with only four numbers…

Looking at the Blue-screen images from previous crashes I saw two types of crashes: KERNEL_STACK_INPAGE_ERROR and KERNEL_DATA_INPAGE_ERROR. In both crashes the second parameter was STATUS_NO_SUCH_DEVICE. Reversing the Windows kernel, it was clear that the crash could have originated from only eight places, and in all of them the kernel tried to do a page-in (load a page from a cache to the memory) and failed. So the main challenge was to debug a crash that was happening once every two weeks and wasn’t leaving behind a dump file or any other debug information.

How to debug such a crash? Well, a lot of debugging infrastructure needed to be written. One of the challenges was displaying the logging information, since for unknown reasons the machine didn’t generate any dump file. I worked around this problem by replacing the Microsoft Blue-screen with a “green” screen that displayed dedicated information. Here is a (real) example (screenshot not reproduced here):


Whenever a Blue-screen displayed, I redrew a “green” screen with a stack trace of the crash (Enumerate stack frame using EBP register chain), and displayed information on the active devices to see why I got “STATUS_NO_SUCH_DEVICE”.

For the kernel hacker out there, the way I did this was by patching KeBugCheckEx (the kernel function which is invoked for every kernel-panic) using “code patching” techniques. Replacing the assembly bytes of the KeBugCheckEx from their normal function header:

  0x55,                               // push ebp
  0x8B, 0xEC,                         // mov ebp, esp
  0x81, 0xEC, 0x74, 0x03, 0x00, 0x00  // sub esp, 374h

into a jump call to my function:

0xE9, <Relative address> // jmp MyBugCheckHandler

The new jump invoked my function - “MyBugCheckHandler” - which displayed a green-screen using boot display API (e.g. InbvIsBootDriverInstalled). The function cleared the interrupt flag, avoiding any unwelcome context-switch. Since the function never exited, I could later take a digital camera and photograph my messages.

Since “MyBugCheckHandler” is just a function, it can call other kernel APIs. One of the APIs that I used was the Plug & Play API, in order to scan all devices (of FILE_DEVICE_DISK type) and see which device had been removed.

The result of the test was that for some unknown reason the hard drive (ATAPI device) was removed from the system, but I couldn’t understand when and why. It seemed like the drive was removed very early but the machine kept running. Only later did I connect the dots: since the customer had increased the memory to 4GB, there hadn’t been any paging activity until much later...

So, I modified my humble tool and created “Atamon”. Atamon is a kernel debugger that runs inside the kernel and places breakpoints in strategic places within the atapi.sys driver and logs them for future use:

The main purpose of the Atamon was to be able to read the ATAPI registers and display them. There wasn’t any other way to try to understand why the device was essentially committing suicide besides code-patching the atapi.sys driver.

Digging in, using the Atamon, I saw that at some point the disk decided to lock and remain locked. No matter what atamon.sys tried to do, and no matter the amount of resets to the controller, the BUSY bit of the ATAPI simply never went down. Changing Atamon to toggle the power line GPIO and forcing HW reset to the controller was the only thing that solved the issue. Furthermore, the Atamon tool could identify, in the field, the exact time of the ATAPI disk crash and helped understand the physical conditions leading to the crash.

And that, my fellow programmers, is the hardest bug I have ever come across. It totalled about one month of writing NT kernel-mode debugging infrastructure. Since then, I have fallen in love with my kernel debugger. I love it so much that I use it as a debugging tool for myself.

Instead of using WinDbg (which stalls the entire system and can’t gather information at runtime) to solve the kind of problem I have just described, using this tool to count and record API invocations while debugging proved to be useful. The only problem is that it’s not a generic tool and only I can use it. Hopefully one day I’ll gather the time and energy to release it as an open-source product.

Profile photo for Jonas Mellin

An error in Java TreeSet that caused an object to be in the set while not being in the set at the same time.

Essentially, if you asked tree_set.contains(obj) it returned false, but if you iterated over the set - for (Object o : tree_set) { if (o == obj) return true; } - then it was found. My workaround was to add a facade to the TreeSet and replace the contains method with the slower iteration-based implementation.
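The facade workaround described above might look like the following sketch. The class name and the reproduction in main are mine, not from the original code; one well-known way to get this exact symptom (whether or not it was the root cause in this story) is a comparison key that changes after insertion.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.TreeSet;

// Sketch of the facade workaround: delegate everything to TreeSet, but
// implement contains() as a linear scan so membership always agrees with
// what iteration finds. Names are illustrative.
public class SafeTreeSet<E> {
    private final TreeSet<E> inner;

    public SafeTreeSet(Comparator<E> cmp) { inner = new TreeSet<>(cmp); }

    public boolean add(E e) { return inner.add(e); }
    public Iterator<E> iterator() { return inner.iterator(); }

    // Slower O(n) membership test, matching the iteration-based check.
    public boolean contains(Object o) {
        for (E e : inner) if (e.equals(o)) return true;
        return false;
    }

    public static void main(String[] args) {
        // One classic way to reproduce the symptom: mutate a field used by
        // the comparator after insertion, so the tree lookup walks the
        // wrong branch while in-order iteration still visits every node.
        class Box { int k; Box(int k) { this.k = k; } }
        Comparator<Box> byK = (a, b) -> Integer.compare(a.k, b.k);

        TreeSet<Box> raw = new TreeSet<>(byK);
        SafeTreeSet<Box> safe = new SafeTreeSet<>(byK);
        Box b = new Box(1);
        for (Box x : new Box[] { b, new Box(2), new Box(3) }) {
            raw.add(x);
            safe.add(x);
        }
        b.k = 99; // corrupt the ordering key in place

        System.out.println(raw.contains(b));  // false: tree lookup misses it
        System.out.println(safe.contains(b)); // true: linear scan finds it
    }
}
```

The facade costs O(n) per lookup, which is why the author calls it slower; with thousands of sets it was a stopgap, not a fix.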

The problem was that the fault occurred very infrequently and I had to set conditional breakpoints in the solution and narrow down the search in steps. To check whether the fault occurred took about 5–10 minutes. There were literally thousands of sets, since it was an implementation of incremental and local anomaly detection.

It took me between 300–400 hours to track it down and once I found it I went WTF? I could not believe my eyes.

Then there was a hardware error in a Sun SPARC Ultra 2 that gave a memory exception while pointing at a machine instruction that did no memory access whatsoever. That fault disappeared.

Profile photo for Till Hänisch

Years ago as a student I wrote a monitoring software for signals coming out of a special cardiac catheter. In the lab it worked like a charm. But whenever used in the emergency room, the software stalled after a few minutes. I spent a number of hours debugging there (which is not fun with a pretty sick and unhappy patient lying next to you) finally finding out that the X-ray generator flipped a few bits in RAM from time to time ....

Profile photo for Michael Di Prisco

It happened a couple of years ago.

I was working on a big e-commerce site earning millions per year. We didn’t know what was going on, but people couldn’t pay via PayPal (80% of the revenue). Everything seemed fine: every piece of code was working correctly, and every single line was tested and var_dumped to death, but nothing came of it; people just couldn’t pay. I spent almost 2 weeks, 8 hours a day, 5 days a week, trying to fix the issue without finding any issue at all. Every variable was correctly set, every parameter sent to and from PayPal was correct, but still nothing worked. People just got sent back to a bank error page which was not supposed to be there at all. We had a PayPal error page too; that was the correct destination if something went wrong.

On the 9th day of trying, I noticed a little thing. A little, stupid, meaningless line of code was commented out: line 19,636 of a single file, a middleware, a little piece of code used to point a user to one success page or another depending on their payment method. That commented line had closed the if statement pointing the user to the correct payment - and consequently success - page, and someone who had tried to fix the issue before me just didn’t notice the commented line and put the closing brace a single line later, preventing the user from being pointed to the correct page. Of course, there was a try-catch covering any possible issue, but it wrongly pointed the user to the bank payment page, and that brought the user to an error page because bank credentials were not fully registered in the database.

In the end, we were in fact getting money from our users, but no order had been completed for that whole stretch of time, and our commercial department had to check every single order’s payment log one by one to confirm each order and ship the items our clients hadn’t received.

Profile photo for Ned Boff

Sporadic Network file transfer corruption, only happened during business hours, never repeatable by every method we could figure might cause such a thing….

Until we realized the operations staff had wrapped Christmas lights around the network drop poles… and in a couple instances near enough where the network wiring emerged from the poles that the “blinking” of the holiday lights would, on occasion if the timing was just right, mess up a packet in a way the error correction codes wouldn’t catch

Only happened when the lights were the blinking kind, solid illumination lights didn’t create the problem.

You can’t make this stuff up.

Profile photo for Faizan Ahmad

A bug for getting a bounty from google. It took me 40 hours but eventually, I found one.

Basically, I had received some acknowledgements from companies, but I really wanted my name on the Google Hall of Fame. I knew it would be hard to find a bug. I tried a lot but was unable to find one. I went to university, came back, ate supper, and hunted for bugs. This routine went on for almost 4 days, and I spent almost 30–40 hours finding just a single bug. I had to surf more than 100 Google subdomains and some acquisitions to find it. Although the bounty was small, I felt utter happiness when I received it. I had started bug bounties just a few months before, and getting an acknowledgement from Google was what every bug hunter dreamed of.

PS: The vulnerability was in an outdated plugin used on one of the acquisitions’ websites. A public exploit for that plugin was available.

Profile photo for Jeff Nelson

I just posted this under another question, but it's also the hardest bug we ever tackled:

Years ago at eBay, we hit a number of versioning bugs during our transition to Java technology on our servers.

Some of the versioning problems were quite obvious. For example, I don't think anyone would be surprised to learn that you can't compile code against IBM's JVM 1.5 and then run the bytecode against Microsoft's JVM 1.4.

Later, we had similar problems when a project was transitioning to JVM 1.6. Code compiled for IBM's JVM 1.5 was compatible about 99.9% of the time with JVM 1.6, but that remaining 0.1% was enough to cause serious headaches. We had to carefully assure that all compiled code was compiled for the appropriate JVM targets. Most Java developers are aware of these potential issues as well.

But then we hit the ultimate versioning problem: Different builds of IBM JVM 1.6 were incompatible with each other. Same major and minor version of the JVM, same manufacturer, just different builds. The problem manifested as a memory leak, though. Further confounding a solution, it was impossible to reproduce outside of production, because engineers were compiling and running on consistent versions of the JVM installed on their boxes.

That one took a significant amount of digging to figure out and caused intermittent outages for a period of about 4 weeks(*), because we just couldn't nail down the cause of the problem. We eventually brought in an IBM rep to help us diagnose the problem. No one remotely expected that different builds of the JVM with everything else equal, could itself be the cause of such an obscure problem.

The final solution was just to install the server JVM on every developer box, so that we could compile against the JVM server target. We also worked with IBM to nail down the root cause of the memory leak, when JVM builds were inconsistent.

The moral of the story, always compile against the very same version of your Java JVM and JRE that you intend to run against.

No pain, no gain: This incident had a silver lining, because it resulted in eBay building out a much more rigorous production debugging framework, and a production sandbox environment where engineers could more easily get access to running production servers for the purposes of testing bugs directly against production traffic.


(*) Fortunately eBay had enough redundancy, that there was no customer impact.

Profile photo for Gavin Baker

My favourite bug of all time was uncovered because I was just too impatient!

I was working on a fairly large and complex embedded project. A microcontroller was interfacing with some custom hardware. It would configure, control, and monitor the hardware. The microprocessor was running an RTOS with a lightweight TCP/IP stack, so remote clients could connect to the device via Ethernet and perform remote configuration. We also used this for diagnostics and testing during development. We had successfully used a similar design on several products, with great results. We had also designed a protocol for monitoring and control that was used in earlier models, and we extended it here - though this product was significantly larger and more complex than its predecessors. What could possibly go wrong?

The first thing the microprocessor does when you connect is to dump the status of the entire system. With smaller devices, this was very fast. But this new model was so large and had so many more components and subsystems that it was taking a very long time every time you connected. This device can also serve many concurrent clients, so this slow connection startup would be repeated for every client. The processing and time cost was not trivial, and this delay would limit how quickly clients could start updating their own settings.

Slow and inefficient code bothers me. Eventually I got sick of waiting a good 10-20 seconds or so every time I connected before I could start working with the hardware to test my latest changes, so I decided to track down the performance problem. As a baseline, I measured the system performance before changing anything, and then got to work. It had to be something simple, after all - we'd had this same code working on other products for many months in the field.

I figured a few hours would probably be enough to solve it. The debugging process was made somewhat more difficult by the fact that there was no keyboard or monitor on this hardware; all debugging was either via JTAG and a serial USB connection, or by the Ethernet port - which was part of the problem.

The actual time to solve the mystery would be more like a week, with several long, late, pizza-fuelled nights, drilling through many layers and thousands of lines of code to find the very surprising root cause.

First layer: application protocol

The most obvious point to start with was the application level buffering. A
protocol layer abstracted away the TCP/IP networking (which also allowed for
serial interfaces) and had its own read and write buffers. I guessed it might
help to increase the buffer size a little, to have fewer networking calls. A
little increase didn't help much overall, nor did a big increase. Ok, so it's
not the protocol buffering.

The status dump on connection has to traverse many data structures, build
strings and create structured text for output. I reviewed the algorithms,
memory management and improved the code by reducing reallocations and copies, tidied things up and used pre-allocated buffers more efficiently.

I measured the performance improvement and noticed a modest difference. All that work helped - but it was still not enough. It was just too slow for my liking.
According to my calculations, the system should be capable of *much* higher
throughput. I decided to keep digging.

Second layer: networking and OS

Underneath the protocol code was a high-level interface for networking. Perhaps this wasn't as efficient as it could be? After poring over the code and analysing potential bottlenecks, I found a few small issues, but no smoking gun.

Now this RTOS has many tasks (threads) running, only one of which is the
networking support. Could this networking task be getting starved for processing time? The interrupt handling latency should be guaranteed and well within the required time. I disabled all non-essential tasks, tried increasing the priority of the networking task, and various other tweaks. None had any impact. The RTOS kernel was perfectly happy and running smoothly.

Keep digging...

Third layer: TCP/IP tuning

The TCP/IP stack we were using has a boatload of parameters you can configure at build time, including its internal buffers. This was a prime candidate for performance issues. I dug up the documentation, went through our configuration, and sure enough - bingo! Several parameters were not at the recommended values for this version of the library. Some buffer sizes needed to be multiples of the packet size (e.g. the MSS), and tuned to match other significant parameters. This could have caused fragmented packets or memory buffers, introducing small but potentially disruptive delays to the flow.
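The story doesn't name the stack, but for a flavour of what this tuning looks like, lwIP (a widely used lightweight stack in this class) exposes exactly these kinds of interdependent knobs in `lwipopts.h`. The values below are purely illustrative - not the project's actual configuration:

```c
/* Illustrative lwipopts.h fragment - NOT the actual project config.
 * The key constraint echoed above: window and send-buffer sizes
 * should be multiples of the MSS, or packets and buffers fragment. */
#define TCP_MSS         1460                 /* max segment size */
#define TCP_SND_BUF     (4 * TCP_MSS)        /* send buffer, MSS-aligned */
#define TCP_WND         (4 * TCP_MSS)        /* receive window, MSS-aligned */
#define PBUF_POOL_SIZE  16                   /* packet buffer pool */
```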

This tuning process took many many hours, and eventually resulted in a decent
improvement in throughput. But was it enough? No - the big dump when the
connection was established wasn't just slow now, it was noticeably jerky and
very *bursty*. I really needed to see exactly what was happening between the two socket endpoints. I needed to understand why it was bursty - and fix it.

Fourth layer: on the wire

Having calculated the theoretical peak throughput, I decided there was no good reason this microprocessor shouldn't be able to maintain a much higher level of throughput. Time to do some low-level packet analysis.

I set up Wireshark and started capturing packets. At first, everything seemed
ok but looking at the timestamps showed clearly that the transmissions were very bursty. Sometimes there were delays of a few *seconds* between packets! No wonder it was taking so long for a full status dump... but what was causing
this?

Looking at the IP layer, I decoded and inspected the session piece by piece,
from the very first packet. `SYN, SYN-ACK, ACK...` All good so far. But after
transmitting only a few data packets: `NAK`. Retries? Backoff? Delays! What on
earth was going on? The trace showed the micro was resending packets it had
successfully sent. Yet by matching up the sequence numbers, it showed the
packets were being `ACK`ed by the other end. Eventually after receiving a few
out-of-order packets, the receiver tried to back off by increasing timeouts.
This perfectly illustrates the bursty nature of the traffic. But what could
be causing it?

Not leaving anything to chance, I tried changing Ethernet cables to make sure
it wasn't a dodgy connection causing the fault. No dice.

At this point, my best hunch pointed to a bug in the TCP/IP library. Resending
an already acknowledged packet? Madness! Since we had found bugs in this library before, it was quite conceivable. I upgraded the stack to the absolute latest version and reran all the tests. Same problem. Yet according to the forums and bug tracker, nobody else had reported this kind of problem with this stack
before.

I decided some major action was needed. I needed to partition the problem and eliminate large components to isolate the fault.

Isolation

First stop, to write a simple socket server which would accept a client
connection, and then send out packets in a tight loop, as fast as it could. This
would exercise the TCP/IP stack, driver and hardware without any of the protocol or application code. The packets contained a monotonic counter so I could see if any packets were being corrupted or lost.

Running this test and capturing packets on the wire revealed the same problem. A burst of traffic, a flurry of `ACK`s and `NAK`s followed by timeouts and
resends. Curses, foiled again!

Ok, how do I eliminate the TCP/IP stack from the equation? I constructed a UDP ping packet by hand, using parts of the wire capture data to fill in the
relevant fields (such as MAC addresses). I kept a monotonic sequence counter
and copied this into the binary ping blob at the correct offset, which I passed
directly to the driver, with my workstation hardcoded as the destination. I
started with a small delay, in the order of 100ms between ping packets. This
seemed to work ok. But as I decreased the delay, packets were being dropped.
Dropped?!

The only thing between this test and the wire is the device driver and hardware.
Could the driver be corrupting or dropping packets?

Fifth layer: device driver

A code review of the device driver didn't show up anything suspicious. Looking
at the memory management, interrupt handling - it all seemed quite carefully
written. Many hours later, no closer to the problem.

I pulled up the datasheet for the Ethernet controller and started querying the
status registers, halting the microprocessor and printing a diagnostic. There
were no clear errors to be found, so the driver did not appear to be causing the
hardware to fail sending or receiving data.

Sixth layer: hardware

The microprocessor has onboard Ethernet support, which is connected to a
separate MAC (Media Access Control) chip. This MAC performs the actual
electrical interfacing, and is the last piece of silicon before the wire. I
started reading the datasheet and looking at the initialisation sequence in the
driver, which configures the registers in the MAC on powerup. I verified the
correct register flags and values, but while I was reading I noticed there were
some counter registers which collected counts of certain types of media
(physical layer) errors.

I added some code to my minimalist hand-crafted ping test to read these counters from the MAC registers, showing the values before and after the ping burst. Sure enough, the counters were 0 on powerup, and after the ping test one of the error counters had a very large number. Ok, I think we're finally on to something...

Back on the wire

I modified the test program to send out hand-crafted `ARP` packets. The only
other code in play was the driver. I went back to Wireshark and captured
another session. This time, I exported the trace data to a file and analysed
the timing information in the headers.

I then stepped through and counted the number of successful packets sent before a failure. Then the next, and the next. And I started to notice a sort of
pattern. The gaps were something like 9, 17, 33, 51... and eventually it would
come back down and repeat. A regular pattern is very interesting, but what
could be causing this kind of failure?

Stepping back and looking at the regular pattern of successes and failures over
time was like looking at an interference pattern. Like ripples in a pond,
where the waves met, packets were dropped. A colleague observed that this
looked a bit like there were two slightly different frequencies involved...
Wait a minute!

Don't blame the Hardware

It was nearly midnight, and I desperately wanted to talk to the Hardware
Engineer who designed the system. But it would have to wait until the morning.
I fired off an email explaining what we had found, and went home exhausted.

The next day, I walked up to the Hardware Engineer who had a big grin on his
face. "I think I found your problem...", he opened. I was skeptical, but
excited and urged him to explain. "In the last board spin, I rerouted the clock
source of the MAC controller. So the Microprocessor and the MAC were actually running off two different clocks!"

I was elated. This perfectly explained the "interference pattern" we had
observed. The frequencies of the two clocks were supposed to be the same, but were not perfectly aligned. Even a slight difference in frequency would cause a 'beating' effect as they drifted in and out of phase. Much like you can hear when tuning a guitar, and two strings are almost, but not quite, in tune and you hear a lower frequency 'wow'.

So - while the two clocks were aligned, the microprocessor and the MAC
controller chip could reliably communicate, and the packets flowed normally. But as the clocks drifted slightly out of phase, their chip-to-chip communication
was corrupted as the rising and falling signals between them became misaligned. This explained why packets appeared to be sent or received at the higher layers, but were in fact lost in the intermittently garbled transfers. It's almost a marvel TCP/IP worked at all!

The Fix

In the end, it was a simple matter of ensuring both chips used a common clock
source - which required a modification to the PCB routing. But for now, to test
the issue, the Hardware Engineer broke out the soldering iron and fixed the
routing by hand on our development system. (We were fortunate that the clock
signal was on an outer layer of the PCB!) I started the test again, very
nervous to see the results. After days of chasing ghosts, I didn't want to get
my hopes up.

It worked! The hand-crafted ARP and ping tests would run for as long as we
liked, and never skipped a beat, running as fast as it could go. Finally, full
throughput was achieved. I queried the registers for protocol and link errors,
and it was all good. I checked the TCP/IP layer diagnostics for errors and
statistics, and there were no red flags. I went back to the original application
firmware and tested out the protocol, monitoring the wire for good measure.

This time, it took less than a second for a full status dump. Finally - success!

Wrapup

So - what began as a seemingly simple software performance problem eventually turned out to be caused by a hardware design fault. And it revealed several other opportunities for improvement along the way. This was a great learning experience, and a very satisfying puzzle to solve!

Random thoughts:

  1. TCP/IP is really, really good at surviving unreliable hardware and problems in layers below.
  2. Don't mix clock sources between chips!
  3. Don't assume that the first problem you find is *causing* the problem.
  4. Don't assume the improvement you make is sufficient. Measure!
  5. Performance is relative. What can you reasonably expect from the system in front of you?
  6. Performance tuning is all about careful measurements and consistent tests. And changing one thing at a time!
  7. It's hardly ever the fault of the OS. But it could be. And it's hardly ever the fault of the hardware. But it could be.
  8. Don't be satisfied with a fix until you understand *how* it works and *why* it fixes the problem.
  9. It is sometimes possible to diagnose hardware problems through analysing software.
  10. Persistence usually pays off!
Answer by Quora User

It was about 1975 or so. I was an E-3 level 3 Defensive Fire Control Tech on B-52H models at Grand Forks AFB ND.

The Defensive Fire Control System (DFCS) was adapted from the B-58. The B-58 had one radar. The B-52 had two, one angled left, the other to the right.

One new box was added, the System Control Assembly (SCA) to manage the right side. The left side used the Tracking Control Assembly (TCA), like the B-58. Both worked in conjunction with the Ballistics Computer.

The left side wouldn't track. The TCA was swapped out several times as was the Ballistics Computer. Wiring was checked between the two.

The left side antenna was removed and replaced as was the Controlled Line Platform (CLP), a set of gyros that kept everything aligned, as well as the operator hand control, a joystick with a few switches.

Everything we pulled tested out fine on the test sets. All wiring checks were fine.

I was basically a shop guy, running tests and repairing these larger assemblies. I got out the TOs. I looked at schematics. I found what I thought might be the problem.

The SCA had a wire, inside, that did nothing but pass through.

I suggested we swap out the SCA, a unit with no business in the left side tracking. The E-5 Level 7 scoffed. He had worked B-58s. He knew better.

Swapping out an SCA wasn't hard … two bolts and some connectors but the E-5 wanted this checked before we did the remove and replace, a pain in the ass because it wasn't easy reading the pin numbers on the connectors and making certain you had the right one at each spot, the meter lead all the way in on a female and not touching two male pins at the same time.

After climbing in, checking it several times while holding my flashlight in my mouth and balancing the meter I concluded I was right. It was dead open.

The E-5 scoffed and said I must have had the wrong pins but finally agreed, probably just to shut me up, to swap out the SCA.

Fixed the problem…on a Saturday morning.

When I ran the tests on the SCA it passed. (Tests were automated, run from a punch tape that checked all the functions of the unit.) Because of that he refused to sign off on shipping it to depot for the fix, something beyond base-level authorization.

Checking it with the multimeter, with the E-5 and the E-8 shop chief looking over my shoulder, showed what I saw on the flightline. More digging showed the automated tests never checked the pass through.

The unit was shipped to depot but AFAIK the test of that wire was never added to the TO or the tape. It happened once in 15 years on a system that was outmoded.

Answer by Eric Chaikin

I managed development of a a real-time stock market analytics workstation back in the Stone Age (the 90s). We took in a feed of all U.S. stock trades at each client site, and combined it with a terminal-based (as in VT-220) trading session at each desktop. The trading sessions would "randomly" freeze. More frustratingly, after minutes and minutes frozen - they would sometimes just unfreeze - and spool out all the data that had been received and keystrokes processed. Went on for months. We captured and replayed every byte and keystroke and couldn't isolate the issue, no matter how hard we tried. One day, the freeze happened in my office. It was a very hot day, and I had turned a fan on at some point. For fun, I tried turning off the fan - unfreeze. On: freeze. Off: unfreeze. After all of the time we spent tracing through lines of our own code, it turned out a small electric charge was causing a bit register in a chip on a board on a LAN card to flip, causing the NETBIOS driver to buffer the input indefinitely. Until another electric zap flipped the bit again and caused all the buffered data to unspool. Intel, who built the PCs we were installing, flew out one of our engineers to show them the problem in person. Whew.

Answer by Doug Massey

The toughest bug I ever fixed required a method that I and three co-workers patented and used on all our future designs: Patent US7251794 - Simulation testing of digital logic circuit designs [ https://www.google.com/patents/US7251794 ].

I designed ASICs (Application Specific Integrated Circuits) for IBM to handle Ethernet data flow and protocol coding — basically, it takes all the data you want to send somewhere, encodes it a bit so it can be sent one bit at a time over some longer distance and then recaptured and decoded on the other end of the wire. Because it’s all digital logic, there’s a clock on the transmitting side (nominally at 312.5 MHz) as well as a different clock at very close to the same speed on the receiving side. However, these two clocks come from different sources, so they’re inevitably just a little bit different — up to 0.02%.

So when a receiver captures data from this serial wire, it has to move the data from one clock to another — and that occasionally means that it has extra data that it has to get rid of (if the receiving clock is 0.02% slower than the transmitting one). No problem, the Ethernet rules put in gaps so that you can dump them without harming any of the actual data. The receiver had to be carefully designed to recognize exactly what was data and what was a gap, so it would remove the right thing.

We’d built a new design for a customer and the first few samples worked pretty darn well — except that every couple of minutes, the Ethernet links would reset themselves. Not good. Given that data flowed at 10 billion bits per second, this was a looooooong time between fails and made it close to impossible to be able to simulate in software testing (which runs about 1 million times slower than actual life). I flew out to the customer and went to work in their lab to try to make sense of what was happening.

Fortunately, the customer had seen something like this with another vendor as well and had been able to narrow it down a bit — it had something to do with the differing clocks (because when they used the same source for the clocks in the transmitter and the receiver, the problem went away). That helped me realize it might be the receiver losing its ability to identify the gaps in the data — but how would we be able to recreate that fail in our simulation environment to find the bug in my design?

First, math. We artificially shrunk down the length of the data packets, which increased the frequency of the gaps. Then we increased the difference between the clock frequencies by *just* the right amount to re-create the boundary conditions we needed to cause the same sort of issue we might have seen in real life. This sped up our verification simulations so that we could potentially encounter the problem faster — but we still never saw a fail unless we intentionally made the gaps too infrequent or the clock differences too great (in which case the design wasn’t being operated within specifications and it’s not really a bug).

So it was something even more rare than that — something that wasn’t occurring in the digital world of the verification simulation programs. We got to thinking about metastability and how that could cause problems. Metastability is when you try to capture data from one clock domain with another that isn’t synchronous — if you’re terribly unlucky, you might try to grab a bit into a flip flop *just* as that bit is changing and instead of capturing a 0 or a 1, you get something in between that takes a little bit of time to settle one way or the other (the data is “metastable” — as if it’s a marble perched atop a smooth hill, about to roll down to one side or the other with the slightest noise or perturbation). We knew all about metastability, of course, and how to Gray-code counters and double-flop bits to ensure that logic never saw fuzzy 0.5 values — but this wasn’t directly reproducible in digital simulations. We couldn’t even see what was happening.

The customer was a really clever guy and made a suggestion: instead of using just a continuous receiving clock in our simulations that was always 312.5 MHz (plus or minus whatever was needed to make it slightly different than the transmitting clock), randomly move the edge of the clock around so that when you get close to the metastability problem, you sometimes get the new value and you sometimes don’t. We did that — still nothing. Everything passed in simulation.

So I dove into the simulation environment (painfully — I was on the West Coast trying to run simulations and look at results from computers on the East Coast, in the mid-2000s) and tried to look for any funny business. The screen was filled with waveforms — random wiggles to anyone who didn't know what they were looking at, and usually even to those who did. :-) It's *really* hard to make sense of incoming data that's encoded, but the nature of this problem indicated that this is where the problem was occurring. Not even von Neumann could look at binary data and find an answer directly.

Out of desperation, probably, I just started scrolling around on the screen and happened to re-size the window when I saw something — a repeating pattern in the binary data, when that shouldn’t have been happening. I had to have all the address bits on the screen in binary (rather than in bus form, which would have displayed as a hexadecimal value) and had to have it zoomed in to *just* the exactly right level to be able to see it — but sure enough, there was a skip happening in the data that shouldn’t have been. It only happened once in a great while and usually, the design’s receiver would work it all out before it happened again and we’d be able to survive. But that’s when I realized that if it randomly happened multiple times in quick succession, the receiver would “walk off the edge of the earth” and lose its mind.

But the probability of seeing it happen in simulation was really quite unlikely — it would require stupendous luck or a ridiculously long test (which would gather an obscene amount of data that would likely crash any computer we were using before the test finished). On the flight back to the east coast, I just started drawing out ideas. By the time I’d landed, I had a kernel of an idea.

I took it to a co-worker (and eventual co-inventor, Frank Kampf) and described the whole thing. He agreed it was a good idea and we went to a conference room to figure it out. We would use a state machine in our simulations to vary the clock edge by extreme amounts — basically an entire half-cycle at a time — in order to exacerbate the metastability effects without violating the clock frequency requirements. I would draw a state machine on the white board and Frank would shoot it down. Then Frank would draw one and I'd shoot it down. Back and forth we went for about an hour — until we had one that neither of us could find a problem with.

We then called in the two other future co-inventors and showed them our idea (Suzanne was my lead...

Answer by Anthony Gold

This is an embarrassing story to recall, but here it is.

In my past-life I was a hardware engineer for Unisys working on the design of their A-Series mainframes. These were big iron machines – think very large refrigerator size – that cost and sold for lots of money (often millions per). No, that’s not the embarrassing part ;)

We had a new mainframe that was soon to be released to the market. It was the most powerful (and expensive) one we had ever built, and customers were eager to get it. This was a major product for the company, and a lot hinged on this release. We had a very tight schedule to get our testing complete and stress/tension were running high.

Testing was occurring across all three shifts, but there was one elusive bug that threatened to jeopardize the release. It was so bizarre, and no one could figure it out. Here’s what happened:

At random times, the system would “go down” – which in mainframe terms meant the OS would crash and everything would freeze – a power-cycle being the only way out of the crash. The time between crashes could be as short as a few minutes or as long as many hours – there was no pattern.

We tested every theory imaginable, from rogue apps with memory leaks to charting solar flares, thinking perhaps some stray gamma rays were thwarting our ECC logic. But no matter where we looked, we couldn't locate the source. This went on for days without a resolution and quickly escalated to senior management given the potentially huge financial impact to the company.

Now comes the embarrassing part.

I was working 2nd or 3rd shift – I don’t remember which, but it was late at night. There were three of us on debug at the time. I was walking past the machine when someone shouted out, “It just went down!” We all rushed back to the operator terminal to see what was running right before the failure. As usual, nothing specific.

And that’s when it hit me.

I told the other two to restart the machine … I had an idea.

In those days (early 90s) I had very long hair, carried an over-sized comb in my back pocket, and was the epitome of a dorky nerd trying to be cool. It didn't help that I could do spot-on renditions of Ice Ice Baby and Funky Cold Medina.

Anyway, after the mainframe was rebooted, I walked back to the machine, combed my hair, and then touched the frame of the computer with my finger. Sure enough, the instant I touched the machine, the other two yelled out, “It crashed!”

Long story short, we had a grounding issue with the computer, and spurious static electricity was randomly causing the machine to crash. Once a grounding plane was installed the problem was solved, machine shipped, customers were happy, and we made a ton of money.

And while I’d love to take credit for amazing powers of deductive reasoning, the truth of the matter is that dumb luck and my ridiculous hair led to the source of one of the company’s most elusive bugs.

Answer by Jeff Kesselman

if (bitmask & 0xF0 == 0xF0) ...

This cost me a week of chasing on a bare machine (no printf, no debugger, no ICE) in college.

C++ operator precedence gives the == operator higher precedence than the & operator, so the code is evaluated as:
if (bitmask & (0xF0 == 0xF0)) ...

Due to the lack of strong typing for Booleans in C/C++, that is equivalent to:
if ((bitmask & 1) != 0) ...

The intended meaning was:
if ((bitmask & 0xF0) == 0xF0) ...

This is why I tell my students to always use parentheses rather than relying on operator precedence. Parentheses are free, and (in addition to preventing bugs like this one) communicate your intended meaning to anyone reading the code.

(Edit: BTW this is an example of why strong typing is valuable. C# won't even let that compile.)

Answer by Quora User

My story is about 40 years old, mid 1970s. We had made our first microcomputer product - 8080 based - and it worked, mostly. Except that every few hours it would make an error. Not a large one, just a skip in a count, but since what it was counting was money, it mattered.

There were three of us in the team, and we worked all hours for two weeks trying to track it down. Finally Deadline Monday was approaching and it still wasn't fixed, so Thursday afternoon we set up a program trace and as every instruction was executed, the register set was printed out on a teletype, which was the only printer we had. Every. Single. Step. Clatter-clatter-clatter. This of course reduced the program execution speed to about one instruction every three seconds, but it was our only hope of catching the error.

We took turns watching that thing all weekend, hour after hour, roll after roll of paper. We slept in shifts, lived on pizza and beer. The air was thick with cigarette smoke and the lab smelled like a locker room. Finally, sometime on Sunday, it faulted. There were three lines in an interrupt routine - fetch variable X from RAM, decrement it, write it back. Three instructions. This one time, it didn't decrement.

We looked at the stack, to find out what was happening when the interrupt occurred. We looked at the next instruction after the return - it was, save a register to variable X. The interrupt was working just fine, but its result was immediately overwritten by the routine it interrupted. If the interrupt arrived a microsecond earlier or later it didn't cause an error, which is what made it so hard to catch. It was a valuable lesson for me - I've never made the same mistake since. Variables altered by an interrupt routine must be read-only outside it. (In this case, we put a DI/EI around the external routine to prevent it being interrupted before it finished modifying the variable. You do what must be done when you ship tomorrow.)

Don't judge us too harshly for the obvious error. At that time an 8080 cost a week's wages and none of us had more than a year of programming experience, nor any place to turn for advice except the device data sheets. Everything we knew was self-taught from first principles. Life is too simple for you young whippersnappers of today, with your hard drives and compilers and Java, whatever the hell that is. Assembler and paper tape and debugging with a logic probe, that'll make a man of you. Now you kids, get off my lawn.

Malcolm Teas

Two weeks. It took me two weeks to find this one. A very intermittent bug to find, but crashed the app and usually it seemed to "know" when things were critical and crashed then! So an important one to find.

It was an "executive information system", what today would be called a "management information system" I think. It was a very early example of a client-server system. When the system was especially busy (like month-end or quarter-end times) and the user was asking for a particularly large data set (like a full sales or production report) the app would – sometimes – crash. But! Not right away. If the crash was going to happen, it would happen anywhere from 30 seconds to ten minutes later. And it only happened for some people, not everyone. Crazy.

This was intermittent enough that for a while we weren't sure if it was actually a real bug. But I was assigned to investigate and fix if possible...

The server would, depending on load, chunk data up into larger messages when the server system was busier. In this system sending data was expensive to the server, so this made perfect sense.

The client would write the data into a large buffer and parse it out. So at these times the buffer would be close to or at max level.

This was some time ago, and low-level debugging was in assembler and hex memory dumps. So the tools were painful and not as slick as they are now.

Eventually I found through a process of careful elimination and systematic experiment that the data buffer had an off-by-one error. We thought we could write up to 4096 bytes, but only 4095 were allocated. (Numbers not accurate, this was a long time ago.)

The value right after that buffer in memory was a boolean used in the UI of the app. (Local var, but statically allocated, so not on the stack.) When the buffer got full and overwrote that boolean byte, then maybe it would change boolean value. This upset the UI with an inconsistent value and ended up (several stack frames later) crashing the app.

But, the UI only noticed this when the user took a certain path (clicks and button presses) through the system. An alternate path would set the bool and correct the value. Some people liked the first approach, others used the alternate. So the problem happened only to some people.

So it only happened when the server sent a full data block, and this tended to happen only during "important" times when a number of users were on the system accessing larger amounts of data.

Intermittent bugs are bad. Memory overwrite bugs are bad. Combining them is worse!

Mark Harris

In my college days I was working on my final-year project: we wanted to convert data into an Excel file dynamically via programming (i.e., click a button to convert the data into Excel format). So we started the project to convert data into an Excel-format file. After two months of hard work, we had our file-conversion project, and we started using it on a daily basis. It worked well; almost all conversions succeeded. But some of the converted files had an issue: the data-to-Excel conversion itself worked perfectly, yet the file would not open in Microsoft Excel.

When we tried to open such a file, we got a popup like this:

[Image of the Excel error popup; source: Google Images]

The issue appeared only in certain converted files, seemingly at random. We had no idea what caused it; we did not even know how to reproduce it, let alone fix it. We spent more than a week just trying to reproduce the issue.

Finally, we reproduced the issue. Our converter could write one or more sheets into a single Excel file. First, we confirmed there was no issue with single-sheet files; we tested all the cases and they worked fine. Then we moved on to files with two or more sheets, and the issue reappeared in the same strange way (some files opened, some did not). We googled the problem but got no clue. Then we had an idea: many developers have written Excel-conversion code, and some of them must have faced the same issue, so we searched for the error description on GitHub. And finally, we found our case.

It turns out Microsoft Excel does not allow two sheets in the same workbook to have the same name; Excel itself shows an error when you try to save such a file. But when generating the file via a program, there was no restriction stopping us from giving the same name to multiple sheets. Once we found the root cause, we solved it.

Mark Phaedrus

“Hard” is obviously a moving target. The more experience you have, the easier the fixes get. Of course, more experienced programmers tend to end up working on harder bugs to solve, so it can even out. I think the hardest-for-me bug I had to deal with was one of the first difficult bugs I encountered in my career. Especially because it was not actually just a code bug. It was a corporate relationship bug.

And it is a long story. Follow me, gentle reader, on this journey back in time.

There was once Company X, that did customized user interfaces for database software sold by Company Y. Company Y also did customized user interfaces for its database software. You see the problem. There were reasons why Company Y needed to support Company X. Nevertheless, the relationship was… tense.

Company X landed a big contract to develop user interfaces for Company Z. There was much rejoicing.

There was a catch. The big contract required Company X to provide a user interface for the Macintosh as well as for PC. And up until that point, Company X had been exclusively a Windows developer. So, Company X was now committed to deliver a Macintosh user interface, based on absolutely no Macintosh experience.

But there was a catch to this catch. The vast majority of Company Z’s employees used Windows. They had gone with Company Y’s database because it would work the best for their Windows users, and with Company X because of its Windows user interface’s capabilities. Macintosh support was a nice-to-have, a checkbox in the corner of the design requirements. So, the contract was very specific about what the Windows user interface should be. But the contract terms covering the Macintosh user interface were almost an afterthought. It had to exist, and it had to meet delivery milestones, and it had to be usable.

And that was where I, the intrepid young programmer, entered the picture. I was hired by Company X as the Macintosh Division. No, that’s not a typo. I would be the Macintosh designer, the Macintosh coder, the Macintosh documentation writer, and the Macintosh field support staff.

Anyway, Company X brought me on, and gave me enough of a budget to purchase one Macintosh and one set of development software. They showed me how their Windows software worked — namely, in ways that were so complicated and so platform-specific that porting them to Macintosh was out of the question. They handed me the floppy disk with Company Y’s new Macintosh library, and a list of the delivery milestones for the Company Z contract. And with that I was launched on my journey.

I knew that I had been hired largely because the stakes were low. Which also meant that I had little to lose. I leaned into it. I didn’t even try to replicate the Windows UX. Instead, I made something very small, but still very customizable. You could reconfigure my little Macintosh UX in five minutes, just by editing resource files. Not just the appearance of the UX, but also the database fields that would be displayed, and the ways that the user could change them. That wasn’t that unusual in Macintosh software at the time, but it was unheard of on the Windows side of things, where they added custom code changes and built separate versions of the applications for each customer. Everything was coming together. I was feeling good. I could actually do this!

Except.

When I tested my code against an actual Company Y database, sometimes it would crash. In all sorts of different places, and in all sorts of entertaining ways. Not always, but often enough to make my code unusable. And the simple test program supplied by Company Y worked fine, so it wasn’t the database or the library. It was something I was doing. Which was odd, because I could see that I was passing the right parameters to things. What was up?

Debugging tools at the time were much cruder than they are now, especially for intermittent bugs. I simplified my code again and again, trying to pin down the problem. Eventually I had a program that wasn’t much longer than Company Y’s test program. It still crashed.

Now I was in a crisis. I couldn’t see what I was doing wrong. It was Mac-specific code, so there was no one at Company X that I could ask for help. This was before the real rise of the Internet, so there were no forums to turn to. I absolutely knew that saying “I think there’s something wrong with this third-party library” was a non-starter for a new programmer. Company Y’s willingness to ‘help’ us with the Macintosh was very clear: “Does our test program work? Yes? Then the problem’s in your code. Go away and read the documentation again.”

With no real alternatives left, I started debugging my way into Company Y’s library, in the hope of just learning exactly what part of my code was triggering the crash in theirs. I had one advantage: the folks at Company Y hadn’t built a proper retail build of their library. The library had been built using some settings that would normally be used for debugging. So even though I didn’t have source code for the library, I could still use those debugging symbols to get hints of what was going on. The names of functions, the way that things were arranged in the original source files.

And I finally found the problem. When the library changed the system state in a certain way, it needed to increase the size of a block of memory. And it did that using the C language’s realloc function. And the programmer missed a crucial point: realloc can fail. And when realloc failed, the program would merrily assume that the block was as large as it requested. It would write new data into that ‘blank space’ that in fact was still very much occupied by something else. And then, maybe much later on, some other part of the program would try to access or modify that ‘something else’… and… boom.

The only reason that Company Y’s test program worked is that it was small and did things in precisely the same order every time, and it just so happened that this could never trigger the crash.

I suspect that Company Y’s Macintosh Division was similar to mine.

In any event, I hesitantly approached my Company X managers, and showed my work. And shortly afterwards, there was a meeting between Important People at Company X and Company Y.

And the Important Person from Company X said, we are experiencing crashes with your code, this is unacceptable, aren’t you testing this stuff?

And the Important Person from Company Y said, it’s your code, it’s your problem, don’t come to us and blame us for your issues.

And I tried very hard to sound confident, and I said, it’s your code, and specifically, it’s your call to realloc in your line 234 in your function DoSomethingOrOther in your YourSourceFile.c.

And the Important Person from Company X smiled, and there was a long pause, and I remember ABSOLUTELY NOTHING about the rest of the meeting.

But shortly thereafter, I was handed another floppy disk with another version of Company Y’s Macintosh library. And debugging symbols were still turned on, and there was a new function in YourSourceFile.c, and it had some sort of irritated-sounding name, and it fixed the problem. And I shipped my work off to Company Z, and they accepted it, and the “Macintosh support” clause in the contract was dutifully checked off.

The managers at Company X deemed my little Macintosh user interface to be worthy of being offered to other customers. The Macintosh Division remained open for business.

This would eventually lead to me receiving a safety briefing at a federal nuclear facility. “This alarm means you should head to the nearest shelter. This alarm means you should shelter in place. This alarm means you should make the most of your last few minutes.” And this would lead to me reconsidering my life choices.

But that’s another story.

Ira Baxter

Three. All having to do with rare, nondeterministic events. This means you can't find them easily with a debugger.

1970: I worked on Data General Nova minicomputer serial number #3. [The Nova was essentially the first RISC machine, with a stunningly odd set of arithmetic instructions, including ADCZL# 2,3,SBN. You don't have to know what this means to realize this is an odd machine.] We coded an assembly language device driver... that mostly worked. Occasionally it would fail. (We were debugging with front panel switches and a really bad debugger; remember, this was serial #3.) It turned out that an indexed branch with a negative offset would sometimes go to the wrong place... how? I chased this for days before I got down to an instruction sequence "interruptdisable", "jmp -index[reg]", "interruptenable" where the problem occurred with relatively high frequency (but never when you single-stepped it). We decided that disabling interrupts set a flip-flop near the ALU (you know you are desperate when you decide to look at the circuit diagram of gates that make up the CPU) and the extra current demand would make the ALU math slightly flaky. We sent the CPU back to Data General, they told us we guessed right, they fixed it and sent it back. Voila, problem solved. Nice to spend weeks to find a design error in somebody else's hardware.

Moral: don't depend on a flaky circuit design.

1974: A one-of-a-kind 16-bit VM minicomputer that I and another fellow (Dennis Brown, hello!) designed had a fancy-shmancy register chip in it to hold the CPU's registers. I designed an assembler and linker for it; the linker would print a symbol table of names and corresponding addresses on a teletype at the end of the link-edit step. Sometimes... the symbol name would print out as complete garbage, but the address was fine, and other symbols might or might not be fine. Ultimately we found the culprit: the register-chip bits would flip from zero to one sometimes when it got really hot; it would only get really hot when the program was doing heavy-duty math in the registers; and the linker was taking a radix-50 encoded symbol and tearing it apart by doing a repeated divide-by-50 (tight loop: compare, shift, subtract, repeat). The problem went away when we blew a lot of air over the register chip. (Freeze spray? In '73?) Cure: complain to the register-chip vendor, get a replacement chip.

Moral: don't depend on a flaky chip.

2012-2014: On MS Windows, with thread-switching code that had worked on Win32. The thread-switching code used the Win32 API calls SuspendThread, GetThreadContext, SetThreadContext, and ResumeThread. This code was written around 1999 (yes!) and had been stable for 15 years(!). On Windows Vista... sometimes (once every few million times!) the app doing the thread switching would crash. Try and find such a problem; it occurs abominably rarely and the symptom is "die horribly". This almost made me tear my hair out. I tried huge numbers of experiments, adding consistency checks in an astonishingly large number of places in the code, to little avail. Eventually I discovered it was the Wow64 emulation of these formerly rock-solid calls: GetThreadContext *lies* about the thread context. That is, it is supposed to return what is in the registers of a suspended thread, but sometimes returns trash. This is incredibly hard to detect: what do you look at, to see that it is wrong? I didn't debug this so much as recognize, via desperate web searches over 1-2 years, that another person had encountered the same problem; see "WOW64 bug: GetThreadContext() may return stale contents". Why didn't I notice on XP-64? Because there it *works*; I have run it literally billions of iterations without ever encountering a problem. This is an unforgivable sin for an OS call: a system call for managing thread context that is simply unreliable. To this day, MS has not fixed it; they say Windows 8.1 will tell you when GetThreadContext has just lied to you, which is hardly a good cure. [I have no evidence yet that Windows 8.1 tries to tell me this reliably; I have evidence that Windows 10 claims to tell me this.] Cure: my thread-switching code now sets a flag when it makes an OS call (in the hundreds of places that it does so, sigh); if the flag is set, I simply don't use SuspendThread/GetThreadContext. At least this solution I trust.

Moral: don't depend on a flaky software vendor. (But hardware you can get fixed).

Richard Farnsworth

I wrote a GUI and database that the operations people used to make mission-critical decisions. It ran in a failure-tolerant environment in a 24x7 operations and control room. This was in the late 80's or early 90's, and I was using VAX minicomputers; "mini" is a relative term, as these were about the size of a large washing machine. There were two processors, labeled A and B.

Every so often it crashed. Perhaps once every few weeks. Unacceptable, so I went in to find out what was going on. I wrote some diagnostic software to trap the problems. I isolated it to a specific module, then to a specific routine, then reduced it down to about 5 lines of code. It still ran for weeks without crashing, then it crashed all the time.

Eventually I twigged: it only crashed on one machine. In those days, the CPU wasn't a single chip like it is now. It turned out that one board on one processor was faulty, and it affected only one instruction: the one I used in my algorithm. Since there was an online/standby arrangement, the fault hid behind the online machine and only sometimes failed.

Sometimes your code is perfect, but the hardware is faulty.

Aditya Mendiratta

A couple of my friends in college were from an electronics background and had joined the computer department.

They were new to programming, so they asked me to help with some PL/SQL script that just wouldn't work. The script looked fine. I wrote the same on my computer, and it worked just fine.

They let me bust my ba**s on it for a while, before they told me they had copied it from a book. Dragging my cursor across every character, I stumbled upon the bug, the most epic bug I have ever come across.

They had used two single quotes (apostrophes) in place of every double quote.

That was a new low in the history of computing.

James Martin

Bugs can be hardware, too.

Had a piece of equipment that would intermittently shut down during flight operations. And the great thing about intermittent failures is that they never occur when you are in a position to troubleshoot.

Eventually, through months of intermittent heartache, I found the problem. The equipment had been produced by a commercial company, and they had supplied the main cable as well, which provided power (28 VDC) and signals. The company had used cheap wires in the cable, including cheap, low-temperature insulation. At some point there had been a power surge, the power wires had gotten hot, and the wires’ insulation had melted, allowing the power lines to short. When things cooled down, the wires would pull apart and the short would heal itself.

In operation, if the cable got just warm enough, the insulation would soften and the power lines would short out, killing the equipment. When things cooled down, the cable worked just fine and would test perfectly. Our organization made it a point to use only Teflon-coated wires, so it did not occur to me to check for the effects of cheaper insulation. And since the issue was occurring in the heart of a cable, it was physically invisible.

Murphy was a very clever guy.

Tom Clement

I wish I could say this was my brilliance, but it's a good story. In the 1980's the system we were developing ran mostly in 'high memory' on PCs in LISP, but there was a terminate-and-stay-resident (TSR) portion that ran in the lower 640K that was written in C. An exceptional coworker of mine was debugging a very strange intermittent behavior and had, by stepping through the assembly code, narrowed it down to a string comparison that seemed to fail when it shouldn't. He described it to my boss, and, without missing a beat, she said: "I wonder if LISP is leaving the CPU's DF register flag set to reverse comparison?" After investigation, it turned out that (of course) she was right. LISP always set the direction flag before doing a comparison, and left it where it was. The C compiler we were using always assumed it was set to forward, and only set it (and restored it later) if it needed a reverse-direction comparison. It still amazes me to recall that story. I would have spent days if not years trying to figure it out.

Jeff Darcy

Positions 2-10 for me were all race conditions of various sorts, but it turns out that position 1 was not. I was working at Revivio, which made a storage appliance, and we were trying to deal with some mysterious hangs. These were hard hangs which would kill even our kernel gdb stub, so normal debugging wasn't possible. I was wishing for some kind of hardware support so we could examine memory even though the CPU was out to lunch, when I realized we had it. These machines were connected via IB, so I wrote some code to export one machine's kernel+vmalloc regions to another and wrote a little program to slurp that into an ELF file that gdb on the second machine could understand. Those were two fun hacks all by themselves, but I wasn't done yet.

In phase two, I started collecting dumps but they didn't make sense. I started to notice that some kernel stacks were getting corrupted, e.g. with one frame pointing into the middle of another, and addresses (including code addresses) on the stack that could have nothing to do with each other in any possible call sequence. I added guard pages and memory scrubs and extra pointer checks in various places, to no avail. I sort of gave up and started looking at the few dumps where it seemed like a task struct had been corrupted instead of a stack (which I had previously written off as likely bugs in my IB/ELF hacks). Finally I realized what was happening. Sometimes a task would allocate so much in a single stack frame that it would jump all the way over the rest of its stack, all the way over its own task struct and any guard areas, into the next task's stack. Then it would return happily, but the next time the "victim" ran it would explode.

In phase three, I wrote a script to disassemble each of our many kernel modules and find large subtractions from the stack pointer. I found and fixed not only the likely culprits for the hangs we were seeing, but many more that were likely to cause problems later.

Moral of the story? If you have people who aren't used to writing kernel code, review their work very carefully for things like stack abuse and synchronization/reentrancy problems that they never had to deal with on Easy Street, until they're fully trained.

Rich Sadowsky

This one goes way back to before we had windowing operating systems. The operating system most people used was MS-DOS, which could not multitask. I worked for a development-tool company called TurboPower Software; we wrote some of the most popular programmer's libraries at the time. I wrote a tool, primarily in assembly language, that allowed a programmer to turn their code into a popup (terminate-and-stay-resident, or TSR, program). This allowed you to have code that stayed resident while you ran other applications and popped up when you hit a certain hotkey. The challenge was that TSR programs had to be tiny or they'd take all the RAM. My tool left a little 6K piece of code resident, so it took almost no RAM regardless of the size of the program it was bound to. When the hotkey was pressed, it would swap out the RAM of the program in the foreground and swap in the rest of the TSR program. This let you turn huge applications into TSRs, effectively creating a multitasking environment for DOS. It was very popular and worked very well. Keep in mind this was 30 or so years ago, so my memory may be slightly buggy too.

We got a report that programs written with our TSR library were crashing hard for a few people. Although this happened to only a very small subset of users, it became clear it was a bug in my code and not in the applications people were writing. We tested and tested and could not reproduce the problem. So we started asking increasingly detailed questions about the environment this was happening in.

Again, my code was primarily assembly language code, systems level code. It talked directly to the OS and the machine BIOS. We eventually determined a common denominator in all the cases where it was crashing. It only happened on machines with a certain version of the BIOS (a ROM module on the motherboard that implements certain systems level hardware-related code such as reading the keyboard). The ROM in question was from one of the biggest ROM vendors at the time. We obtained a computer with this same ROM. We weren't immediately able to reproduce the problem but eventually we did. I began tracing through every machine language instruction executed up to and including pressing the hotkey which invoked my little 6k code segment.

A little background on how CPUs and assembly language work is needed: the CPUs at the time were Intel 8088 processors and their successors, like the 80286. This processor family has a set of "string" instructions which move a sequence of bytes from a source to a destination. The string instructions can go forward or backward from the starting point; a bit flag in the CPU (the direction flag) determines the direction.

You can probably guess what happened now. Under this one scenario, we came out of a BIOS call with the direction flag set to backwards, which overwrote the operating system in memory, leading to a hard crash. The bug was that I failed to reset the flag to the needed direction on return from reading the keyboard. In our testing of other BIOS/DOS combinations, including others from the same vendor, the direction flag was always set forward, so I never encountered the bug. The lesson here is to never assume the state of a state machine if you have yielded control to something else that could modify it. I fixed this bug quickly by adding a single instruction to reset the flag to the forward direction. The code lived on for years, and close to a million people used the programs written with my tool.
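The state machine involved can be sketched in a few lines - in Python rather than 8088 assembly, since the original code is long gone. The flag, the copy loop, and the BIOS call below are models of the real thing, not real hardware:

```python
# Python model of the 8088 direction-flag hazard (the real bug was in
# assembly; this only mimics the state machine, not actual hardware).
direction_flag = 0  # 0 = forward (like CLD), 1 = backward (like STD)

def rep_movsb(mem, dst, src, n):
    """Model of REP MOVSB: copy n bytes within mem, honoring the flag."""
    step = -1 if direction_flag else 1
    for i in range(n):
        mem[dst + step * i] = mem[src + step * i]

def buggy_bios_call():
    """Models a BIOS keyboard call that returns with the flag left backward."""
    global direction_flag
    direction_flag = 1

def cld():
    """The one-instruction fix: reset the flag before any string operation."""
    global direction_flag
    direction_flag = 0
```

After `buggy_bios_call()`, any copy that assumes the forward direction instead walks downward through memory - which is exactly how resident code can end up overwriting the operating system. Calling `cld()` first is the one-instruction fix.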

Eventually Windows 3.0 hit the scene and we all took a new approach to writing multitasking applications. This was all sometime around 1990. To this day that is the hardest bug I've ever chased down. The learning here is relevant to any programming language. This happened to be systems level assembly code, but any state machine or similar construct could suffer a similar bug.

Jay Best

Ok so I worked for a telco as a broadband specialist to identify and help solve complex problems.

I tried to use an open-source gene sequencer to help identify hexadecimal code patterns for remotely identifying routers which connected to our network (I managed to identify 80% of the chipsets, and hundreds of router-specific problems).


I realized that there were a lot of problems which related specifically to the customers' routers, so finding a way to remotely identify the router could mean that we could datamine en masse (e.g. if we compare calls to the helpdesk, or speed and stability statistics, grouped by router chipset or brand, then we could remotely identify patterns and then try to reproduce them - e.g. if it looks like a certain router has a certain issue, we could call that customer, get the router in to try to reproduce it, and then work backwards to the root cause).

So we found that there was a part of the DSL connection sequence which passed the router signature on authentication. There were 5 specific SNMP MIBs[2] which looked useful, so we set up a process to strip these fields out and capture them to a text output.

This was all either ASCII, hexadecimal code, or all sorts of other variations (examples below) [3]

The problems were:


1) There are 40,000 unique "brands" or versions of ADSL/DSL/VDSL routers which I am aware of by different names.
2) These are manufactured mainly in China by 5 main manufacturers.
3) Each router can have as many as 10-40 different software versions (and each update is to fix a known bug).
4) Each time the modem brand wants a cheaper price, or runs out of stock from one manufacturer, they can jump to the opposition, who then often remakes the firmware to the brand's specifications (but can change the MIB identifier). So you have 5 companies, each creating slightly different permutations of the hardware (but running various firmware on top).
5) Some didn't use the MIBs correctly (I seem to recall one hardware guy must have got lazy, as I saw a lot of routers with the identifier set to "I am making a router!" or something similar).
6) Some used a serial code identifier (but with a specific pattern, e.g. the first 4 characters might indicate the modem maker).

How I found them out


* So I summed up the common ones to get a count of common router IDs.
* I searched for these common identifiers and found a bunch of chipset names, or Chinese manufacture codes, or worked out that it was the start of a MAC address, etc.
* I scraped some DSL router sites, and since I had spent heaps of nights doing this research, in the end I got a few PAs in India to search and contact all the modem manufacturers that they could (in my own time and at my own cost).
* I created a script to generate alternate spellings for each router (e.g. D-Link 502T could also be D[space]Link 502-T, Dlink 502, etc. ad nauseam).
* Then I ran this against our helpdesk logs (e.g. if a client calls up and any helpdesk agent, anywhere in the data, is diligent and records somewhere in the notes what brand or modem version they have, then we have that tied against the customer's DSL phone line and could see it against the MIBs - the problem was that the customer could change routers, so this was a fuzzy match).

* I called all the ISPs and got the router they mainly shipped, so we could get a strong signal of, say, "probably a Dynalink, as this was the free router given away by Orcon".
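The alternate-spellings step could look something like the sketch below. The answer doesn't show the actual script, so this is a hypothetical reconstruction: split the canonical name on spaces and hyphens, then rejoin the tokens with every combination of common separators, lowercased for fuzzy matching against helpdesk notes.

```python
# Hypothetical sketch of the alternate-spelling generator described above.
import itertools
import re

def spelling_variants(name: str) -> set:
    """Emit separator/spacing variants of a router name for fuzzy matching."""
    tokens = re.split(r"[\s\-]+", name)
    seps = ["", " ", "-"]  # the separator styles seen in free-text notes
    variants = set()
    # try every combination of separators between adjacent tokens
    for combo in itertools.product(seps, repeat=len(tokens) - 1):
        out = tokens[0]
        for sep, tok in zip(combo, tokens[1:]):
            out += sep + tok
        variants.add(out.lower())
    return variants
```

For "D-Link 502T" this yields nine variants, including "dlink502t", "d link 502t", and "d-link-502t" - enough to catch most of the spellings agents actually type.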


How we used Genetics


So then we took all the hex data and converted each modem string into a "protein chromosome" - there were only 16 hex codes, so we didn't use the full 22+ ...
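The answer cuts off here, but the encoding idea can be sketched. The exact alphabet mapping isn't given, so the one below is hypothetical: assign each of the 16 hex digits a distinct amino-acid letter, so that router signatures become pseudo-protein sequences an off-the-shelf sequence aligner can cluster.

```python
# Hypothetical hex-to-"protein" encoding for the technique described above.
# 16 of the 20 standard amino-acid letters, one per hex digit (mapping assumed).
AMINO = "ACDEFGHIKLMNPQRS"
HEX2AA = dict(zip("0123456789abcdef", AMINO))

def hex_to_protein(hex_sig: str) -> str:
    """Re-encode a hex router signature as a pseudo-protein string."""
    return "".join(HEX2AA[c] for c in hex_sig.lower())
```

Once every signature is in this form, standard alignment tools can group near-identical firmware strings the same way they group related gene sequences.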

Ajit Narayanan

In 2009, my team and I built an ARM9-based embedded computer (which eventually became the first version of our first product, Avaz). We designed the board ourselves, and then ported Linux onto it.

While the system would work fine most of the time, once in a while the Linux kernel would segfault and crash. We really had no idea what was causing this to fail.

I guessed that if this was a software issue, it was most likely a device driver. We stripped out pretty much every device driver except the serial port, and still saw occasional crashes.

At this point, the most frustrating thing about the situation was that the problem occurred so intermittently that we had no way of reliably reproducing the problem. One of my teammates had a brainwave and we over-clocked the system. To our immense relief, we started seeing the crash occur more frequently now. Once every ten reboots or so, Linux would crash during boot-up - and once in a while (perhaps once in 40 reboots) it would crash while verifying the checksum of the kernel.

It became quite clear that this crash was happening due to some complex interplay of hardware and software. The question, though, was how to proceed. None of us had any kind of experience debugging these kinds of problems.

Since we were occasionally catching a checksum error, we thought it would make most sense to start by checking the memory subsystem. We decided to focus all our energies on debugging the problem when it hit the checksum issue, postulating that fixing this issue would probably fix the other problems as well.

I would say the big breakthrough came when I edited the CRC verification code in the kernel to pull an IO pin on the processor high as soon as the checksum error was detected. With this recompiled kernel, we now had a superpower: a hardware indication of when the software error occurred. We hooked up an oscilloscope to each memory trace, with a one-shot trigger input connected to the hacked IO pin - and voila - we were now able to track, on the oscilloscope, the sequence of signals on that memory line just before the checksum error showed up.
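The shape of that kernel edit can be sketched in a few lines. The original was C in the Linux boot path; this Python model is only illustrative, and the `gpio_set` helper and pin number are made up (on the real board this would be a write to the processor's GPIO set-output register):

```python
# Model of the debugging trick described above: the CRC check is edited so
# that the moment a mismatch is seen, a spare GPIO pin is driven high,
# giving the oscilloscope a one-shot hardware trigger.
import zlib

triggered_pins = []  # stand-in for the scope seeing the pin go high
DEBUG_PIN = 7        # assumed spare IO pin number

def gpio_set(pin):
    """Stand-in for writing the pin's bit into the GPIO set-output register."""
    triggered_pins.append(pin)

def verify_kernel_crc(image: bytes, expected: int) -> bool:
    """Kernel-image checksum check, edited to raise a hardware trigger on failure."""
    actual = zlib.crc32(image) & 0xFFFFFFFF
    if actual != expected:
        gpio_set(DEBUG_PIN)  # drive the spare pin high: scope trigger
        return False
    return True
```

The design point is that the pin write happens inside the failure branch itself, so the scope captures the bus activity immediately preceding the corrupted read rather than some later symptom.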

If I remember right, the first 30 lines (out of 32) gave us no insights. I must admit we had almost given up after checking the lowest 10 or so bits, since those were the ones that changed most frequently. I think it was one of our technicians who offered to stay late and check all the lines. And sure enough, when examining the 31st line, he found an anomaly - whenever we saw a checksum error, the signal on this line was noticeably slower than the other lines. Some kind of jitter was happening, and this was causing data corruption on the line between the RAM and the processor.

The technician called me late at night and I showed up at the office to investigate a bit deeper. It was soon evident to me what had happened - the guy who had laid out the PCB had taken great pains to equalize the lengths of the traces on all the memory lines (so that all the signals would have almost exactly the same distance to traverse between RAM and processor) but for some reason, this line alone was routed through an inner layer as a significantly longer trace.

Once we identified that, it was easy enough to fix the trace and build a new board - and despite my instinct telling me there would be many more such issues to painstakingly unravel, that one turned out to be the last hardware issue we encountered. Linux (and our application) ran perfectly reliably on the new board, and we were able to get our product to market, albeit a couple of weeks behind schedule.
