I believe everyone has some kind of problem or failure in their system that happens from time to time: not terrible, but it hurts. This is a short post about exactly that kind of problem and what we did to fix it.
My company uses a very large continuous integration system, and software testing is split into about twenty stages. When one stage finishes, testing moves on to the next. In our case the whole process typically takes two days to complete. Now, from time to time we experience all sorts of issues in our testing infrastructure: testing boards stop working, there are network connection issues, or some otherwise well-behaved scripts simply decide to take a day off. One stage fails, and the whole process has to be repeated from the start.
We tried really hard to fix all the issues in the testing infrastructure, but without success. There is simply no way we could keep everything working properly all the time; the system is huge and there are many points of failure. Needless to say, the developers were angry: every time there was a failure, they had to rerun their tests and wait another two days for the results.
So we arrived at an idea that many people have had many times: since we cannot fix all the issues, let's make recovering from them as cheap as possible. We implemented a fast recovery mechanism: if a testing job fails because of a problem in the testing infrastructure, the user can simply restart it. Instead of repeating all the stages, the system skips the stages that already passed and reruns only the ones that failed.
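The core of that mechanism can be sketched in a few lines. This is not our actual implementation, just a minimal illustration of the idea; the function names and the way stage results are recorded are assumptions:

```python
def run_pipeline(stages, run_stage, passed=None):
    """Run stages in order, skipping the ones recorded as passed.

    stages:    ordered list of stage names (hypothetical)
    run_stage: callable taking a stage name, returning True if it passed
    passed:    set of stage names that passed on a previous attempt
    """
    passed = set(passed or ())
    for name in stages:
        if name in passed:
            continue  # already green on a previous run, skip it
        if not run_stage(name):
            # an infrastructure hiccup: stop here and report the
            # progress so far, so a restart can resume from this point
            return passed
        passed.add(name)
    return passed
```

On a restart, the caller feeds the `passed` set from the failed attempt back in, so only the remaining stages run; instead of two days, a retry costs only the time of the stages that actually failed.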
Instead of waiting two days, the wait was now a few hours on average. Far from perfect, but a huge improvement over the previous state.
Second example: earlier in my career we were developing a TV box, and there was an issue with the box slowing down. It would run well for the first few hours, but after several days it would get slower and slower, to the point of being useless. The reason for the slowdown was memory fragmentation. And the solution? Instead of implementing all sorts of expensive tricks and workarounds, we simply rebooted the box at 3 o'clock in the morning, when nobody was watching. Problem solved.
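On a Linux-based box, a nightly reboot like that needs nothing more than a scheduled job. A crontab entry along these lines would do it (the exact path of the reboot command is an assumption and varies per system):

```shell
# Hypothetical crontab entry: reboot every day at 03:00,
# when (almost) nobody is watching.
0 3 * * * /sbin/reboot
```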
And the result?
Even though these solutions are far from perfect, they get the job done. When failures happen, they aren't such a big deal and we keep running. We started by trying, unsuccessfully, to fix the issues themselves and ended by working around them; all it took was a simple shift in perspective.