At my workplace we were made aware of some instability issues, and the first place to look is, of course, the logs. The problem had apparently been going on for a while, but the most important parts of the system were not noticeably affected; it was early morning and we didn't have much traffic.
We noticed a lot of network connection errors for SQS, so one of the team members started blaming Amazon.
My response was that if SQS had these instabilities, we would have noticed posts on HN or downtime on other sites. The developer ignored this and jumped into the code to add toggles and debug log messages.
I spent two more minutes looking and found connection issues to other services as well. I concluded that we saw so many SQS errors simply because SQS is called much more frequently than the other services.
The other developer pushed changes to prod which caused even more problems, because now the service could not even connect to the database.
He wrongly assumed that the older versions of the service, which were still running, were holding on to the connections, and killed them to make room for his new version. We went from a system with instabilities to a system that was down.
The actual cause was a network error, which someone on the ops team eventually fixed.