A record of an IT system failure: troubleshooting and review

Recently, one of our projects suffered a major failure. It drew close attention from the person in charge at Party A, who directly @-mentioned our Leader. It took us two days to get from the start of the failure to a basic fix.

After that, the project team also spent nearly two hours reviewing and summarizing:

(1) The reason for the failure.

(2) Troubleshooting.

(3) How to prevent the failure from happening again:

Strengthen the early warning mechanism to quickly identify problems;
Notify members of the project team when a warning occurs, not just one or two of them;
Take every early warning seriously and resolve it within 2 hours of receiving it.

(4) How to solve such a problem if it occurs again.
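The prevention rules in (3) can be sketched in a few lines of Python. This is only an illustration, not our actual monitoring code: the `Alert` class, the `notify_team` and `is_overdue` functions, and the role names are all hypothetical. Only the "notify everyone" rule and the 2-hour deadline come from the review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Deadline agreed in the review: resolve within 2 hours of the warning.
RESOLUTION_DEADLINE = timedelta(hours=2)


@dataclass
class Alert:
    """A hypothetical early-warning record."""
    message: str
    raised_at: datetime
    resolved_at: Optional[datetime] = None


def notify_team(alert: Alert, team: List[str]) -> List[str]:
    """Build one notification per team member, not just one or two people."""
    return [f"To {member}: WARNING {alert.message}" for member in team]


def is_overdue(alert: Alert, now: datetime) -> bool:
    """Return True if the alert is still unresolved past the 2-hour deadline."""
    if alert.resolved_at is not None:
        return False
    return now - alert.raised_at > RESOLUTION_DEADLINE


if __name__ == "__main__":
    # Hypothetical roles and warning text, for illustration only.
    team = ["project lead", "back-end programmer", "technical lead"]
    alert = Alert("example warning", datetime(2024, 1, 1, 9, 0))
    for line in notify_team(alert, team):
        print(line)
    print(is_overdue(alert, datetime(2024, 1, 1, 11, 30)))
```

An overdue alert is the signal to escalate rather than keep waiting.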

Through the review meeting, everyone reached a consensus and discussed the response plan to improve the follow-up work. But there is another question that got me thinking:

This is not the first time this kind of failure has happened, so why was it never properly resolved before?

As the main person in charge of the project, how did I follow up on such failures before?

I would notify the back-end programmer, who would generally restart the service or start multiple threads, and after a day the problem was basically solved.

Then everyone got busy again, and there was no formal review of the cause of the failure or of how to prevent it.

So why was no review done?

(1) Fixing such problems at the root may require rebuilding the system. Nobody had thought of a simple, quick solution, so the root-cause fix was shelved.

(2) This kind of problem was not considered to have serious consequences, so it was handled in a simple way to keep maintenance costs down.

Facts have proved that simple fixes do not reduce maintenance costs. The programmers' workload seems to go down, but the maintenance work is passed directly on to me. Small problems caused by the imperfect system mean I have to communicate with Party A. It is not a lot of work, but it takes up part of my time.

(3) As the project leader, I did not ask my superiors for help or apply for additional resources.

It suddenly dawned on me that it was important to ask for help in a timely manner.

This was mainly because I saw that the Leader attached great importance to this failure and tracked and supervised the relevant people throughout the process. (Similar glitches had happened before, but they were not followed up in depth.)

One reason the Leader tracked the whole process was this: when we talked to the programmers about the problem, we found that nothing could be done about this kind of failure except wait.

This means that if this kind of failure occurs again, we still have no way to solve it. So the Leader followed the whole process and supervised the technical lead in carrying out the failure review.

Similar failures had occurred before, but I did not actively draw on the resources of the Leader and the technical director. I did not convey the seriousness of the problem to them, nor did I get their attention.

And when I found that there was no perfect solution, I did not report in time how helpless we were in the face of such problems. Instead, I took a short-sighted approach and avoided the underlying problem.
