KDDI's Worst Communication Failure Ever, or "Foolproof"
The title photo is taken from the Asahi Shimbun article of July 30, 2022, "Apology refunds for all subscribers; impact of the outage hard to pinpoint: KDDI".
This table does not show the time from the start of the trouble to its resolution, but it was four and a half hours at SoftBank in 2018, 29 hours at DoCoMo in 2021, and 86 hours at KDDI this time. This was the largest outage ever for a mobile phone service, both in the number of people affected and in the time it took to resolve.
I can see that KDDI made an effort to make its report easy to understand. Even so, I think the part about the cause is hard to follow unless you have studied network technology and actually worked in the field. For example, the term PGW appears without explanation; it is an abbreviation for Packet data network Gateway (though spelling it out probably does not help much either). So, let's look at a table that compares this incident with last year's DoCoMo trouble.
As the table comparing it with DoCoMo's 2021 trouble shows, in both cases the trouble was triggered by work the carrier itself had carried out. Another point in common is that the work led to heavy congestion, which took a considerable amount of time to resolve.
According to this article, the cause of KDDI's trouble was that "an operator accidentally used an old runbook."
It is said that a setting error was made during maintenance work on the "core router" that sorts data at the KDDI base in Tama City, Tokyo, in the early morning of the 2nd. In what was a routine equipment replacement, the operator mistakenly used an old runbook when switching data paths. There were both new and old runbooks, and the mix-up went unnoticed during the work. At the press conference, Managing Director Kazuyuki Yoshimura said, "It's a management problem. The person issuing the work instructions was in a position to select the old runbook." Because of the setting error, the data flow became "one-sided". Location registration requests from users' terminals were retransmitted many times, causing "congestion" on the line. (Omitted) The volume of signaling traffic is said to have risen to as much as about 7 times the normal level. Technicians tried to restore service while controlling the traffic, but a new abnormality was found and it took about 61 hours for the failure to be resolved.
Another article reports the following.
"I think (this communication failure) should have been prevented. I said that it was a misconfiguration (of the router), but it was an error in the work instructions. The operator worked as instructed."
The above remark was made by KDDI President Makoto Takahashi at the press conference on July 29. In other words, the person instructing the operator handed over the wrong runbook.
There are the terms foolproof and fail-safe. You can find a detailed explanation in the article below.
Fail-safe and foolproof design share the premise that "people make mistakes". Fail-safe means keeping things safe even when a failure occurs, and foolproof means preventing the mistake from happening in the first place.
Applying this to the KDDI trouble, a foolproof measure would be anything that prevents the wrong procedure manual (runbook) from being picked. For example, old procedure manuals should be kept in a separate folder that makes it obvious they are "old," and at least two people should check that the procedure manual being used is the correct one. A possible fail-safe measure would be to be prepared to resolve congestion as soon as it occurs.
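To make the foolproof idea concrete, here is a minimal sketch in Python of a check that refuses to hand a runbook to an operator unless it is the latest version, is not marked as retired, and has been approved by two different people. This is only an illustration of the two measures suggested above, not KDDI's actual process; the names Runbook, RunbookRejected and approve_runbook, and the idea of tracking a numeric version, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    """Hypothetical record describing one procedure manual."""
    name: str
    version: int
    retired: bool  # True for old runbooks kept only for reference


class RunbookRejected(Exception):
    """Raised when a runbook must not be handed to an operator."""


def approve_runbook(runbook, latest_version, approvers):
    """Return the runbook only if it passes every foolproof check."""
    # Check 1: old runbooks are explicitly marked, like a separate "old" folder.
    if runbook.retired:
        raise RunbookRejected(f"{runbook.name} v{runbook.version} is retired")
    # Check 2: the version must match the one registered as the latest.
    if runbook.version != latest_version:
        raise RunbookRejected(
            f"{runbook.name} v{runbook.version} is not the latest "
            f"(v{latest_version})"
        )
    # Check 3: at least two different people must have approved it.
    if len(set(approvers)) < 2:
        raise RunbookRejected("at least two different approvers are required")
    return runbook
```

With this sketch, calling approve_runbook on a runbook marked as retired, or with the same approver listed twice, raises RunbookRejected before any work starts. The point of the design is that the mistake is stopped by the tooling rather than by someone's attention.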
After DoCoMo's failure, KDDI had put in place a procedure for recovering immediately even if congestion occurred, and had been reviewing its system design. After this failure, KDDI followed the same recovery procedure as DoCoMo. However, that still did not bring the situation under control.
KDDI had also taken such measures, but problems still occurred. No specific information has been released on what kind of preparations were made, how many people were involved, or how they were managed. But it is certain that something was missing.
"Did you rehearse in the test environment in advance?" "Do you regularly run mock drills?" "Or had the workers let their guard down because the work had been done many times?" I would like to ask such questions.
<Addendum, August 1, 2022>
I read the detailed report of the press conference that was introduced in this person's article.
The procedure manual takes the form of a master procedure manual, and all the procedures are confirmed in the production environment and the simulated test environment. There are two types of confirmed procedure manuals, the old one and the new one, and this time the old one was mistakenly specified in the instructions. The old runbook was tested in the old environment and the new runbook in the new environment, so both are correct runbooks.
What happened this time, however, is that the instructions were wrong.
There is always an approval step for what is to be done based on the runbook. That process includes a check of whether the runbook is the latest file, but the approver did not notice the mistake.
I don't know the details, but whether the work was handled by someone unfamiliar with it, or was a misunderstanding or an oversight, it is a sorry state of affairs. The system may have become too big for humans to manage.
<End of addendum>
On July 3, KDDI President Makoto Takahashi held a press conference on the trouble and explained the situation himself. Explanations about systems are usually given by the person in charge of technology or the CIO. On the Internet, I saw praise for the president's ability to answer reporters' questions on the spot.
Compared to the terrible press conference at the time of the Seven Pay trouble, it was certainly handled properly. However, no matter how well he answers questions, a manager under whom such a serious incident occurred is incompetent.
Judging from Mr. Takahashi's reported career, he may be a highly skilled engineer, which is rare for a manager. However, high technical ability of one's own does not necessarily come with the managerial skill to lead subordinates.
I have heard many times of capable people who take themselves as the standard and therefore assume their subordinates are as capable as they are. However, what is most required of a president is the ability to get his subordinates to do their jobs properly.
I am not sure if this article is correct. It may be that they had to reduce expenses because of the circumstances described in the article I wrote.
However, if this article is correct, a manager who demoralizes workers while telling them to do the job properly has no place in that role. The management position should be handed over to someone who does it right. The annual salary of 200 million yen should go to someone who is worthy of it.
IT companies remain labor-intensive. AI may eventually take over the work that people now do, but for the time being, this is not likely to change.