
Could an AIOps product have saved the TSE in a pinch?


On October 1, 2020, the Arrowhead system at the Tokyo Stock Exchange (TSE) was down for the entire day. The cause was a failure in the primary unit of the storage system, which was configured redundantly, but failover to the secondary unit did not take place.

Let's consider whether this could have been avoided by installing one of the recently popular AIOps products.

The bottom line: storage monitoring software on the level of HPE InfoSight would not have been enough, but a full-fledged AIOps product like New Relic would.

Hypothetical configuration

No official configuration seems to have been announced, so for now let's assume something like Pure Storage's "zero RTO, zero RPO, no third site required" system.

Incidentally, the TSE has a disaster recovery site in the Kanto area, which is set up to resume operations within 24 hours in the event of a disaster.

In that sense, you could say that the Osaka Securities Exchange (OSE) is the third (or second?) site. It operates without relying on the TSE, and unlike the Nagoya Stock Exchange and others, it did not suspend trading this time. (Come to think of it, the TSE is also considering a disaster recovery site in the Keihan area.)

In my experience with another system, the truly critical systems are run as two completely separate systems, much like the old fault-tolerant systems.

However, full redundancy within a single stock exchange is a waste of resources. In that case, it is more efficient to run two exchanges, as with the TSE and OSE. Anyone in the middle of applying for an IPO on the TSE or the OSE may be affected, but let's accept that as unavoidable.

Now, the Pure Storage example configuration referenced above is not a primary/primary arrangement. In a fully redundant system, both sides run the same processing and cross-check each other's results, so if one of them stops there is no impact on the system as a whole.

That is to say, the actual configuration was not fully duplicated. That may seem a bit sloppy for a TSE system, but let's treat that as a separate story.

Not with InfoSight

Now, if you have a system like Pure Storage's, how far would HPE InfoSight get you? I've heard that the systems can be switched manually as long as someone is monitoring 24/7, and that a switchover on the order of a second would be acceptable.

But unfortunately, Nimble InfoSight cannot support this. It monitors the uptime of servers, storage, and the network, but reportedly "collects data once every five minutes." That is not enough to meet the requirement.
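
To make the gap concrete, here is a minimal sketch of the arithmetic. It assumes the requirement is detection within one second (the "order of a second" mentioned above) and treats monitoring as simple periodic polling; the function names are made up for illustration, and the 0.5-second figure is the sampling interval quoted for New Relic later in this article.

```python
# Sketch only: models a monitor that polls at a fixed interval.
# Function names and the 1-second requirement are illustrative assumptions.

def worst_case_detection_delay(poll_interval_s: float) -> float:
    """A failure right after a poll is not seen until the next poll."""
    return poll_interval_s


def mean_detection_delay(poll_interval_s: float) -> float:
    """If failures occur at a uniformly random time, the average wait is half the interval."""
    return poll_interval_s / 2


def meets_requirement(poll_interval_s: float, required_s: float) -> bool:
    """True if even the worst-case delay fits within the required detection time."""
    return worst_case_detection_delay(poll_interval_s) <= required_s


FIVE_MINUTES_S = 5 * 60

print(meets_requirement(FIVE_MINUTES_S, required_s=1.0))  # five-minute collection
print(meets_requirement(0.5, required_s=1.0))             # 0.5-second sampling
```

With five-minute collection the worst case is 300 seconds and the average is 150 seconds, both far beyond a one-second budget, whereas 0.5-second sampling fits inside it.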

It can watch for early warning signs, but in this case the equipment had been in service for less than a year (according to a comment at the TSE press conference). It was still under warranty, a period in which you would not expect any "Sony timer"-style failures.

I understand that InfoSight has successfully consolidated 100 PostgreSQL units into 16 Vertica units, but customization is not possible because HPE engineers are in charge of the data analysis. Pure Storage's Pure1 has the same once-every-five-minutes limit. It's quite a tough spot.

New Relic can do it

On the other hand, a full-fledged AIOps product like New Relic can handle it. First of all, it uses its proprietary NRDB to process large amounts of streaming data in real time. You have to install agents and plug-ins, but sampling can be done in 0.5-second increments.

The big difference between InfoSight and New Relic is that InfoSight monitors patterns in data from a single site over time, whereas New Relic observes changes across multiple data series over time. This is accomplished using machine learning and deep learning, which is why it's called AIOps.
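
As a rough illustration of watching several series together rather than one at a time, the sketch below flags an incident only when two or more metrics deviate from their own recent baseline at the same moment. It uses a plain z-score as a stand-in for machine learning, and the metric names, sample values, and thresholds are all made up; this is not a description of how New Relic works internally.

```python
from statistics import mean, stdev

# Hypothetical sample data: the last value of each series is the newest reading.
SAMPLES = {
    "storage_latency_ms": [2.0, 2.1, 1.9, 2.0, 2.2, 9.0],
    "order_latency_ms": [10, 11, 10, 9, 10, 40],
    "cpu_percent": [30, 31, 29, 30, 32, 31],
}


def zscore_latest(series):
    """How many standard deviations the newest value is from the earlier baseline."""
    baseline, latest = series[:-1], series[-1]
    sd = stdev(baseline)
    if sd == 0:
        return 0.0 if latest == mean(baseline) else float("inf")
    return (latest - mean(baseline)) / sd


def joint_anomaly(series_by_name, threshold=3.0, min_series=2):
    """Return the deviating series only if at least `min_series` deviate together."""
    deviating = [
        name
        for name, series in series_by_name.items()
        if abs(zscore_latest(series)) >= threshold
    ]
    return deviating if len(deviating) >= min_series else []


print(joint_anomaly(SAMPLES))
```

Here the storage and order latencies jump together while CPU stays flat, so the two latency series are reported as one correlated event rather than as separate alerts.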

What's more, there are display templates for storage administrators as well as for system administrators and application developers, and they are highly customizable. This is where the difference between plain AI and "AIOps" comes in.

Some vendors of AIOps products use graph DBs to create data for training. This way, if a memory failure is detected, as in this case, the administrator can see what impact it will have on the system as a whole, which is very easy to understand.
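
The idea of reading impact off a dependency graph can be sketched with a plain breadth-first traversal. The component names and edges below are hypothetical and do not represent the TSE's actual topology, and a real product would keep this in a graph database rather than a Python dict.

```python
from collections import deque

# Hypothetical dependency graph: "A": ["B"] means B depends on A.
DEPENDENTS = {
    "storage-1-memory": ["storage-1"],
    "storage-1": ["shared-disk"],
    "shared-disk": ["trading-server", "info-distribution"],
    "trading-server": ["order-entry"],
    "info-distribution": [],
    "order-entry": [],
}


def impacted_components(graph, failed):
    """List everything downstream of `failed`, nearest components first."""
    impacted = []
    visited = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in visited:
                visited.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


print(impacted_components(DEPENDENTS, "storage-1-memory"))
```

Starting from a single memory fault, the traversal walks outward to every component that depends on it, which is the kind of view that lets an administrator see the system-wide impact at a glance.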

Also, the monitoring points are not limited to storage. They can include applications, as well as the results of dummy transactions run through those applications. JP1 can do this too, but it cannot monitor at one-second intervals.
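
A dummy-transaction check of this kind might look like the following sketch. The `probe` callable stands in for whatever test transaction the application exposes; it, the one-second interval, and the 0.5-second response budget are assumptions for illustration, not features of any particular product.

```python
import time
from typing import Callable, List, Tuple


def run_synthetic_probe(
    probe: Callable[[], bool],
    interval_s: float = 1.0,
    iterations: int = 3,
    timeout_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[bool, float]]:
    """Run a dummy transaction once per interval and record (healthy, elapsed seconds)."""
    results = []
    for _ in range(iterations):
        start = time.monotonic()
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # an exception from the dummy transaction counts as a failure
        elapsed = time.monotonic() - start
        # Healthy only if the transaction succeeded within the response budget.
        results.append((ok and elapsed <= timeout_s, elapsed))
        # Wait out the rest of the interval so probes start roughly once per second.
        sleep(max(0.0, interval_s - elapsed))
    return results


def dummy_order() -> bool:
    """Hypothetical stand-in for submitting a test order and checking the result."""
    return True


for healthy, elapsed in run_synthetic_probe(dummy_order, iterations=2, sleep=lambda s: None):
    print(healthy, round(elapsed, 3))
```

The `sleep` parameter is injectable only so the loop can be exercised without waiting; in real use the default `time.sleep` would pace the probes.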

Of course, even AIOps products cannot solve every problem. Moreover, storing large amounts of data requires substantial storage and compute resources.

The TSE's budget may be tight, but implementation is still worth considering.

In the current TSE system, the secondary storage unit performs alive (heartbeat) checks, and the system is also monitored by traditional ITOM software. (If the conventional system collects the secondary storage data directly, the system administrator is buried in a pile of alerts and can't get to the root problem.)

If this is instead done at the application level with per-second monitoring, and any problem can be traced back to the storage, then the person in charge can focus only on the critical problems that affect system operations.

I mentioned in another thread that it is difficult to assign people to storage monitoring and switching, but if you think about it, TSE-level systems have maintenance staff on hand. As long as the problem and its root cause and response are known, the maintenance staff can do the work onsite at the discretion of the person in charge.

Also, since the system is highly customizable, it is easy for the person in charge to be on duty instead of being on call 24 hours a day, 365 days a year. (Even so, Japanese business people can't let go of their cell phones.)

Wrap Up

So, if an AIOps product such as New Relic or AppDynamics had been in place instead of a traditional ITOM product or HPE InfoSight / Pure Storage Pure1, the suspension could have been avoided.

Fujitsu doesn't appear to carry AIOps products, but it might be a good idea for it to get set up to handle them as soon as possible.

(Both companies attended the keynote speech at CloudNative Days Tokyo 2020 in September.)

————————–
Sei Yotsuba