Outages are expensive for any industry. For airlines, the problem is multiplied by the impact of cascading failures and cascading costs. How much money did Southwest, Delta and American Airlines lose when planes had to be grounded recently due to a computer outage?
The cost of downtime goes beyond the obvious inability to take orders and receive payments. How do you measure the impact on your brand when customers miss major milestones due to flight delays, especially when roughly three-quarters of passengers fly only once a year? In many of those cases, you have lost them to a competitor forever. What about the fact that stock prices drop when software fails?
Even more important: As flights become increasingly dependent on software, there are actual lives at stake in making sure it works as expected. The most recent airline failure was caused by a problem with AeroData, a program that helps calculate things like aircraft weight and balance for flight planning. The hard truth is that the increasing complexity of the airline business means that these failures will keep occurring unless something changes. As technologists, we’ve become great at identifying issues and reducing mean time to resolution (MTTR), but what we need to encourage is a more proactive approach.
An effective way to build resilience is to cause failures intentionally, in order to see how a system responds and to build up an immunity. In the software world, we have a term for this: chaos engineering. Airlines can prepare for potential failure scenarios by being proactive and improving their systems’ ability to withstand failure. Airlines already have rigorous standards for the mechanical and software components of their planes, but in today’s world, the bar needs to be just as high when it comes to their on-ground support software.
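To make the idea concrete, here is a minimal sketch of a chaos experiment in Python. Everything in it is hypothetical and chosen for illustration: `fetch_weight_and_balance` stands in for a real service call, and `with_injected_failures` and `run_experiment` are names invented here, not part of any chaos engineering tool. The wrapper deliberately fails a fraction of calls so you can observe how the calling code copes.

```python
import random


class DependencyDown(Exception):
    """Raised when the (simulated) dependency is unavailable."""


def fetch_weight_and_balance(flight_id):
    # Hypothetical stand-in for a real call to a flight-planning service.
    return {"flight": flight_id, "status": "computed"}


def with_injected_failures(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise DependencyDown("injected failure")
        return func(*args, **kwargs)
    return wrapper


def run_experiment(call, flight_ids):
    """Count how many requests succeeded and how many surfaced an error."""
    results = {"ok": 0, "failed": 0}
    for flight_id in flight_ids:
        try:
            call(flight_id)
            results["ok"] += 1
        except DependencyDown:
            results["failed"] += 1
    return results


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the experiment is repeatable
    flaky = with_injected_failures(fetch_weight_and_balance, 0.3, rng)
    print(run_experiment(flaky, [f"FL{n}" for n in range(100)]))
```

In practice, such experiments are run against real infrastructure with a small, controlled blast radius. The point of the sketch is only the shape of the exercise: inject a known failure, measure what the caller experiences, then fix what breaks.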
Another contributing factor is that a significant portion of the industry still relies on systems that were initially designed 50 to 60 years ago. How do you design, create and test systems that are capable of replacing these old stalwarts while enhancing the resulting speed and capacity? What should you test during that process? It is no longer sufficient to test using only functional tests and integration tests during your software build process and to do user acceptance testing at the end. Airlines must also test for vulnerabilities and security. They must make sure to test for regressions. They should also test for resilience.
Third-party dependencies add another layer of complexity. After airlines validate that their own software and systems are resilient, there remains a crucial additional step: making sure those systems can gracefully handle the failure of their dependencies. This specifically includes the dependencies they don’t own but have outsourced. In March, Business Insider reported that “flyers on American Airlines and JetBlue reported long lines and confusion at airports ranging from Las Vegas to New York, Boston to Atlanta, and others,” due to a technical issue with Sabre, the largest global distribution systems provider for air bookings in North America.
Airlines are responsible to customers for these problems, even if they are caused by third-party vendors. Customers won’t care why the issue is happening; they care that they are unable to use the service that they purchased and that has your name on it. Trusting a third-party vendor as a dependency means accepting significant risk, regardless of the service-level agreements and uptime promises that are in place. It’s critical that airlines figure out how to mitigate that risk by examining how their systems are affected when that dependency is unavailable to any degree, small or large, and by designing the means to minimize the damage to their systems, their brand, and their reputation.
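One common way to limit that damage is to degrade gracefully: stop hammering a vendor that is clearly down and serve the last known good answer instead. The Python sketch below illustrates the pattern under stated assumptions. `ResilientClient`, `VendorUnavailable` and the `vendor_call` parameter are hypothetical names invented for this example, and the breaker is deliberately simplified (a production version would retry the vendor after a cool-down period rather than wait for a manual reset).

```python
class VendorUnavailable(Exception):
    """Raised by the vendor call when the third-party service is down."""


class ResilientClient:
    """Wraps a vendor call with a simple circuit breaker and a cache."""

    def __init__(self, vendor_call, failure_threshold=3):
        self._vendor_call = vendor_call
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0
        self._last_good = {}  # key -> last successful vendor response

    @property
    def circuit_open(self):
        # Open means: too many failures in a row, stop calling the vendor.
        return self._consecutive_failures >= self._failure_threshold

    def reset(self):
        """Manually close the circuit (stand-in for a timed cool-down)."""
        self._consecutive_failures = 0

    def lookup(self, key):
        if not self.circuit_open:
            try:
                value = self._vendor_call(key)
            except VendorUnavailable:
                self._consecutive_failures += 1
            else:
                self._consecutive_failures = 0
                self._last_good[key] = value
                return {"value": value, "source": "vendor"}
        # Vendor failed or circuit is open: fall back to cached data.
        if key in self._last_good:
            return {"value": self._last_good[key], "source": "cache"}
        return {"value": None, "source": "unavailable"}
```

The `source` field matters as much as the value: it lets the rest of the system, and ultimately the customer, know when an answer may be stale instead of failing silently or showing an error page.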
The more airlines choose to treat operations as a genuine part of their value stream and not merely a cost center, the better. Money spent on IT that helps them become proactive in finding potential problems and mitigating them, before they must become reactive to a catastrophic problem, is money well spent.