'Knightmare' Spotlights Crucial Role of Stress Tests

Technical errors have been blamed by Corporate America’s most senior executives for a string of outages this year that have racked up hundreds of millions of dollars in damages, drawing attention to the need for more extensive technology testing.

Knight Capital (NYSE:KCG) saw its survival threatened this week after newly installed software proved incapable of handling the realities of daily trading, leading to erroneous trades that caused $440 million in losses, or close to double the market maker’s quarterly revenue.

The company blamed the error on a “technology issue” following the software’s installation. If that sounds familiar, it’s because technology has become the excuse du jour for when things go awry.

“You can attempt to use best practices and do everything within reason but you can never predict when a glitch or spike in traffic is going to bring operations to a screeching halt,” TechSavvy Global CEO Scott Steinberg said.

Nasdaq OMX (NASDAQ:NDAQ), which botched Facebook’s (NYSE:FB) $16 billion initial public offering in May, similarly blamed a glitch that caused mass confusion over order execution on “poor design.”

And stock exchange BATS blamed a “serious technical failure” for a hiccup during its initial public offering in March that temporarily halted trading of Apple’s (NASDAQ:AAPL) stock, while the Royal Bank of Scotland (NYSE:RBS) attributed a payments backlog that crippled its system one night in June to a “computer glitch.”

Of course, there’s also the Flash Crash of 2010.

Tech failures have also been blamed for problems outside the financial services industry, including the Alyeska Pipeline spill two years ago that leaked 190,000 gallons of oil and a steam tube leak at a nuclear power plant last month attributed to “modeling errors.”

In a world where reliance on technology grows by the minute and human capital struggles to keep pace, a single glitch in a high-volume system can be catastrophic.

The growing prevalence of such failures, which have already led to hundreds of millions of dollars in losses, has put further pressure on industries to rigorously test new products and rewrite contingency plans.

“We really don’t have good stress tests that measure real world events,” said tech analyst Rob Enderle of the Enderle Group. “You can replicate events from the past, but anticipating what will happen in the future is incredibly difficult.”

Partly to blame is financial turmoil that has forced companies to scale down in recent years through budget and staff cuts. Many just don’t have the financial means to test for events they deem extremely unlikely.

Surely Knight Capital put the software it launched this week through extensive testing prior to launch, but as in all of these cases, it may have become too costly to test for unforeseeable circumstances.

“You may test server load if it’s hit by 100K in an hour but what happens if suddenly a million arrive?” Steinberg said. “Despite the best intentions, even the most well crafted stress test programs can’t account for every individuality.”
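Steinberg's point can be illustrated with a toy model. The Python sketch below is not any firm's actual test harness, and the capacity, backlog limit, and load figures are made-up assumptions. It steps a simulated system through one hour of traffic, minute by minute, and reports the first load level at which requests start getting dropped.

```python
def simulate_load(arrivals_per_hour, capacity_per_hour, max_backlog):
    """Simulate one hour of traffic and return the number of dropped requests.

    All three parameters are hypothetical knobs for this illustration.
    """
    per_min_in = arrivals_per_hour / 60
    per_min_out = capacity_per_hour / 60
    backlog = 0.0
    dropped = 0.0
    for _ in range(60):
        backlog += per_min_in                  # new requests arrive
        backlog -= min(backlog, per_min_out)   # system works through what it can
        if backlog > max_backlog:              # queue is full: overflow is lost
            dropped += backlog - max_backlog
            backlog = max_backlog
    return round(dropped)


def find_breaking_point(capacity_per_hour, max_backlog, loads):
    """Return the first load in `loads` that causes drops, or None if none do."""
    for load in loads:
        if simulate_load(load, capacity_per_hour, max_backlog) > 0:
            return load
    return None


if __name__ == "__main__":
    # Assumed figures: 150,000 requests/hour capacity, 10,000-request queue.
    loads = [100_000, 200_000, 500_000, 1_000_000]
    print(find_breaking_point(150_000, 10_000, loads))
```

Under these assumed numbers, the system handles 100,000 requests an hour without loss but begins dropping requests at 200,000, well short of the million-request surge Steinberg describes. A test that stops at the expected peak would never reveal that.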

Trying to stay ahead of constantly advancing technology, some companies move too quickly to test for everything, while others may skip extensive testing for fear they won’t have the time or money to fix the shortfalls it uncovers.

“What’s happening now is we’re rolling out applications so quickly and into many more environments,” said Randy Clark, software company UC4’s chief marketing officer. “That combination is a recipe for failure.”

Perhaps some of the detrimental outcomes of these glitches could have been avoided or at least minimized if companies simply prepared for failure.

Since many of these errors, both human and technological, may be unavoidable without hard-to-come-by unlimited stress test capacity, companies at certain inflection points, such as an IPO or new product launch, should anticipate road bumps and lay out back-up plans. Having extra human capital and tools at the ready may stop a malfunctioning process before it gets out of hand.

“We live in a world you can’t really predict,” Enderle said. “Plan for the fact that you’re going to have a catastrophic event and then figure out best and fastest way to recover from it.”

At the same time, new technologies (if thoroughly tested, of course), such as UC4’s software that automates the release of new applications, could help IT better monitor system launches. UC4’s software also allows companies to roll back to the previously used system if necessary.
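The roll-back idea can be sketched in a few lines of Python. This is a hypothetical illustration, not UC4's product or API: the `ReleaseManager` class, its methods, and the health check are all invented for the example. A new version is switched on, checked, and automatically reverted to the previous one if the check fails.

```python
class ReleaseManager:
    """Hypothetical release tracker that can revert a failed deployment."""

    def __init__(self, current_version):
        self.current = current_version
        self.history = []  # versions that were successfully replaced

    def deploy(self, new_version, health_check):
        """Switch to new_version; revert if health_check(new_version) is false.

        Returns True if the new version stays live, False if it was rolled back.
        """
        previous = self.current
        self.current = new_version
        if health_check(new_version):
            self.history.append(previous)
            return True
        self.current = previous  # roll back to the previously used system
        return False


if __name__ == "__main__":
    manager = ReleaseManager("v1")
    # A stand-in check that always fails, to show the rollback path.
    went_live = manager.deploy("v2", health_check=lambda version: False)
    print(went_live, manager.current)
```

In this sketch the failed release leaves `v1` in place. In practice, the hard part is the health check itself, which has to catch a misbehaving system quickly enough to matter.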

“These people are trying to automate millions of transactions and ensure all different work flows and the only way they can do it is with greater visibility and control,” said Clark, whose company’s software pledges to eliminate more than half of data center app failures.

Of course, part of that preparation is communication. Nasdaq CEO Bob Greifeld was criticized in the aftermath of the Facebook fiasco because he was off the grid for five hours while flying from Silicon Valley to New York City as the debacle unfolded.

During his absence, investors and the Securities and Exchange Commission were desperately hunting for answers.

“You have to trust your people but also have to maintain open lines of communication when a fire does pop up,” Steinberg said. “It’s like Murphy’s Law – if it can it will fail, expect it to do so and at the worst possible time.”

Perhaps the growing problem, highlighted in recent months by such high-profile cases, has captured the world’s attention and underscored the need to test.

“I think there’s rising industry awareness around it,” Steinberg said. “The issue is there are so many potential points of failure that to some extent it’s a game of chance.”

At the same time, it sometimes takes an extreme catastrophe before a problem is fixed, and Enderle said he thinks there are going to be “a lot more outages and probably some big ones before this is over.”

“The reality is, regardless, computers are here to stay,” Tenxer CEO Jeff Ma told FOX Business. “The question is how do we tighten things up so when there’s a mistake in the computers it doesn’t go rampant?”