Down, not out – What happens when service providers go dark?

The recent CrowdStrike outage affecting over 8.5 million Microsoft Windows devices and causing worldwide disruption has raised questions around how firms can ensure their operational resilience can withstand critical service providers ‘going dark’.

30 July 2024 9 mins read
By Jay Hampshire
Written by humans

In brief:

  • On 19 July, 2024, a “routine software update” from endpoint security provider CrowdStrike led to “the largest IT outage in history,” with 8.5 million systems affected
  • With financial firms, airlines, healthcare services, and media outlets affected, the incident raised alarming questions about operational resilience and overreliance on third-party providers
  • Financial regulators have been increasing focus on operational resilience requirements for some time – but the right tech solutions can help firms both stay compliant and avoid ‘going dark’

The technologies we use every day rely on an ever more complex web of connections and providers. While we may think that a particular service operator is responsible for an application or platform, scratch the surface and hidden depths of interconnected partnerships and third-parties are exposed – as well as hidden risks.

With operational resilience requirements climbing up regulatory agendas, with examples ranging from the Digital Operational Resilience Act (DORA) to FINRA Rule 4370, firms are under increasing pressure to ensure that – should services be disrupted – they are ‘down but not out’. Because, as we’ve witnessed recently, when the dominoes start falling, they fall very fast, and very hard.

Crowdstruck

On Friday, 19 July, 2024, a “seemingly routine software update” from digital security vendor CrowdStrike caused chaos, affecting 8.5 million Microsoft Windows devices worldwide. In what is being described as “the largest IT outage in history,” banks, airlines, healthcare providers, television stations, and businesses experienced a “blue screen of death” and were unable to utilize their systems.

The unprecedented scale of the issue was compounded by the proposed fix, with CrowdStrike informing customers that they must manually restart every affected system, remove the update file, then restart again. Allie Mellen, IT analyst at Forrester, detailed that this could involve “manual, hands-on keyboard work … for hundreds of thousands of affected machines” – an incredibly resource intensive process.

Microsoft vice-president David Weston reassured those affected that the 8.5 million systems impacted represents “less than 1% of all the Windows machines worldwide.” But when that supposedly minimal figure includes some of the world’s largest institutions, and resulted in near total loss of service, it raises questions about operational resilience in the face of monopoly. Microsoft, Google, and Amazon account for two-thirds of the entire cloud provider market between them, and CrowdStrike itself close to a fifth of the endpoint cyber security market.

Fewer providers at the top, giving consumers less choice, makes it much harder for organizations to build redundancy and diversity into their operational resilience strategy – especially when those providers rely on a web of third-party providers to operate.

Major global firms like JPMorgan Chase, UPS, and Bloomberg were affected, and traders were left unable to access systems and perform trades. While this scale of disruption is unprecedented, financial regulators have been steadily increasing their expectations around operational resilience to ensure market stability in the face of disruption – and the CrowdStrike situation will no doubt light a fire under this regulatory drive.

How can firms ensure a higher standard of operational resilience?

Regardless of the size of an organization, building a technology ecosystem entirely from scratch would be cost, time, and resource prohibitive. Reliance on service providers is an accepted part of operating a business. But overreliance on service providers can be problematic, and expose a firm to risk.

As we’ve witnessed, all it takes is for one weak link in a service provider’s chain to cause widescale disruption. For firms looking to optimize operational resilience, diversification of partners can be vital – as can selecting the right one.

1) Solutions that are built in-house

The more links there are in a chain, the greater the odds that one of them is weak. The increasingly interconnected ecosystem of service providers means that some platforms – even household names like Microsoft – rely on smaller vendors for certain services and, should those tertiary providers go down, there’s the potential for a domino effect.

By selecting partners that build their solutions in-house and don’t rely on other providers to buy or deliver services, firms can be confident that those solutions are solid, integrated properly with their systems and workflows, and they have a single point of contact, truth, and accountability when required. These solution providers are also able to work with their customers to build bespoke or tailored apps and platforms to better suit their unique needs, rather than relying on off-the-shelf solutions.  

2) Private cloud solutions and infrastructure

Working with providers that build infrastructure and host data on private cloud, rather than widescale public cloud services like AWS or Azure, gives firms better control. Big public cloud service providers are tempting targets for cyber criminals and rely on interconnected service providers, meaning there’s potential for increased outages, downtime, or data loss should those third-parties go dark.

Private cloud providers give greater confidence and stability – firms can be confident in where their data is held, in dedicated data centres and not ‘in the mix’ of a potentially confusing public cloud architecture. They are also able to focus on ensuring security, with fewer connections and endpoints to manage, and with direct oversight of any coming updates or systems changes from end-to-end – so they won’t surprise you with unexpected downtime.

3) Decrease reliance on a complex web of systems and providers

By partnering with a single service provider that supplies an end-to-end solution, firms are able to cut through the increasingly complex web of third-party systems and providers. This has considerable benefits for systems integration, as they only need to work with one provider to integrate into systems and workflows. It also gives clear benefits in cost reduction, as firms are only paying one set of fees, and employees only need to take the time to learn one platform and user interface.

Reducing the number of systems and providers also reduces the number of potential endpoints that could be exploited by bad actors, and the number of routes where unforeseen integration or update issues could affect your operations.

4) Always-on support, not social-media confusion

When large scale disruption hits, communications – official and unofficial – often take place across social media, via X updates or statements on company profiles. Having to piece together which services are down and the timelines for fixes, or getting lost in the swirl of rumours and misinformation while business critical services are offline, can lead to confusion and slow down the rate at which situations can be resolved. When CrowdStrike went down, the internet was awash with rumours of apparent ‘fixes’.

Having access to a solution provider with dedicated, 24/7, ‘always-on’ support means that, whether discussing business-as-usual troubleshooting, integration and setup questions, or a more substantial issue, firms are able to get the answers they need efficiently. When business critical services are affected, time is of the essence, and being able to contact a dedicated member of technical support, fast, can make all the difference.

Hearing about issues, updates, and fixes directly from your vendor, rather than having to navigate the potential misinformation and slow response times of social media, gives you a direct line to getting operations back up and running if the worst should happen.

Where do regulators stand on operational resilience?

DORA

The Digital Operational Resilience Act (DORA) is a European Union regulation that entered into force in 2023 and applies to all regulated businesses that operate in both the U.K. and EU. The act includes five pillars that firms need to ensure they act on:

  • ICT risk management, including implementing internal and external controls and governance strategies
  • Third-party risk monitoring and management via key contractual provisions
  • Regular operational resilience testing across basic and advanced levels
  • Establishing an incident reporting framework including factors such as customer impact, downtime duration, economic impact, and geographical spread
  • Information sharing across the industry so firms are notified of potential operational and cyber risks

FCA

The U.K.’s Financial Conduct Authority (FCA) has established final rules and guidance around operational resilience in PS21/3: Building operational resilience. These rules require that “by no later than March 31, 2025,” firms including banks and insurers must have:

  • Performed mapping and testing so that they are able to remain within impact tolerances for each important business service
  • Made the necessary investments to enable them to operate consistently within their impact tolerances
  • Identified any vulnerabilities in their operational resilience
  • Included third-party services in scenario testing (or ensured those third parties have conducted their own)
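
As a rough illustration of the mapping and testing described above, the sketch below maps each important business service to the third-party providers it depends on, then checks a hypothetical outage scenario against each service's impact tolerance. The service names, provider names, tolerance figures, and the `breaches` function are all assumptions made for illustration; they are not taken from PS21/3 or any regulator's guidance.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessService:
    name: str
    impact_tolerance_hours: float  # hypothetical maximum tolerable disruption
    providers: frozenset[str]      # third parties this service depends on


def breaches(services, outage_hours_by_provider):
    """Return the names of services whose longest provider outage exceeds tolerance."""
    breached = []
    for service in services:
        # A service is assumed to be down for as long as its worst-hit provider is down.
        worst = max(
            (outage_hours_by_provider.get(p, 0.0) for p in service.providers),
            default=0.0,
        )
        if worst > service.impact_tolerance_hours:
            breached.append(service.name)
    return breached


# Illustrative mapping only: names and figures are invented.
services = [
    BusinessService("trade execution", 2.0, frozenset({"endpoint-security", "cloud-a"})),
    BusinessService("client reporting", 24.0, frozenset({"cloud-a"})),
]

# Scenario: the endpoint security vendor is unavailable for eight hours.
scenario = {"endpoint-security": 8.0}
print(breaches(services, scenario))  # ['trade execution']
```

Running several scenarios against the same mapping gives a simple way to see which important business services hinge on a single provider, and so where vulnerabilities may lie.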

OCC

In March, 2024, the Office of the Comptroller of the Currency (OCC) in the U.S. set out a focus on “exploring baseline operational resilience requirements for large banks with critical operations, including third-party service providers.” The five “baseline requirements” that the OCC will be focused on include:

  • Establishing clear definitions for identifying critical activities and core business lines
  • Defining tolerances for disruption
  • Requiring testing and validation of resilience capabilities
  • Incorporating third-party risk management expectations
  • Stipulating clear communication expectations among stakeholders and counterparties

CFTC

The Commodity Futures Trading Commission (CFTC) proposed a rulemaking in December, 2023, that would require firms to establish risk appetites and tolerance limits, and an operational resilience framework built on three pillars:

  • Information and technology security
  • A third-party relationship program to manage risks presented by mission-critical third-party service providers
  • A business continuity and disaster recovery plan

FINRA

The Financial Industry Regulatory Authority (FINRA) has set rules requiring firms to establish written Business Continuity Plans (BCPs), including Rule 4370. FINRA has outlined that a BCP should include:

  • Annual reviews of an established plan to ensure effectiveness
  • Automated data backup and recovery that is readily accessible
  • Identifying mission-critical systems
  • Risk assessments of critical third-party vendors and response plans for disruption to their services

One of the first crucial steps towards increasing operational resilience is to take stock of your third-party partners and solution providers. Knowing your vendors, understanding the services they operate, and how those solutions can support and enhance your operational resilience strategy is vital in ensuring your organization is as resilient as possible – and meeting growing regulatory requirements.

 
