Difference between revisions of "Resilience"

From MgmtWiki

Latest revision as of 14:31, 11 May 2025

Full Title or Meme

Resilience of any complex ecosystem is the capacity of an ecosystem to respond to a perturbation or disturbance by resisting damage and recovering quickly.

The opposite of Fragility.

Goal

  • In Identifier and Access Management systems resilience must be framed as the ability for any user to access their resources safely and with minimal complexity at a reliability level that is higher than some specified minimum.
  • Safely refers to access of records without exposing them to unauthorized users or corruption of data.
  • Minimum complexity is entirely determined by the user, but may be different from one user to another.
    • A healthcare patient's complexity must be manageable by a minimum fraction of the population at some level of educational ability, say completion of 8th grade.
    • For a healthcare physician, the cost of a false positive may also be considered in determining the level of complexity.
  • The Reliability will be calculated as the probability of getting access at any time and must be in the range of 99.9% to 99.999% which may be dependent on the criticality of failure to authenticate in time to preserve life and property.
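
To make the reliability range above concrete, the following minimal sketch converts an availability target into the amount of failed access it permits per year. The function name and the choice of a 365-day year are assumptions made for illustration only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def downtime_budget_seconds(availability_percent: float,
                            period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the maximum seconds of failed access allowed in the period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # The two ends of the range named in the Goal section, plus the midpoint.
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% -> about {minutes:.1f} minutes of lost access per year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of lost access per year, while 99.999% allows only about 5 minutes, which is why the higher figure is reserved for cases where a failure to authenticate in time threatens life or property.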

Context

An oak is strong. A reed is weak. But in a terrible storm, the oak is uprooted and the reed survives. Markus Brunnermeier, an economics professor at Princeton University, takes that metaphor from the French poet Jean de La Fontaine as the theme of his new book, “The Resilient Society.” His message: Like the reed, we must bend, not break.[1][2]

  • On 2023-03-15 a new presidential initiative was announced.
  • Back in 2011, the White House introduced The National Strategy for Trusted Identities in Cyberspace (NSTIC), an initiative bringing together the private sector, advocacy groups, public sector agencies and other organizations to improve the privacy, security and convenience of online transactions. The Identity Ecosystem envisioned in the NSTIC is an online environment where individuals and organizations are able to trust each other because they follow agreed-upon standards to obtain and authenticate their digital identities, and the digital identities of devices. To achieve this objective, the NSTIC established guiding principles for the creation of an Identity Ecosystem, developed with identity solutions that are:
  1. Privacy-enhancing and voluntary,
  2. Secure and Resilient,
  3. Interoperable and
  4. Cost-effective and easy to use.
  • These principles were used as the source of the principles for the Identity Ecosystem Framework (IDEF).
  • Unfortunately the IDEF itself was not resilient and failed to accomplish its mission. When the administration turned over, it just vanished. The takeaway is that presidential commissions themselves are not resilient.

Problems

  • It seems to be a feature of any component of a Living System (which includes all of society's imposed structures) that the most successful systems migrate towards solutions which make the most efficient use of the resources at their disposal. For the system as a whole to be Resilient, the inevitable failure of any subsystem that is highly leveraged must not imperil the whole system, or it will not survive change. In other words, in resource-rich times, the efficient organism will have an advantage, but in times of wildly varying resources, the Resilience of the organism will be more important.[3]
    [The report] showed that resilience pays off. It is likely adding resources for resilience initially increases its costs without expanding functionally, causing an initial decline in short-term efficiency. However, cyber disruptions are increasingly likely (if not certain), which decrease the system's functionality and simultaneously increase its costs due to lost customers or users, lawsuits, and other damage. A system that prepared for resilience has lower declines in functionally and fewer cost overruns, and this advantage can more than compensate for the initial cost of adding resources.
  • The size of change most likely follows a power law; that is, small changes are far more frequent than large ones. If a system is resilient only to small changes, then the large changes will imperil the system.[4]
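
As a rough illustration of the power-law point above, the following sketch draws hypothetical change sizes from a Pareto distribution and measures how much of the total impact comes from the largest 1% of events. The exponent of 1.5, the fixed seed, and the function names are all assumptions chosen for illustration; they are not taken from the cited source.

```python
import random
import statistics


def sample_change_sizes(count: int, alpha: float = 1.5, seed: int = 42) -> list[float]:
    """Draw `count` hypothetical change sizes from a Pareto (power-law) distribution."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    return [rng.paretovariate(alpha) for _ in range(count)]


def top_share(sizes: list[float], fraction: float = 0.01) -> float:
    """Return the share of total impact contributed by the largest `fraction` of events."""
    ordered = sorted(sizes, reverse=True)
    top_count = max(1, int(len(ordered) * fraction))
    return sum(ordered[:top_count]) / sum(ordered)


if __name__ == "__main__":
    sizes = sample_change_sizes(10_000)
    print(f"median change size: {statistics.median(sizes):.2f}")
    print(f"largest change size: {max(sizes):.2f}")
    print(f"share of total impact from the top 1% of events: {top_share(sizes):.1%}")
```

If every event were the same size, the top 1% would contribute exactly 1% of the total. In a heavy-tailed sample the share is much larger, which is the sense in which a system tuned only for typical changes remains exposed to the rare large one.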

Examples

  • Square outage leaves sellers unable to process payments (2023-09-08) Square suffered a major systems outage that left sellers unable to access their accounts or process payments for more than 12 hours. If you have a restaurant that uses Square for all payment processing, this outage is catastrophic. If you saved the old zip/zap card machines as a backup, the new cards do not have embossed information. As technology expands we all become more vulnerable to events like this.
  • Extreme weather always tests the Resilience of systems on which the public depends. At the end of 2022 a storm that started in the Northwest of the US and wound up in the Northeast caused major disruptions to airlines. Most of the airlines came back into operation within a day or two, but Southwest had designed for efficiency rather than resilience. It would appear that all of the savings of efficient operation were wiped out by the chaos caused by weeks of failure to meet its committed flight schedule. As the New York Times reported: [5]
    This problem — relying on older or deficient software that needs updating — is known as incurring “technical debt,” meaning there is a gap between what the software needs to be and what it is. While aging code is a common cause of technical debt in older companies — such as with airlines which started automating early — it can also be found in newer systems, because software can be written in a rapid and shoddy way, rather than in a more resilient manner that makes it more dependable and easier to fix or expand. As you might expect, the former is cheaper and quicker.... if you are a corporate executive whose compensation is tied to stock prices and earnings statements released every three months, there are strong incentives to address any immediate problem by essentially adding a bit of duct tape and wire to what you already have, rather than spending a large amount of money — updating software is costly and difficult — to address the root problem. Then you can cross your fingers and hope that whatever catastrophe may be in the making, it erupts under someone else’s future tenure. Such bets often pay off since, increasingly, the plight of a company’s customers and employees is divorced from the immediate fortunes of its current top executives.
  • We cannot expect the systems that brought us to this society of high efficiency and low resilience to adapt to a system of high resilience, even if the quarterly profits do not always meet expectations. Even Einstein realized that "The thinking that got us to where we are is not the thinking that will get us to where we want to be."
  • One example of the big changes exposed by the COVID-19 virus in 2020 stemmed from United States capitalists' move to off-shore manufacturing that involved significant amounts of manual labor, together with the just-in-time logistics theory, which held that any inventory was just unused capital. A case in point was the manufacture of the face masks that were critical to the health of the workers combating the virus. In the meantime the Trump White House had eliminated the disease experts in the National Security Office. The result was "A very American story about capitalism consuming our resiliency."[6] Both of these efficiencies made the country susceptible to shortages of many clinical components, as there was no planning for or control over the recovery of that capability. Note that there was a strategic inventory of medical supplies, like masks, but it was depleted in the H1N1 virus emergency in 2009 and was never replenished.
  • During the reign of Jack Welch at General Electric the company prospered wildly as a result of applying vulture capitalism principles at every level of the company. Welch retired a hero. The subsequent near-total collapse of the company seems to not have been his fault, but any student of planning and control knows that optimizing for only the short term effects will eventually lead to a situation that was not planned for and cannot be controlled.
  • In 2020-11 the Pfizer company announced a new COVID-19 vaccine that caused "an event that statistically never could happen"[7] but it did, and all the models built to enable resilience were unable to recover. What's worse, all of the stock traders and statisticians (quants) who made money all those years with no "claw-back" provisions learned the wrong lesson and will just go off and do it again.
  • In identifier and access management, problems can be introduced by attacks which cause loss of access, so both the likelihood of loss of access and the time to recover access must be determined to the extent possible.
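
One common way to combine the likelihood of losing access with the time to recover it is the steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below applies it under stated assumptions: loss-of-access events are treated as independent, the mean time between them (MTBF) and the mean time to recover (MTTR) are taken as known, and the example figures and function names are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_recovery_hours(mtbf_hours: float, target_availability: float) -> float:
    """Largest mean recovery time that still meets the target availability (0 < target <= 1)."""
    if not 0.0 < target_availability <= 1.0:
        raise ValueError("target availability must be in (0, 1]")
    # Rearranged from A = MTBF / (MTBF + MTTR)  =>  MTTR = MTBF * (1 - A) / A
    return mtbf_hours * (1.0 - target_availability) / target_availability


if __name__ == "__main__":
    # Hypothetical figures: one loss-of-access event every 90 days, 4 hours to recover.
    mtbf = 90 * 24
    print(f"availability: {availability(mtbf, 4):.4%}")
    print(f"recovery time needed for 99.9%: {max_recovery_hours(mtbf, 0.999):.2f} hours")
```

With these assumed figures the availability comes out near 99.8%, below a 99.9% floor. Recovery would have to average about 2.2 hours to meet that target, which shows why both the frequency of loss and the recovery time have to be estimated.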

Solutions

  • In the end, each system must determine the level of efficiency and resilience that it desires. Too much caution will miss out on the many small opportunities for change that occur every day. Too much recklessness will result in inevitable failure in the long term.

IAM Principles of Resilience based on Gartner

  1. RISK CULTURE - Stop focusing on checkbox compliance, and shift to risk-based decision making.
  2. OUTCOME FOCUS - Stop solely protecting infrastructure, and begin supporting business outcomes.
  3. BETTER FACILITATE - Move from defender to facilitator, and balance protecting with delivering business outcomes.
  4. MAKE WORKFLOW - Move from trying to control information flow to understanding how it flows and what risks it carries.
  5. PEOPLE-CENTRIC - Accept the Limits of Technology and Become People-Centric.
  6. INCLUSIVE - Ensure that the greatest good for the greatest number applies to the entirety of society and not just large corporations.
  7. DETECT RESPOND - Stop striving for 100% protection, and invest in detection and response.

Gartner’s researchers predicted that by 2017 50% of IT spending would occur outside of traditional IT department control. They noted that we are at an intersection of two extraordinary digital trends: the ongoing transformation of digital business and the ever-growing capacity and capability of adversaries.

Avoiding Risk

It might seem paradoxical, but risk avoidance does not lead to resilience. Markus Brunnermeier[2] argues that resilience can serve as the guiding North Star for designing a post-Covid-19 society. Risk is not to be avoided. It’s only by taking risks that society achieves breakthroughs. And a society that doesn’t take risks becomes fragile. “Perhaps paradoxically, enduring a small crisis from time to time can be preferable to avoiding them at any cost. A crisis is an opportunity to make needed adjustments.”

Resilience in Complex Adaptive Systems

Operating at the edge of failure - The scientific basis of resilience

Systems are complex...

  • Unexpected behaviors
  • Unexpected responses to interventions
  • New forms of failure
  • Changing (in obvious and not so obvious ways)

Operators are continuously...

  • Monitoring some parts of the system (but never all of them)
  • Exploiting opportunities
  • Estimating the distance to failure (if they are not already operating under time constraints)
  • Reacting to threats (usually because of some outside information)
  • Anticipating future conditions (the human brain is like a time machine, looking backward and forward in quick succession)
  • Learning system features (typically by experience, but hopefully by study)

What are the Boundary Conditions for successful operation - Modified from Rasmussen, 1997

  • ACCIDENT Boundary - cross this and there is a problem that could well be existential - unfortunately no one knows where this is
  • ECONOMIC FAILURE Boundary - cross this and the business fails
  • UNACCEPTABLE WORKLOAD Boundary - cross this and everyone quits


Richard Cook, MD, Professor of Healthcare Systems, School of Technology & Health, Royal Institute of Technology, Stockholm, Sweden. www.ctlab.org https://www.youtube.com/watch?v=PGLYEDpNu60

Marconi Society

  • 2025-02-11 The Internet Resiliency Workshop, organized by the Marconi Society Internet Resilience Institute, gathered more than 30 experts to address important challenges related to the resilience of Internet infrastructure. The workshop aimed to discuss the vision for a resilient Internet that we all desire and explore the steps needed to achieve it.
  • 2024-11 Washington DC - A Marconi Society Internet Resilience Institute initiative, the workshop’s objective was to discuss the resilient Internet we all want, and how to get there. The Internet’s fundamental technical architecture continues to provide a solid foundation. However, discussions identified areas for ongoing refinement and strengthening, specifically within the Border Gateway Protocol (BGP) for Internet address routing, the Domain Name System (DNS), and the Certificate Authority (CA) system.

The workshop identified four primary threats:

  1. increasing system complexity,
  2. intensifying regulatory pressures,
  3. insufficient funding for preventive measures,
  4. and software supply chain vulnerabilities.

For instance, the interdependence between electrical power and Internet infrastructure creates a “circle of dependencies” where each requires the other to function. Modern software development practices have introduced a “crisis of complexity,” with applications depending on numerous APIs and third-party services whose security is often indeterminate.

The regulatory landscape emerged as perhaps the most pressing challenge, with policy issues expected to influence Internet development over the next 10-20 years in a more direct way than before. The relationship between technical operators and government policymakers and regulators has become strained as Internet and Internet-enabled services are now embedded in every aspect of modern life. The technical community’s traditional approach of fixing problems as they arise is now politically untenable. Governments demand clear accountability and quick responses to incidents given the impact of the Internet on all aspects of the economy and national security. There is a clear need to build and maintain constructive public-private partnerships.

The workshop revealed a fundamental tension in how resilience is funded and prioritized. Participants repeatedly emphasized that “resilience is a prevention problem, and prevention does not attract money.” While reactive measures to incidents readily attract funding and attention, the crucial work of preventing failures through good operational practices, proper training, and systematic thinking about dependencies is often underfunded. This challenge is compounded by information asymmetry between different stakeholders – operators, regulators, and users often have different levels of information and understanding about incidents and their causes.

The software supply chain emerged as a critical concern, with participants noting widespread dependence on poorly validated and under-funded open-source libraries. This led to recommendations for developing systematic curricula for Internet-scale infrastructure operations, moving beyond the current reliance on anecdotal “war stories” for training. The workshop identified key audiences including network operators, engineers, and C-suite executives (CIOs, CFOs, CISOs), with regulators and policymakers as important secondary audiences globally.

The workshop established nine comprehensive work-streams addressing best practices, accountability protocols, infrastructure support mechanisms, operational practices, and talent development:

  1. Best Practices Framework/Badges
  2. Accountability, Agency and Risk Management
  3. Create a group, process or funding mechanism to support critical infrastructure
  4. Build and Promote “Always Be Rolling” Program
  5. Collaborative Exercises and Information Sharing
  6. Infrastructure and Sectoral Dependencies
  7. Education and Talent Development
  8. Governance and International Collaboration
  9. Evolving Resilience Goals

These initiatives aim to balance immediate operational needs with long-term strategic goals. The workshop emphasized connecting resilience efforts to business metrics like Service-Level Agreements (SLAs) and customer experience, while noting the challenge of justifying investment in infrastructure components that appear low value until they fail. The Marconi Society was designated to serve as a channel for raising awareness rather than implementing technical solutions directly. Discussion included plans to produce a comprehensive paper providing concrete examples and evidence for stakeholders and convening follow-on meetings that advance the understanding of these topics.

In conclusion, participants agreed that to get the resilient Internet we want, a few important things must happen: 1) improved dialogue between technical experts and policymakers; 2) better incident response frameworks; 3) systematic approaches to identifying and managing complex interdependencies; and 4) learning from best practices in other industries (for example, power, telecom). Research should be conducted to evaluate best practices in other critical infrastructure sectors, including inviting relevant experts in those fields.

The workshop recognized that Internet resilience is part of a complex interdependent system and that dependencies must first be identified to provide a foundation for future building blocks. The path forward involves partnering across sectors with technical organizations, academic institutions, civil society organizations, and Internet governance bodies to amplify the message and reach key stakeholders, while addressing the persistent challenge of funding preventive measures over reactive responses.

Design and Test

  • Most design looks only at the common use cases.
  • Design for Resilience requires use cases that are at the edge of performance and attack from malicious or untrained users.
  • Testing for Resilience requires overloading the system both from a load and from an attack perspective.
  • Resilience cannot depend on others working as expected. Maersk's network was devastated when all of the back-ups to their DNS were destroyed by a single malignant virus. They were only saved by the accident that one of the DNS servers was off-line due to a power failure.
  • Part of resilience is how each networked component works as a part of a system.[8] For example, a car that stalls in a garage is of no consequence to anyone but its user. On the other hand, if it stalls in the middle of an Interstate at rush hour it can impact the trips of thousands. The ability of the highway system to handle that stalled car is a measure of its Resilience.
  • The difficulty with measuring Resilience is that faults are inherently non-deterministic. The wiki page on Self-organization describes several examples of the power-law distribution of impacts that are commonly low impact, but can unexpectedly result in major impacts, like earthquakes that are nearly imperceptible until the "Big One" comes and brings down all of San Francisco.

See page on Intelligent Design

References

  1. Peter Coy, How a Princeton Economist Teaches Resilience, New York Times (2021-09-27) https://www.nytimes.com/2021/09/27/opinion/resilience-princeton-economist.html
  2. Markus Brunnermeier, The Resilient Society (2021-08-23) Endeavor ISBN 978-1737403609
  3. Igor Linkov et al., Cyber Efficiency and Cyber Resilience, CACM 66 no. 4, pp. 33ff. (2023-04)
  4. Nassim Nicholas Taleb, The Black Swan - The Impact of the Highly Improbable (2007) Random House ISBN 978-1-4000-6351-2
  5. Zeynep Tufekci, The Shameful Open Secret Behind Southwest’s Failure, New York Times (2022-12-31) https://www.nytimes.com/2022/12/31/opinion/southwest-airlines-computers.html
  6. Farhad Manjoo, How the World's Richest Country ran out of a 75-Cent Face Mask. (2020-03-26) The New York Times p A22
  7. Justina Lee, Quant Shock That Never Could Happen Hits Wall Street Models, Business Week (2020-11-13) https://www.bloomberg.com/news/articles/2020-11-13/quant-shock-that-never-could-happen-hits-wall-street-models?cmpid=BBD111320_BIZ&utm_medium=email&utm_source=newsletter&utm_term=201113&utm_campaign=bloombergdaily
  8. Ted G. Lewis, The Many Faces of Resilience CACM 66 no. 1 p. 56ff (2023-01)

Other material

  • Wikipedia has a great entry on Ecological resilience which explains many of the interactions to be aware of.