Resilience

Full Title or Meme

Resilience is the capacity of any complex ecosystem to respond to a perturbation or disturbance by resisting damage and recovering quickly.

Goal

  • In Identifier and Access Management systems, resilience must be framed as the ability of any user to access their resources safely and with minimal complexity, at a reliability level higher than some specified minimum.
  • Safely refers to access of records without exposing them to unauthorized users or to corruption of data.
  • Minimum complexity is entirely determined by the user, and may differ from one user to another.
    • For a healthcare patient, the complexity must be manageable by at least some minimum fraction of the population at a given level of educational ability, say completion of 8th grade.
    • For a healthcare physician, the cost of a false positive may also be considered in determining the acceptable level of complexity.
  • Reliability is calculated as the probability of gaining access at any given time. It must fall in the range of 99.9% to 99.999%, depending on how critical a failure to authenticate in time is to preserving life and property.
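The reliability range above translates directly into a yearly downtime budget. The following is a minimal sketch, assuming reliability is measured as availability over a 365-day year; the function name `allowed_downtime_minutes` is hypothetical and not part of any standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # The two ends of the range named in the text, plus the midpoint.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of lost access per year, while 99.999% allows a little over 5 minutes.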

Context

An oak is strong. A reed is weak. But in a terrible storm, the oak is uprooted and the reed survives. Markus Brunnermeier, an economics professor at Princeton University, takes that metaphor from the French poet Jean de La Fontaine as the theme of his new book, “The Resilient Society.” His message: Like the reed, we must bend, not break.[1][2]

  • On 2023-03-15 a new presidential initiative was announced.
  • Back in 2011, the White House introduced The National Strategy for Trusted Identities in Cyberspace (NSTIC), an initiative collaboratively bringing together the private sector, advocacy groups, public sector agencies and other organizations to improve the privacy, security and convenience of online transactions. The Identity Ecosystem envisioned in the NSTIC is an online environment where individuals and organizations are able to trust each other because they follow agreed-upon standards to obtain and authenticate their digital identities – and the digital identities of devices. To achieve this objective, the NSTIC established guiding principles for the creation of an Identity Ecosystem, developed with identity solutions that are:
  1. Privacy-enhancing and voluntary,
  2. Secure and Resilient,
  3. Interoperable and
  4. Cost-effective and easy to use.
  • These principles were used as the source of the principles for the Identity Ecosystem Framework (IDEF).
  • Unfortunately the IDEF itself was not resilient and failed to accomplish its mission. When the administration turned over, it just vanished. The takeaway is that presidential commissions themselves are not resilient.

Problems

  • It seems to be a feature of any component of a Living System (which includes all of society's imposed structures) that the most successful systems migrate towards solutions which make the most efficient use of the resources at their disposal. For the system as a whole to be Resilient, the inevitable failure of any highly leveraged subsystem must not imperil the whole system, or the system will not survive change. In other words, in resource-rich times the efficient organism will have an advantage, but in times of wildly varying resources the Resilience of the organism will be more important.[3]
    [The report] showed that resilience pays off. It is likely adding resources for resilience initially increases its costs without expanding functionality, causing an initial decline in short-term efficiency. However, cyber disruptions are increasingly likely (if not certain), which decrease the system's functionality and simultaneously increase its costs due to lost customers or users, lawsuits, and other damage. A system that prepared for resilience has lower declines in functionality and fewer cost overruns, and this advantage can more than compensate for the initial cost of adding resources.
  • The size of change most likely follows a power law; that is, small changes are far more frequent than large ones. If a system is resilient only to small changes, then the large changes will imperil the system.[4]
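To illustrate the power-law claim above, the following sketch uses a Pareto distribution, one common power-law model. The choice of distribution, the parameter values, and the function name `exceedance_probability` are all assumptions made for illustration; none of them come from the cited sources.

```python
def exceedance_probability(size: float, min_size: float = 1.0,
                           alpha: float = 1.5) -> float:
    """P(X > size) for a Pareto distribution with scale min_size and shape alpha."""
    if min_size <= 0 or alpha <= 0:
        raise ValueError("min_size and alpha must be positive")
    if size <= min_size:
        return 1.0  # every change is at least min_size by definition
    return (min_size / size) ** alpha


if __name__ == "__main__":
    # Hypothetical change sizes in arbitrary units.
    for size in (1, 10, 100, 1000):
        print(f"P(change > {size}) = {exceedance_probability(size):.6f}")
```

With the assumed shape of 1.5, each tenfold increase in size makes a change about 32 times rarer, yet never impossible: a system built only for changes below 10 units would still meet a larger one in roughly 3% of events.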

Examples

  • Square outage leaves sellers unable to process payments (2023-09-08): Square suffered a major systems outage that left sellers unable to access their accounts or process payments for more than 12 hours. For a restaurant that uses Square for all payment processing, such an outage is catastrophic. Even if you have saved the old zip/zap card machines as a backup, the new cards do not have embossed information. As technology expands, we all become more vulnerable to events like this.
  • Extreme weather always tests the Resilience of systems on which the public depends. At the end of 2022 a storm that started in the Northwest of the US and wound up in the Northeast caused major disruptions to airlines. Most of the airlines came back into operation within a day or two, but Southwest had designed for efficiency rather than resilience. It would appear that all of the savings of efficient operation were wiped out by the weeks of chaos caused by its failure to meet its committed flight schedule. As the New York Times reported: [5]
    This problem — relying on older or deficient software that needs updating — is known as incurring “technical debt,” meaning there is a gap between what the software needs to be and what it is. While aging code is a common cause of technical debt in older companies — such as with airlines which started automating early — it can also be found in newer systems, because software can be written in a rapid and shoddy way, rather than in a more resilient manner that makes it more dependable and easier to fix or expand. As you might expect, the former is cheaper and quicker.... if you are a corporate executive whose compensation is tied to stock prices and earnings statements released every three months, there are strong incentives to address any immediate problem by essentially adding a bit of duct tape and wire to what you already have, rather than spending a large amount of money — updating software is costly and difficult — to address the root problem. Then you can cross your fingers and hope that whatever catastrophe may be in the making, it erupts under someone else’s future tenure. Such bets often pay off since, increasingly, the plight of a company’s customers and employees is divorced from the immediate fortunes of its current top executives.
  • We cannot expect the systems that brought us to this society of high efficiency and low resilience to adapt to a system of high resilience, even if the quarterly profits do not always meet expectations. Even Einstein realized that "The thinking that got us to where we are is not the thinking that will get us to where we want to be."
  • One of the big changes brought about by the COVID-19 virus in 2020 was rooted in two moves by United States capitalists: off-shoring manufacturing that involved significant amounts of manual labor, and adopting just-in-time logistics theory, under which any inventory is just unused capital. One example was the manufacture of the face masks that were critical to the health of the workers combating the virus. In the meantime, the Trump White House had eliminated the disease experts in the National Security Council. The result was "a very American story about capitalism consuming our resiliency."[6] Both of these efficiencies made the country susceptible to shortages of many clinical components, as there was no planning for or control over the recovery of that capability. Note that there was a strategic inventory of medical supplies, like masks, but it was depleted in the H1N1 virus emergency in 2009 and was never replenished.
  • During the reign of Jack Welch at General Electric the company prospered wildly as a result of applying vulture capitalism principles at every level of the company. Welch retired a hero. The subsequent near-total collapse of the company seems to not have been his fault, but any student of planning and control knows that optimizing for only the short term effects will eventually lead to a situation that was not planned for and cannot be controlled.
  • In November 2020 the Pfizer company announced results for a new COVID-19 vaccine, triggering "an event that statistically never could happen"[7]. It did happen, and all the models built to enable resilience were unable to recover. What's worse, the stock traders and statisticians (quants) who made money all those years with no "claw-back" provisions learned the wrong lesson and will just go off and do it again.
  • In identifier and access management, problems can be introduced by attacks that cause loss of access, so both the likelihood of loss of access and the time to recover access must be determined to the extent possible.
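The two quantities named in the last item, the likelihood of losing access and the time to recover it, combine into a steady-state availability estimate. The sketch below uses the standard formula availability = MTBF / (MTBF + MTTR), where MTBF is taken here to mean the average uptime between failures. The function name and the example figures are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time access is available.

    mtbf_hours: mean uptime between losses of access (MTBF), in hours.
    mttr_hours: mean time to recover access (MTTR), in hours.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: one loss of access per 1000 hours of uptime.
    for mttr in (10.0, 1.0, 0.1):
        availability = steady_state_availability(1000.0, mttr)
        print(f"MTTR {mttr:>4} h -> availability {availability:.4%}")
```

With failures this frequent, recovery has to take about an hour or less to reach 99.9%, which is why recovery time deserves as much attention as attack prevention.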

Solutions

  • In the end, each system must determine the level of efficiency and resilience that it desires. Too much caution will miss out on the many small changes that occur every day. Too much recklessness will result in inevitable failure in the long term.

IAM Principles of Resilience based on Gartner

  1. RISK CULTURE - Stop focusing on checkbox compliance, and shift to risk-based decision making.
  2. OUTCOME FOCUS - Stop solely protecting infrastructure, and begin supporting business outcomes.
  3. BETTER FACILITATE - Move from defender to facilitator, and balance protecting with delivering business outcomes.
  4. MAKE WORKFLOW - Move from trying to control information flow to understanding how it flows and the risks involved.
  5. PEOPLE-CENTRIC - Accept the Limits of Technology and Become People-Centric.
  6. INCLUSIVE - Ensure that the greatest good for the greatest number applies to the entirety of society and not just large corporations.
  7. DETECT RESPOND - Stop striving for 100% protection, and invest in detection and response.

Gartner’s researchers predicted that by 2017, 50% of IT spending would occur outside of traditional IT department control. They noted that we are at the intersection of two extraordinary digital trends: the ongoing transformation of digital business and the ever-growing capacity and capability of adversaries.

Avoiding Risk

It might seem paradoxical, but risk avoidance does not lead to resilience. Markus Brunnermeier[2] argues that resilience can serve as the guiding North Star for designing a post-Covid-19 society. Risk is not to be avoided: it is only by taking risks that society achieves breakthroughs, and a society that doesn’t take risks becomes fragile. “Perhaps paradoxically, enduring a small crisis from time to time can be preferable to avoiding them at any cost. A crisis is an opportunity to make needed adjustments.”

Resilience in Complex Adaptive Systems

Operating at the edge of failure - The scientific basis of resilience

Systems are complex...

  • Unexpected behaviors
  • Unexpected responses to interventions
  • New forms of failure
  • Changing (in obvious and not so obvious ways)

Operators are continuously...

  • Monitoring some parts of the system (but never all of them)
  • Exploiting opportunities
  • Estimating the distance to failure (if they are not already operating under time constraints)
  • Reacting to threats (usually because of some outside information)
  • Anticipating future conditions (the human brain is like a time machine, looking backward and forward in quick succession)
  • Learning system features (typically by experience, but hopefully by study)

What are the Boundary Conditions for successful operation - Modified from Rasmussen, 1997

  • ACCIDENT Boundary - cross this and there is a problem that could well be existential; unfortunately, no one knows where it is
  • ECONOMIC FAILURE Boundary - cross this and the business fails
  • UNACCEPTABLE WORKLOAD Boundary - cross this and everyone quits


Richard Cook, MD, Professor of Healthcare Systems, School of Technology & Health, Royal Institute of Technology, Stockholm, Sweden. www.ctlab.org https://www.youtube.com/watch?v=PGLYEDpNu60

Design and Test

  • Most design looks only at the common use cases.
  • Design for Resilience requires use cases at the edge of performance, including attacks by malicious users and misuse by untrained users.
  • Testing for Resilience requires overloading the system both from a load and from an attack perspective.
  • Resilience cannot depend on others working as expected. Maersk's network was devastated when nearly all of its domain controllers, each serving as a backup for the others, were wiped out by a single piece of malware (NotPetya). The company was saved only by the accident that one of the domain controllers was off-line due to a power failure.
  • Part of resilience is how each networked component works as a part of a system.[8] For example, a car that stalls in a garage is of no consequence to anyone but its driver. If it stalls in the middle of an Interstate at rush hour, on the other hand, it can impact the trips of thousands. The ability of the highway system to handle that stalled car is a measure of its Resilience.
  • Measuring Resilience to faults is difficult because faults are inherently non-deterministic. The wiki page on Self-organization describes several examples of power-law distributions in which impacts are commonly small but can unexpectedly be major, like earthquakes that are nearly imperceptible until the "Big One" comes and brings down all of San Francisco.
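The Maersk example above shows why redundancy helps only when replicas fail independently. The following is a rough sketch of that reasoning, assuming each replica is up with the same independent probability and that a single shared event (such as one piece of malware) can disable every replica at once. All names and probabilities are hypothetical.

```python
def system_availability(replica_availability: float, replicas: int,
                        common_mode_failure: float = 0.0) -> float:
    """Probability that at least one replica is up.

    replica_availability: chance a single replica is up (independent failures).
    common_mode_failure: chance that one shared event disables all replicas.
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    for p in (replica_availability, common_mode_failure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    all_down_independently = (1.0 - replica_availability) ** replicas
    return (1.0 - common_mode_failure) * (1.0 - all_down_independently)


if __name__ == "__main__":
    print(system_availability(0.99, 3))         # independent failures only
    print(system_availability(0.99, 3, 0.001))  # with a shared failure mode
    print(system_availability(0.99, 10, 0.001)) # more replicas, same ceiling
```

Under these assumed numbers, the shared failure probability sets a ceiling on availability: going from 3 to 10 replicas changes almost nothing once a common-mode failure is possible, which is the design lesson behind testing from an attack perspective as well as a load perspective.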

See page on Intelligent Design

References

  1. Peter Coy, How a Princeton Economist Teaches Resilience, New York Times (2021-09-27) https://www.nytimes.com/2021/09/27/opinion/resilience-princeton-economist.html
  2. Markus Brunnermeier, The Resilient Society (2021-08-23) Endeavor ISBN 978-1737403609
  3. Igor Linkov et al., Cyber Efficiency and Cyber Resilience CACM 66 No. 4, pp. 33ff. (2023-04)
  4. Nassim Nicholas Taleb, The Black Swan - The Impact of the Highly Improbable (2007) Random House ISBN 978-1-4000-6351-2
  5. Zeynep Tufekci, The Shameful Open Secret Behind Southwest’s Failure New York Times (2022-12-31) https://www.nytimes.com/2022/12/31/opinion/southwest-airlines-computers.html
  6. Farhad Manjoo, How the World's Richest Country Ran Out of a 75-Cent Face Mask (2020-03-26) The New York Times p A22
  7. Justina Lee, Quant Shock That Never Could Happen Hits Wall Street Models, Business Week (2020-11-13) https://www.bloomberg.com/news/articles/2020-11-13/quant-shock-that-never-could-happen-hits-wall-street-models?cmpid=BBD111320_BIZ&utm_medium=email&utm_source=newsletter&utm_term=201113&utm_campaign=bloombergdaily
  8. Ted G. Lewis, The Many Faces of Resilience CACM 66 no. 1 p. 56ff (2023-01)

Other material

  • Wikipedia has a great entry on Ecological resilience which explains many of the interactions to be aware of.