Bayesian Identity Proofing
Bayesian Identity Proofing with Deep Learning
Tom Jones 2017-11-01
There are many instances where some entity on the internet needs identification of the party that is trying to initiate an interchange. By identification we mean the collection of identifiers, attributes, behaviors and inferences that are likely to be collected by any Web Site, which will be the focus of this paper. Sometimes it is sufficient for the identification to survive only for the duration of the interchange, in which case an HTTPS connection will create a suitable identifier. In other cases, where there is a large risk of loss, strong proofing of the real-world identity of the party is needed. This paper will describe the application of Bayesian probability theory and deep learning to the problem of identity proofing at whatever level of assurance the Web Site desires, not just the 3 or 4 levels defined in Federal documents. As seems to be the case with many applications of Bayes’ theorem, the value of many ID “improvements”, like high complexity passwords, is called into question.
Under Bayes’ theorem the proof of an identity (like any hypothesis) can be emended stepwise with the addition of more information, whenever the proof value of that new information is known with some level of certainty.
In Deep Learning the breakthrough concept was back propagation.
While most discussions about authentication of an identity treat it as a single event, the actual process of authenticating a user at a Web Site is a series of events, each emending the probability that enough is known about the identity to allow the next step to occur. While there are as many reasons to acquire knowledge of the user as there are web sites, we will look at one use case: the provisioning of a valuable digital resource to a user. The following list is one possible stream of events leading to the release of the resource to the user in the face of malicious parties seeking to acquire the asset without paying the price required. In this case the resource owner needs to know that the claimant is not known to be in the business of illicit resource acquisition, which has previously been determined to be the greatest threat facing the resource owner.
1. The user contacts the Web Site and an HTTPS connection is established. The resource owner now knows the approximate location of the user and has an IP address that can be compared to an existing list of suspect IP addresses. The HTTPS connection also assures the resource owner that the same user will persist under this consistent connection identifier for the duration of the connection. (If a man-in-the-middle attack is in progress, the “user” is the MITM.)
2. The user selects a resource to be acquired after a period of browsing. The length of time and the effort taken to get to the resource are now known to the resource owner.
3. The user supplies a name or email address to the resource owner, who knows how often authorized users, unauthorized users and malicious attackers come to the site. (The order of 2 and 3 is probably immaterial.)
4. Depending on the authorization needs of the resource selected, the authentication process asks the user for a number of attribute fields. In modern protocols, the user is permitted a choice about which fields to supply. That choice is often overridden by regulations or employment agreements.
5. Continuing the process of acquiring sufficient identification information, the Web Site can either go back to the user or use consent acquired from the user to seek additional information from other attribute providers, for example state ID or benefit providers. In the environment of late 2017 there is little reason for the site to seek any user consent other than the broad consent provided by the user clicking a button labeled “OK” or “I Agree” at some point in the process, concerning some document that they have almost certainly not read, even if the site requires the user to scroll through the document.
6. Once attribute sufficiency or exhaustion is reached, the Web Site will authorize access or terminate the connection. Typically it will be the user who abandons the connection first.
We will represent the probability of a trustworthy identification of the initiating party as Ptrue. It should be clear that this number will always be less than one.
We will represent the identity information known about the party as a vector IN = (i1, i2, …, iN), where there is a truth function f, with a weighting parameter for each assertion i, such that Ptrue = f(IN).
For Bayes’ theorem to work we presume some initial probability Ptrue = f(I1) = some constant, say 50%, as the probability that an initial identity assertion is true. This is meant as a measure of the likelihood that the party is telling the truth about themselves. We will later see that this, like all weighting parameters of the truth function, is subject to emendation over time and experience.
We expect that each Web Site will set an acceptable level of proofing Pmin for each claimant that contacts the site. Note that this level of proofing can be different for different component elements on the site and can be varied for each claimant, typically on some schedule based on the time since the claimant was last proofed to a higher level. It should be clear that when Ptrue > Pmin the claimant’s assertion is accepted by the site. Until that is true the claimant will need to provide additional assertions.
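The acceptance rule above can be sketched as a loop that keeps consuming assertions until Ptrue exceeds Pmin. This is only an illustrative sketch: it assumes each assertion can be summarized by a single likelihood ratio (how much more likely the assertion is from a genuine claimant than from anyone else), and the function name `proof_claimant` and the ratios used are hypothetical, not values from this paper.

```python
def proof_claimant(likelihood_ratios, p_start=0.5, p_min=0.99):
    """Apply assertions in order until Ptrue exceeds Pmin.

    likelihood_ratios: one hypothetical ratio per assertion (assumed known).
    Returns (p_true, assertions_used, accepted).
    """
    odds = p_start / (1.0 - p_start)       # convert the starting probability to odds
    used = 0
    for ratio in likelihood_ratios:
        if odds / (1.0 + odds) > p_min:    # already proofed well enough, stop asking
            break
        odds *= ratio                      # Bayes' rule in odds form
        used += 1
    p_true = odds / (1.0 + odds)
    return p_true, used, p_true > p_min


# Four assertions are available, but the threshold may be reached earlier.
p_true, used, accepted = proof_claimant([3.0, 10.0, 20.0, 50.0])
print(round(p_true, 4), used, accepted)
```

Raising `p_min` for a more sensitive component of the site simply causes more of the available assertions to be requested before access is granted.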
We consider two types of assertion events (e) given a party P and a vector I of length N.
1. The party’s identity is confirmed as true with probability t.
2. The party’s identity is reported as fraudulent with probability f. This event may happen a long time after the connection is terminated, possibly as a result of the discovery of illicit use of the resource. That implies the need for logging the release of the resource (a license) and the ability to tie later misuse of the license to the authentication steps used in its acquisition.
3. It is also possible that the sequence of authentication is prematurely terminated. While that does convey some knowledge that could be treated like #2 above, it is not examined in this paper.
Case 1 is the classical Bayes function. Although a recalculation of the total truth function will also work if it is easier, this function shows how the Web Site’s knowledge of the identity is emended at each step.
Let Pc be the probability of e given the current value of Pold, and let t be the probability of e in the total population. Then Pnew = Pold × Pc / t.
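As a minimal sketch of this update, the function below computes Pnew from Pold. It assumes that t, the probability of the event over all claimants, is obtained by the law of total probability from two conditional probabilities; that decomposition, the function name `bayes_update`, and the numbers used are illustrative assumptions rather than part of the paper.

```python
def bayes_update(p_old, p_e_if_true, p_e_if_false):
    """Return Pnew = Pold * Pc / t for one event e.

    p_e_if_true  : Pc, probability of e when the identity is genuine
    p_e_if_false : probability of e when the identity is not genuine (assumed known)
    """
    # t: probability of e over the whole population of claimants
    t = p_old * p_e_if_true + (1.0 - p_old) * p_e_if_false
    return p_old * p_e_if_true / t


# Starting from the 50% initial value, an event that is nine times more
# likely from a genuine claimant raises the probability.
p_new = bayes_update(0.5, 0.9, 0.1)

# An event that is equally likely either way leaves the probability unchanged.
p_same = bayes_update(0.3, 0.7, 0.7)
print(p_new, p_same)
```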
The entire change in the probability at each step is determined by the term Pc/t, so we need to understand the environment where this value is calculated. Simplistically, if the probability of e given the current value of Pold is not dependent on Pold, then the change term is 1 and that event has no impact on the new probability that the identity is trustworthy. Another way to put this is that the probability of the event occurring in the total population needs to be significantly lower than the probability of the event given the prior tests that went into the value of Pold. Let’s consider the specific case of the correct entry of a password whose complexity is just sufficient to make the probability of entering it correctly in 3 tries without knowing it .001, in other words, not very complex. If the probability of the user getting it wrong 3 times is also low, say .001, then the value of the correct entry of the password, given a general population of 100 million, is:
Pc/t = .999 / .001, or 999. Applied to a population of 100 million (probability 1/100 million), Pnew > .99999, pretty much independently of whatever Pold might have been.
This makes the case that a low complexity password is pretty good protection where password lockout at 3 tries and non-obvious password selection are both applied. Of course, if password lockout is not applied then the discovery of your password is certain given sufficient time. The more complex the password, the less likely it is that the malicious party will bother to try a “brute force” attack against it. That means that the security of any password depends entirely on the capability of the authenticator to protect it from disclosure and to help the user select a good one. It also makes the case that letting web sites with unknown security capabilities have you select a password for them to protect is empirically known to be a bad choice.
In the opposite case, where password complexity is designed to “improve” security, it seems to add very little additional security where the authenticator has good security, because the primary attacks on web sites today are by malicious parties that have acquired the passwords by other means, such as the Equifax breach of details on 145 million people, roughly one half of the people likely to be on the internet in the United States. This makes an attacker, just like an authorized user, highly likely to enter the password correctly on a high value site. From the equation above, as well as from common sense in this case, the correct entry of a password on a high value site has a probability close to one, and hence does very little to improve the likelihood that the user identifier is trustworthy.
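The effect of leaked passwords on the change term Pc/t can be illustrated with a short calculation. This is a sketch under stated assumptions: `leak_fraction` is a hypothetical parameter for the share of attackers who already hold the password, attackers holding it are assumed to enter it as reliably as the real user, and the .999 and .001 figures are reused from the password example purely for illustration.

```python
def password_update_factor(p_old, leak_fraction, p_user_ok=0.999, p_guess=0.001):
    """Return the change term Pc/t for a correct password entry."""
    # Attackers with a leaked copy succeed like the user; the rest must guess.
    p_attacker_ok = leak_fraction * p_user_ok + (1.0 - leak_fraction) * p_guess
    # t: probability of a correct entry over all claimants
    t = p_old * p_user_ok + (1.0 - p_old) * p_attacker_ok
    return p_user_ok / t


for leak in (0.0, 0.5, 1.0):
    print(leak, round(password_update_factor(0.5, leak), 3))
```

As the leaked share approaches one, the factor approaches 1 and a correct password entry stops changing the probability at all, whatever the password's complexity.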
However, the psychology of the level of protection is important in the sense that today users are known to choose passwords that are easy to guess. That makes password look-up tables widely used in attacks. Password complexity rules do help to reduce the occurrence of simple, easy-to-guess passwords, but at great cognitive load on the user. To discourage attackers, they should be led to believe that password guessing is not likely to succeed. That does put some burden on the authentication sites to screen out “bad” passwords, whatever that might mean. One approach is to scan the dark web for such tables and use those tables to screen password selection. Another approach would be to find authentication methods that avoid user-remembered secrets altogether. That approach would certainly lower the cognitive load on the user. However, it has been tried in the past without much success. In the case of web sites where only a small level of assurance is required, passwords are not likely to be supplanted.
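Screening password selection against a look-up table can be sketched very simply. The table contents, the function name `is_acceptable`, and the choice to ignore case and surrounding whitespace are all assumptions made for illustration; a real deployment would load a much larger table and would need to decide its own normalization rules.

```python
# Hypothetical stand-in for a look-up table recovered from known attacks.
KNOWN_BAD = frozenset({"123456", "password", "qwerty", "letmein"})


def is_acceptable(candidate, known_bad=KNOWN_BAD):
    """Reject a candidate password that appears in the look-up table."""
    normalized = candidate.strip().lower()   # treat "Password " like "password"
    return normalized not in known_bad


print(is_acceptable("Password"))            # rejected: matches after normalization
print(is_acceptable("violet-kettle-drum"))  # accepted: not in the table
```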
The last point to consider is the collection of multiple authentication “events” to increase the trustworthiness of the identity. If the authentication site chosen by the Web Site has trustworthy security, random attacks against it are unlikely, as their likelihood of success is low. In that case the best authentication “events” are those with a low probability of success (that is, when the t in Pc/t is low). That is a mathematical way of saying that the most information is carried in messages that are the hardest to predict, or the most “surprising”. Currently sites treat multiple attributes as increasing the probability that an identification is trustworthy. Getting a sequence of data that has all been released in (for example) the Equifax breach is not surprising, since if the attacker has the password, they likely have the other user attributes as well. Several sites, including NIST, promote the concept of Knowledge Based Authentication<ref>Hastings, N. E. Quantifying Assurance of Knowledge Based Authentication. Proceedings of the 3rd European Conference on Information Warfare and Security. 2004.</ref>, which is just such a collection. Where such knowledge attributes are reused they, like one-time pads, lose their value when any sort of leakage occurs. In other words, the first factor in multifactor identification, “What you know”, may continue to work for low assurance identification, but it is inadequate for high levels. There are some attributes that remain hard for attackers to acquire, and hence continue to have a low probability of attack, that will help proof identities. These attributes belong to the other factors of authentication: “What you have” and “What you are”. Extensive efforts at smart cards and biometrics have not been successful in broad deployments to date because of the unacceptable load that they place on the user. It is imperative that new, acceptable authentication factors are deployed.
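The notion of “surprise” used above has a standard measure in information theory: the surprisal of an event with probability p is -log2(p) bits. The sketch below applies that measure to authentication events; the function name `surprisal_bits` and the example probabilities are illustrative assumptions.

```python
import math


def surprisal_bits(p):
    """Information, in bits, carried by an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# An attribute everyone can produce (for example, one already leaked)
# carries no information; a rarely produced one carries a lot.
print(surprisal_bits(1.0))
print(surprisal_bits(0.5))
print(round(surprisal_bits(0.001), 2))
```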
The most promising effort in “What you have” at this stage is to enable FIDO U2F.
It is expected that the third authentication factor, “What you are”, will also need to develop. It should be clear that physical biometrics are not the only way toward a strong identification. The users’ behaviors, medical conditions and other self-reported attributes can easily be added to the vector of user attributes to be evaluated by an identity ecosystem. Each attribute will add to the knowledge collected about the user and help to proof their identification.
Indemnification concerns have been a huge blocker to the adoption of strong authentication by authentication servers.
Case 2 is where an incorrect identification occurs. Whether the incorrect identification is malicious or not, it needs to be treated as an attack and mitigated. If possible, the attack vector should be determined, preferably by machine detection but with human assistance where needed. If the attack vector is known, a targeted update of parameters is possible. If the vector is not known, a generalized back propagation technique can still be applied. The real problem with responses to attacks in machine learning is that the recognition of the attack might occur sufficiently long after the authentication was proofed to be trustworthy that the authentication learning algorithms may have already changed the state of the machine.
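One way such a parameter update could look is sketched below. The paper does not specify the form of the truth function f, so this sketch assumes a single-layer logistic function of the attribute vector, for which back propagation reduces to one gradient step on the log loss. The names `p_true` and `learn_from_report`, the learning rate, and the attribute encoding are all hypothetical.

```python
import math


def p_true(weights, bias, attrs):
    """Assumed truth function: logistic function of the weighted attributes."""
    z = bias + sum(w * x for w, x in zip(weights, attrs))
    return 1.0 / (1.0 + math.exp(-z))


def learn_from_report(weights, bias, attrs, was_genuine, rate=0.1):
    """One gradient step on log loss for a logged authentication."""
    error = p_true(weights, bias, attrs) - (1.0 if was_genuine else 0.0)
    new_weights = [w - rate * error * x for w, x in zip(weights, attrs)]
    return new_weights, bias - rate * error


weights, bias = [1.0, 1.0, 1.0], 0.0
attrs = [1.0, 0.0, 1.0]   # logged session: attributes 1 and 3 were supplied and matched
before = p_true(weights, bias, attrs)
weights, bias = learn_from_report(weights, bias, attrs, was_genuine=False)
after = p_true(weights, bias, attrs)
print(round(before, 3), round(after, 3))
```

Because the update uses the attribute vector logged at authentication time, a late fraud report only lowers the weights of the attributes that were actually relied on in that session; it does not address the problem that other weights may have drifted in the meantime.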
Machine learning has been focused on the individual machine faced with a fairly constant problem to solve. That is not the way human learning works. People’s problem sets vary over time, and they work in groups with other people to solve problems. The solution is then passed from one person to another and from one generation to the next. Given the constantly changing environment in which people need to be identified during their varying activities and problem sets, any identity ecosystem must be able to accept inputs asynchronously and to share what it has learned with other systems and other generations of systems. Even more importantly, the identity ecosystem must be able to report problems that it has not been able to resolve so that new solution designs can be sought. Two specific ecosystems where a common identity solution is not possible come to mind: the medical ecosystem and the governmental ecosystem, even within one federal system like the (partially) United States.
Privacy has not been addressed heretofore, but will be critical to the acceptance of an identity ecosystem. In the body of this paper the case for specialized authenticators has been repeatedly emphasized. Privacy is yet one more reason for limiting user attribute collection to a small number of organizations that are willing and able to protect those attributes. That will at least serve to make verification of privacy compliance more likely. Just like the “Too big to fail” label on large financial institutions, some sort of “Too big to fail” label should be placed on any organization that collects and disseminates user data. Organizations that will not accept the label and the responsibility it carries should be proscribed from user data collection and dissemination.
- Wikipedia, 2017 Comparison of methods. Retrieved from Statistical Proof: https://en.wikipedia.org/wiki/Statistical_proof
- S. Cowley, 2.5 Million More People Potentially Exposed in Equifax Breach 2017-10-02 New York Times https://www.nytimes.com/2017/10/02/business/equifax-breach.html