Lauradoux Cédric, Equipe Privatics INRIA: Identifiers and Guesswork
Identifiers are found in every type of metadata that we create as well as in data management systems. Unique identifiers make it possible to connect the physical identity of the user to the digital world. It is therefore very important to design these identifiers carefully, as they can leak information. The main tool used to analyze a unique naming scheme is called "guesswork". This is a very important subject in security (password cracking), in cryptography (key search), and in privacy protection. The complexity of guessing a value depends on the precomputation, the computing power, and the memory at the attacker's disposal, but most of all on the probability distribution of the value to be discovered. In this presentation, I will describe the problem of guessing everyday identifiers.
Hello, and thanks to Pascal for inviting me
So I work at INRIA, on the Privatics team, which is a team that works on privacy
Before, we worked on security, and then a sort of consensus was reached, and following various events in the digital world, we refocused our efforts on questions of individual privacy
So this field is pretty wide-ranging, because we are interested in any application which endangers the integrity of someone's personal data, whether it be online or within your own files, such as knowing at what time you get up in the morning, or by measuring your wine consumption whether your wife is pregnant or not
So my presentation will discuss identifiers, identifiers are something nearly essential in computing, they are used quite a lot, as soon as one speaks about a data base,
so these identifiers can be used to track users at the moment they are used
so it is indeed the study of these identifiers which will be the focus of this presentation, and the question to consider is;
these identifiers can be generated from any number of sources of personal information; email address, last name, first name, a computer's MAC address...
and so those individuals who handle/process these identifiers, in order to make them anonymous, or to use them in their own data bases,
and so the question is, working from these identifiers, can we identify which information was useful for generating those identifiers by industry actors?
So to answer this question, we are going to look at a theory called 'guesswork'
this is a sort of scientist's jargon, a word you're not going to find in the dictionary, something even more difficult to translate into French,
I'm sorry I don't have a better translation
so we are going to delve into this way of thinking, guesswork,
after that I will give a brief presentation on theory, which I will apply to two concrete cases, so here we will apply guesswork to determine the 'Age of the captain', if the captain is French,
and then, the second case study that we will look at will deal with MAC addresses, which is an identifier often used currently to track people
So what is an identifier? I think I will go over this slide rather quickly,
so we've got entities that have a known number of attributes, we run these attributes through a grinder, and this gives Identifier 'y'
ok, so this is something we use often in everyday life, and of course in computing, so function F can be as complicated as it can be simple,
For your Social Security number, function F is simply the synthesis of your attributes, your gender, date of birth, your place of birth, and various other information,
and as we will see a little later, function F can be something mathematically much more complicated
identifiers, we would like for them to be unique, each being distinguishable from the rest, that is to say between two identifiers there exists one unlike attribute,
we would also like for the identifiers that are produced by function F to be different,
so once we have these types of properties, we are able to distinguish individual entities and accord them unique identifiers,
and so we will be able to reference them, and give a unique avatar/representation to a physical person,
this is what identifiers are used for most of the time,
and sometimes it happens that, in the place of having truly unique identifiers, they are pseudo-unique, with a collision probability equal to epsilon, a negligible function
Ok so in the next part of my presentation I am going to consider that we are working with unique identifiers, which is what we are looking at most of the time
as citizens what interests us is that these identifiers be 'private'
now what we find in Anglo-Saxon laws, 'private', translatable as 'non-significant / unsignifiable',
so in computing terms, we will define 'private' in terms of an identifier which is based on attributes that are computationally infeasible to recover
so typically in a data base, it is forbidden (without informing CNIL) to directly store one's data; last name, first name, and various other personal data,
however, an industry actor can in fact use private identifiers as long as it is impossible to recover attributes related to an individual from such identifiers
so, the purpose of this presentation is to delve into these identifiers, and we are going to see in fact that unfortunately, the majority of identifiers used in the industry and in use today are unfortunately not private
just to be thorough about all of the properties that can be expected of an identifier, I am going to use once again a word for which I do not have a French translation,
we can have identifiers which are 'accountable', so what does that mean?
it means being able, under certain conditions, to recover the information
someone can answer for their own actions concerning this information, and so this is the idea behind 'accountable' identifiers
So on the one hand, we would like to have private identifiers, which give out no information on an individual, and on the other we would like to have identifiers which, under certain conditions, allow us to find certain information
so these are two facts which are totally contradictory, and unfortunately the European regulation on privacy rights which is currently being established, is steering in this direction; identifiers must be both private and accountable,
which presents many problems for those in the industry to create these types of qualities, as well as scientists, because it is a nearly impossible task
so what kind of evil can be done with identifiers? 'Tracking', which means what - it means following you and your movements, learning what you like to do, and doing so with less than noble intentions,
for example learning what you like to do in order to better sell you a new tennis racket, as you enjoy playing tennis,
be able to discern the fact that you are having an affair, and perhaps engage in blackmail,
there really are many possibilities
'tracking', an activity which saw a huge boom in 2013 in the digital field; here various companies realized that by exploiting information in WiFi protocols, that it was very easy to track individuals,
there are a number of programs in American shopping centers; once someone enters, their Smartphone sends out various signals, so for those who know about WiFi, it is the 'probe request',
so basically here a computer will send out requests with its MAC address to every WiFi network/SSID that it knows,
and if there is a response by an SSID that the computer recognizes, a connection will then be established
so using this mechanism, naturally a Smartphone will permit its user to be tracked in the shopping center, and thus it will be possible to know what sort of shopping someone does, and even what would be better in terms of organizing the shopping center in order to increase sales,
so this is really something that took off in 2013, there were quite a number of start-ups that were based on this type of activity, many of them in the United States, and we are beginning to see some in France,
so I would advise you to turn off your Smartphones while shopping
if not you are going to receive a lot of targeted advertisements from your local shopping center
So what can be done with identifiers? Dedicated malware can be created...there are quite a few unique identifiers in an operating system
and when one wishes to create malware that targets a specific country for instance, easily enough, based on equipment sold in such a country, it is possible to create a virus which affects only one geographic region...
so these are facts which are well known in the study of computer viruses/malware
something else malicious that can be done with identifiers is the anonymization of data
many of those in the industry think that using private identifiers is enough to anonymize a data base
not a chance in fact - this works very poorly, and we will see what effect this has on the privacy of everyday individuals a little later in the presentation
so just to give you an example of why I was particularly interested in exploring identifiers; on our team, we are working on a particular project called 'Mobilitics'
this is a collaborative project between INRIA and CNIL, which is focused on individual privacy and Smartphones
so more precisely we are interested in any information leak which happens during the running of Smartphone apps, on iOS and Android operating systems
Our goal is to study several Smartphone test users at CNIL, and thereby quantify the amount of information which is leaked from Smartphones
in order to do this, we had to develop several applications in order to properly observe our test subjects
and to treat this question, we wanted to know what sort of information was being transmitted by Smartphone apps,
we noticed rather quickly that there were a large number of identifiers being used and which leaked lots of users' personal information
I won't go too much into detail on this project - if you are interested in finding out more about it, there is a website and blog dedicated to this project,
but just to sum up the story, by dissecting the RATP app, we realized that this app collected really a lot of personal information, the user's address book, GPS data, IMEI number, which was anonymized using a completely ineffective function F...
so it is indeed a valid subject; you might say that identifiers seem innocuous, but in fact there is a lot at stake for a number of French public organizations as well as those in the industry
so in order to create private identifiers, it is relatively easy to understand that F will need to have a number of characteristics...
the most obvious of which is that F be a one-way function
so it must be easy to compute, but it must also be difficult to invert
therefore in practice, when private identifiers are developed correctly, cryptographic hash functions are used
I have shown here some of the best known functions in use, MD5, SHA-1, SHA-3...
and generally it's thought that once function F is chosen, it's smooth sailing afterwards...a private identifier has been developed...
so how does this work? For example, with a unique identifier, granted that it is against the law to directly store this information, mainly the law of CNIL,
function F is applied, and the resultant private identifier can be stored in a data base,
there are a number of companies that do this, and so I have taken the liberty of directly copying this content from the company Navizon, a company which works on WiFi tracking,
and it essentially works using the MAC address, which has been classified by CNIL as personal information, which means it is illegal to directly store such information,
so what do they do? They apply a hash function
so the first thing that Navizon claims is that 'hashed data cannot be reverse-engineered'
ok granted it is difficult to reverse-engineer hashed data...everything to this point works out,
but be careful, because it continues to say that 'someone who is authorized or unauthorized, if they access the data base, and they look at the hashed data of the MAC addresses, they will only see long strings of numbers and letters, and would not have any meaning'
so what they are claiming, is that 'ok, so our job is done, we had a unique identifier, we applied a hash function, so no one can now find this information based on the hashed data'
so I intend to show you at the end of this presentation that finding a MAC address from hashed data from the company Navizon on a Smartphone takes about three seconds, it just takes a little effort and some thinking
So we can get started! we've got a private identifier, and we want to find the attribute which allowed us to generate it
so the two attack strategies are simple; either you can invert the one-way function F - I wouldn't recommend doing this, as the functions that I mentioned earlier were developed by pretty skilled cryptographers,
and therefore a full cryptanalysis on this type of function, doing an inversion, I would find it difficult to believe such a thing plausible, even if for one of them it could be possible,
however, what would be much more reasonable to attempt would be to try to guess the input,
so what is it to guess the input?
the goal is to say, we're going to perform some tests, we are going to list every possible value for the input, and then we are going to hash it, and once we have a match between the actual hashed data and the hashed data that we calculated, we have the input value!
so that is not very complicated in terms of the strategy, it is simply an exhaustive search, everyone knows how to do that,
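The guessing strategy just described can be sketched in a few lines. This is only a minimal illustration, assuming a hypothetical function `F` (MD5 of the attribute as text) and a toy input space of 4-digit PINs; neither comes from the talk.

```python
import hashlib

def F(value: str) -> str:
    """Hypothetical identifier function: MD5 of the attribute, hex-encoded."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def guess_input(target: str, candidates):
    """Hash each candidate in turn; return (input, number of trials) on a match."""
    for trials, candidate in enumerate(candidates, start=1):
        if F(candidate) == target:
            return candidate, trials
    return None, None

# Toy input space (an assumption for the demo): every 4-digit PIN.
pins = [f"{i:04d}" for i in range(10_000)]

leaked = F("4821")  # the attacker only ever sees this hash
found, trials = guess_input(leaked, pins)
print(found, trials)
```

The attacker never inverts `F`; the cost is entirely the number of trials, which is what the quantity E(G) measures.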
so in doing this type of experiment, one tries to minimize the expected number of trials done, the quantity E(G)
It's not only the protection of individual privacy that is interesting for this idea of 'guesswork'
it is a very significant practice in cryptography, when systems are being designed, keys can be extrapolated, plaintext can be discovered, or random number generation sources can be identified,
so the important thing here with cryptography and guesswork is that data that needs to be discovered generally has a uniform distribution,
and for guesswork this is the worst case
the metrics used to measure the difficulty of applying guesswork in cryptography are generally Shannon entropy, or min-entropy
I will come back to these definitions a little later in the presentation, which will be needed,
Guesswork is used greatly in security for passwords
I doubt that anyone in this room uses truly randomly generated passwords, longer than 12 characters,
so here we have guesswork which deals with a distribution that is not at all uniform,
it is relatively easy to realize that there exist more and more websites that have security issues dealing with their passwords, and so when such sites are hacked, entire lists of the site's user passwords are collected, and a strong statistical bias in the choices of passwords becomes very clear
here I don't want to go too much into detail, we will see in the following slides, but there are many metrics to try to measure the strength of a password,
it is a subject oft discussed nowadays, to measure the strength of a password remains an open question
So now we're going to establish some definitions
I am going to tell you mainly about Shannon's entropy and min-entropy,
Shannon's entropy measures the amount of uncertainty in a random variable,
perhaps slightly less well known, min-entropy is simply minus the log of the probability of the most likely element, so we can consider that to be the worst case scenario, the element which brings the least entropy to the system,
so how do these help as metrics? well with Shannon it's obvious; Shannon entropy is the average quantity for the random variable, and min-entropy is the worst case
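As a small worked example of the two metrics, using their standard definitions (Shannon entropy as the average of minus log2 p, min-entropy as minus log2 of the largest probability). The two distributions below are made up for illustration.

```python
import math

def shannon_entropy(probs):
    """Average uncertainty in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """Worst-case uncertainty in bits: -log2 of the most likely outcome."""
    return -math.log2(max(probs))

uniform = [0.25, 0.25, 0.25, 0.25]   # illustrative values
biased = [0.70, 0.10, 0.10, 0.10]    # illustrative values

for name, dist in (("uniform", uniform), ("biased", biased)):
    print(name, round(shannon_entropy(dist), 3), round(min_entropy(dist), 3))
```

On the uniform distribution the two values coincide (2 bits each); on the biased one the min-entropy drops well below the Shannon entropy, which is why it is the more pessimistic metric.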
so I spoke to you before about exhaustive searches, which is what exactly?
something which does not require precomputation, requires no memory, but does require a certain amount of time, and so if we've got 'n' values to test, on average it takes n plus one over two attempts
well we can improve this, we can make some compromises between the time and memory, make dictionaries, we will have more precomputations, more memory, but a shorter time when the direct search is performed,
and then, between what I have just presented, exhaustive searches and attack by dictionary, there is what people call the TMTO (time-memory trade-off),
which allow us basically to make a compromise between precomputation, memory and online search,
and the take-away message of all this is that if we define the work as the product of the number of attempts we have to perform and the memory we use, this product (usually) remains constant, whether it be for the TMTO, the dictionary attack or exhaustive search
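The two extremes of this trade-off can be put side by side in a short sketch. It assumes a hypothetical `F` (MD5) and a toy space of 10,000 values; a real TMTO such as a rainbow table sits between these two and is not shown.

```python
import hashlib

def F(value: str) -> str:
    """Hypothetical identifier function (MD5, hex-encoded)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Toy input space, an assumption for the demo.
SPACE = [f"{i:04d}" for i in range(10_000)]

def exhaustive_search(target: str):
    """No precomputation, no memory: pay in online trials."""
    for trials, candidate in enumerate(SPACE, start=1):
        if F(candidate) == target:
            return candidate, trials
    return None, len(SPACE)

# Dictionary attack: hash the whole space once and store every result...
TABLE = {F(candidate): candidate for candidate in SPACE}

def dictionary_attack(target: str):
    """...then each online search is a single lookup."""
    return TABLE.get(target)

target = F("7310")
print(exhaustive_search(target))
print(dictionary_attack(target))
```

The exhaustive search stores nothing but needs thousands of trials; the dictionary needs one lookup but stores one entry per possible input.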
so all of these are techniques greatly used in cryptography and are applicable when trying to discover values which are uniformly distributed
now I will take the liberty of having a quick information theory break
I think that people who use hash functions, who try to develop anonymous (private) identifiers, have just forgotten a simple principle,
if I have a function with 'n' as the input and 'y' as the output, there is a clear relationship, which is the fact that I cannot have more entropy at the end than I did at the beginning
it would seem rather simple what I just said; the majority of people who try to anonymize data aren't convinced of this
so that was our little "information theory intermission"
we will really see a little later, notably on the final slide, that a function on its own cannot add entropy; in the worst of cases, it will diminish the entropy of the input
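This principle can be checked numerically on a toy case. The sketch below assumes an input uniformly distributed over 16 values (4 bits) and pushes it through two deterministic functions: MD5, which has no collisions on these 16 inputs, and a lossy `x % 4`. Both functions are illustrative choices, not taken from the talk.

```python
import hashlib
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a {value: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def push_forward(dist, f):
    """Distribution of f(X) when X follows dist."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)

# Input: 16 equally likely values, i.e. 4 bits of entropy.
X = {x: 1 / 16 for x in range(16)}

h_input = entropy(X)
h_md5 = entropy(push_forward(X, lambda x: hashlib.md5(bytes([x])).hexdigest()))
h_mod4 = entropy(push_forward(X, lambda x: x % 4))

print(h_input, h_md5, h_mod4)
```

The 128-bit MD5 output still carries only 4 bits of entropy, and the lossy function carries less; neither output has more entropy than the input.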
sorry for this subliminal slide on information theory
so I spoke to you just before about exhaustive search and the dictionary attack, which works well on uniform distribution, and in fact, something that I find really interesting, is to look at non-uniform distribution,
how can this situation be improved? I would just like to repeat: the real cost of mounting an attack by exhaustive search can be written using the entropy, as 2 to the power of the entropy, plus 1, all over 2.
it is relatively easy to see that we can do better, with trivial distributions, we really do see that we can do better,
if the distribution is bi-modal, four elements for which the probability is zero, and four others for which the probability is uniform, obviously I know how to do it better,
it is simply a case of exploring this area here and not this one,
so that is a rather trivial example, and it would be more interesting to treat cases less trivial,
so when the values are non-uniform, the best strategy is to enumerate the possible values in order of decreasing probability
it's a really natural solution, it's the solution used in password cracking, the first password tested is '000'
after that is 'AZERTY', so we really start with the most probable possibilities
so if you want to calculate the average number of attempts, it is this quantity here, the sum over 'i' of 'i' times 'p-i', the 'p-i' being the probabilities sorted in decreasing order,
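The average number of attempts under this strategy is easy to compute. The sketch below applies it to the bi-modal example from a moment ago (four values with probability zero, four equally likely) and to a uniform distribution over the same eight values; the function name `guesswork` is ours.

```python
def guesswork(probs):
    """Expected number of guesses when candidates are tried in order of
    decreasing probability: sum of i * p_i over the sorted probabilities."""
    ordered = sorted(probs, reverse=True)
    return sum(i * p for i, p in enumerate(ordered, start=1))

bimodal = [0.25] * 4 + [0.0] * 4   # the bi-modal example: 4 impossible values
uniform8 = [1 / 8] * 8             # same 8 values, no bias at all

print(guesswork(bimodal))    # 2.5
print(guesswork(uniform8))   # 4.5, i.e. (8 + 1) / 2
```

Skipping the impossible values brings the average from 4.5 attempts down to 2.5, and on the uniform case the formula falls back to the exhaustive-search average.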
the question now becomes, can this sum be less than what we had before?
so no luck there, because we know a lower bound, but not an upper bound,
this is a relatively old idea in information theory, which is given here,
I won't spend any time on explaining how this can be obtained,
in fact it is a result that has come to us from a connection between thermodynamics and information theory...and which has been obtained for a certain number of probability distributions,
so this is a pretty interesting result because now at least we know that the lower bound is below the figure we had before,
well this isn't very satisfying yet, but at least we gain some hope that we are doing better than before,
and so the difficulty here is the upper bound, there really isn't any clear way to derive this value, people have been trying to derive it for about twenty years, with little success
here's just an example to realize if it's more effective or not,
the result which is given here is true for geometric distributions, we know full well what that is,
we see here that the distribution is extremely biased, we can see that one element is much more probable than the others,
if we calculate the quantities, here we've got the average number of attempts on a geometric distribution, and there we've got the entropy of a geometric distribution,
and here on a curve, on the blue curve you can see the number of attempts employing a naïve strategy based on entropy,
and then here in red, you can see the curve obtained from numbering the possibilities in decreasing probability
so we don't really gain much, but we gain something
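The comparison between the two curves can be reproduced approximately. This sketch assumes a geometric distribution with parameter `p` truncated to 5000 terms, and compares guessing in decreasing order of probability against the entropy-based estimate (2^H + 1) / 2; the values of `p` are arbitrary.

```python
import math

def geometric(p, n_terms=5000):
    """Truncated geometric distribution: P(k) = p * (1 - p)**k."""
    return [p * (1 - p) ** k for k in range(n_terms)]

def guesswork(probs):
    """Expected guesses when trying values in decreasing probability."""
    ordered = sorted(probs, reverse=True)
    return sum(i * q for i, q in enumerate(ordered, start=1))

def shannon_entropy(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

results = {}
for p in (0.5, 0.25, 0.1):   # arbitrary illustration values
    dist = geometric(p)
    ordered_cost = guesswork(dist)
    entropy_cost = (2 ** shannon_entropy(dist) + 1) / 2
    results[p] = (ordered_cost, entropy_cost)
    print(p, round(ordered_cost, 2), round(entropy_cost, 2))
```

For p = 0.5 the ordered strategy needs 2 attempts on average against 2.5 for the entropy-based estimate: a modest gain, but a gain.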
in practice, if we try to apply this to real distributions, so we're going to see how many attempts it takes, the difficulty to determine the age of the captain,
so in order to find the age of the captain, we are going to consider that the captain is French, so we go to the INSEE website, we get the age pyramid, which gives us the distribution, and with this, we can determine that today, determining the captain's age will require asking 40 questions
so we will start with the elements that have the highest probability,
the elements within the range 40 to 60 years of age, and then we'll go from there,
If we use an entropy-based approach, that will be 2 to the power of 6, divided by 2 to the power of 5 plus 1,
so that means it is above,
ok great, in practice we will reduce the number of attempts,
so we can have some fun by organizing the data in categories of age, so here we see that the younger you are, the fewer questions it takes to discover the age of the captain,
if we know that the captain is older than 70 years old, we know that it will only require 9 attempts to discover his age,
well the 'age of the captain' is just a fun example, now we're going to try to do something a little more realistic, a little more interesting, so we're now going to take a look at the MAC addresses,
which are indeed transformed by the company Navizon, which are hashed by the function MD5,
and according to Navizon, once hashed, no one can discover any useful information,
so here the first observation is that a MAC address is a 48-bit value, which is composed of two 24-bit fields,
one field is known as the 'OUI' and the other as the 'NIC-specific' part,
so the OUI is a manufacturer identifier (Organizationally Unique Identifier), and the NIC-specific part (Network Interface Controller specific) is left to the manufacturer to do with what they like
if one were to use the classic cryptographic method of exhaustive search, 2 to the 48th, on average it requires 2 to the 47th, and that with a powerful graphics card, takes about a day using the password cracker 'Hashcat'
well one day is rather long, and not everyone has a powerful graphics card, so we are going to try to work a little smarter,
it is rather easy to realize that for the OUI numbers (the 24-bit value), there aren't 2 to the power of 24 of them in use, in fact on the IEEE website it is possible to collect all of the OUI codes which are in use,
the manufacturers Apple, Samsung, etc. are assigned OUI codes,
and so we see that it is pretty simple, we are not going to be dealing with 2 to the power of 24 values, but we are rather going to have 52 or so to deal with
so, that's good
that allows us to eliminate a certain number of values right out of the gate,
it seems that I have lost some slides, that's no problem,
and so by using the OUI distribution, we are able to transform this calculation, which would usually take a full day, into a calculation which now takes 3 seconds
and we no longer need a very powerful (and very expensive) graphics card, which not everyone has, a Smartphone is enough, so that is a rather satisfying result,
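The idea of restricting the search to known OUIs can be sketched as follows. Everything specific here is an assumption: the three OUI values are made up (a real attack would load the list published by the IEEE), the MAC is assumed to be hashed as lowercase colon-separated text, and the demo limits the NIC-specific part to 12 bits so that it runs instantly (the real field is 24 bits).

```python
import hashlib

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("ascii")).hexdigest()

def format_mac(oui: int, nic: int) -> str:
    """48-bit MAC as lowercase 'aa:bb:cc:dd:ee:ff' (assumed input format)."""
    raw = f"{oui:06x}{nic:06x}"
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))

def recover_mac(target_hash: str, known_ouis, nic_bits: int = 24):
    """Try only the OUIs known to be in use, then every NIC-specific value."""
    trials = 0
    for oui in known_ouis:
        for nic in range(1 << nic_bits):
            trials += 1
            mac = format_mac(oui, nic)
            if md5_hex(mac) == target_hash:
                return mac, trials
    return None, trials

# Hypothetical OUI list, NOT real vendor assignments.
KNOWN_OUIS = [0x001122, 0xA4B1C2, 0xF0F0F0]

victim = format_mac(0xA4B1C2, 0x000ABC)
hashed = md5_hex(victim)                      # what the database stores
mac, trials = recover_mac(hashed, KNOWN_OUIS, nic_bits=12)  # 12 bits: demo only
print(mac, trials)
```

With the full 24-bit NIC-specific field the inner loop has about 16.7 million iterations per OUI, so the total work scales with the number of OUIs kept in the list rather than with 2 to the power of 24.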
when we discussed this with the folks at Navizon, and they insisted that it required a powerful graphics card, we showed them our Smartphones saying, 'this is all that was needed,' it was much more convincing, in terms of plausibility - no one could say that we used limitless resources/capacities to hack their system
so, the question now becomes, as we see that with a little effort and reflection, we can overcome the calculation challenge,
now the question becomes, how do we build robust private identifiers?
as I have just shown you that just taking a hash function and hashing a MAC address is not a good idea
and those that do this, I can tell you that there really are a lot of them, in numerous French companies, all of which the NDA that I signed prevents me from naming
so, what can be done?
first of all, we've got to try to really understand the probability distribution of the values which are given at the input of the function,
so that is the first thing to study
so as to have an idea of the complexity that a hacker has to negotiate in order to find this identifier,
and the only possible solution, is to add another value which in itself will be very difficult to guess
and it would be ideal for this value to follow a uniform distribution,
because it is the worst case scenario for any hacker,
so what we are going to add in fact is entropy, we want to use random numbers and add them into the input of the function,
so the diagram that I am explaining to you, it is in fact the scheme used to protect your own password
I don't know if you have ever looked at the functions used for passwords; a salt is used, and the salt is not designed to make a password secret, but to impede dictionary attacks, but in principle it is very similar to the diagram that I am explaining here
functions like bcrypt, scrypt, are the types of things that can be used to create private identifiers,
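A minimal sketch of this construction, using `hashlib.scrypt` from the Python standard library (available when Python is built against OpenSSL 1.1 or later). The cost parameters, the 128-bit size of the random value, and the example MAC address are illustrative assumptions, not recommendations from the talk.

```python
import hashlib
import secrets

def private_identifier(attribute: str, random_value: bytes) -> str:
    """Derive an identifier from an attribute plus a hard-to-guess random value.
    Cost parameters n, r, p are small illustrative choices."""
    digest = hashlib.scrypt(
        attribute.encode("utf-8"),
        salt=random_value,
        n=2**10, r=8, p=1,
        dklen=32,
    )
    return digest.hex()

random_value = secrets.token_bytes(16)   # 128 bits from the OS random source
mac = "a4:b1:c2:00:0a:bc"                # hypothetical MAC address

ident = private_identifier(mac, random_value)
print(ident)
```

The same attribute with the same random value always yields the same identifier, so records can still be linked; an attacker who does not know the random value has to guess it on top of the MAC address, and one who learns it is back to guessing the MAC address alone.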
so you might tell me, 'adding random numbers', fine, but if you take a moderately skilled engineer and give him free rein, he will probably use the time function;
this is just an observation, but very often people add the time, expressed in seconds or milliseconds, normally it would work...
well that actually happened to us, on a couple of applications, programmers added Posix time, the amount of time in seconds since the year 1970,
this is a 32-bit value, so at best we can hope that this would add a factor of 2 to the power of 32 to the difficulty of guessing the value,
the problem here though is that first of all the uncertainty is not necessarily 32 bits, if the identifier has been recently generated, the uncertainty you have on the value is simply the number of seconds elapsed in the last year, and there are about 2 to the 25 seconds in a year,
so from 2 to the 32 down to 2 to the 25, we have already eliminated a number of attempts for any potential attack,
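Both the arithmetic and the attack can be illustrated briefly. The sketch assumes a hypothetical identifier built as MD5 of the MAC address concatenated with the POSIX time, an attacker who already knows the MAC address, and a fixed "now" so that the run is reproducible; the demo searches one day rather than a full year.

```python
import hashlib
import math

SECONDS_PER_YEAR = 365 * 24 * 3600
bits_per_year = math.log2(SECONDS_PER_YEAR)
print(round(bits_per_year, 2))   # just under 25 bits, not 32

def make_identifier(mac: str, timestamp: int) -> str:
    """Hypothetical scheme: MD5 over 'mac|posix_time'."""
    return hashlib.md5(f"{mac}|{timestamp}".encode("ascii")).hexdigest()

def recover_timestamp(target: str, mac: str, now: int, window: int):
    """Walk backwards from 'now', most recent second first."""
    for age in range(window):
        candidate = now - age
        if make_identifier(mac, candidate) == target:
            return candidate, age + 1
    return None, window

NOW = 1_400_000_000                  # fixed reference time for the demo
mac = "a4:b1:c2:00:0a:bc"            # hypothetical MAC address
target = make_identifier(mac, NOW - 5000)

timestamp, trials = recover_timestamp(target, mac, NOW, window=86_400)
print(timestamp, trials)
```

Searching from the most recent second backwards is the same "most probable first" ordering as before: a recently generated identifier falls after a few thousand hashes.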
so what you must understand, you might want to point out to me that I am just shaving off bits of the bigger problem, a factor of 2 to the 7 here, a factor of 2 to the 18 there,
so for those that work in performance, gaining a factor of 2 to the 18 is really incredible, for cryptographers it's not necessarily very impressive,
what is interesting however is to see how we can begin to try to save a maximum on a first value, and then a second value, all of that adds up,
and for attacks on 128-bit values, 2 to the 128th...once we go beyond 2 to the 80, there isn't a machine (in the public sphere anyway) that is capable of calculating at such a level,
I'd like to leave that open anyway, as perhaps there are those who are better at this than us, and who really have a lot of money,
128-bit identifiers really are of a size that is difficult to solve, such a size should be sufficient to prevent a breach
and indeed there really are a lot of identifiers, which are developed from MAC address, so 48-bits of a MAC address, so we can't really transform that into much,
time, 32-bits which isn't really much either, and the two or three other quantities which are easily predictable and which follow a highly-biased distribution,
and with 128-bit identifiers, it is very easy to extract the MAC address and the moment at which they were generated,
so it's really important to have this attitude of 'winning every little bit' on each value, because it's what helps to short-circuit attacks when dealing with the combination of various elements
so to make truly private identifiers we need truly random numbers
is there anyone in the room that can give me a truly random number?
(**no response - silence**)
so this is the single problem in cryptography, everything rests on the random number, and when we ask someone to produce one, we have a problem,
this problem of random numbers and private identifiers is something very important for personalized medical files, because in these files there is something called the 'INS', there is the 'INS C' and the 'INS A',
so the INS A is an identifier which is used to identify you in the hospital setting, and is calculated from a wide array of personal information, notably one's social security number, and a whole other slew of identifiers,
so you can imagine that this type of identifier would be hashed by a cryptographic hash function, but no luck there, this type of identifier has no security, no privacy, as we are able to, working from the identifier, find all of the related information,
and in fact the objective behind the personalized medical file initiative, is to replace this method with one that employs random numbers,
so the calculative method worked well, because working from your health card (carte vitale), we were able to discover your identifiers, and your general practitioner, in their database, will have these identifiers,
if we have to use random numbers, it would mean that there would need to be a server somewhere in France that the doctor would have to access in order to obtain your own identifiers,
at the moment however, based on the latest discussions that I have had with people working on the personalized medical file project, no such server exists, and that we are still working with calculated identifiers,
so there you go - there is the conclusion of my presentation on identifiers, if you have any questions, I will try to answer them