Wednesday, January 23, 2008

What's the probability of at least two persons among 20 to have the same birthday?



I'm an anti-virus engineer. Recently, we've detected a mobile phone worm that spreads by attaching itself to MMSs that it sends to various phone numbers. Our contact at the mobile phone operator harvested about 1000 MMSs. Some MMSs were sent to the same phone numbers.

The question was: does the worm generate the phone numbers randomly or are the phone numbers hard-coded in its code?

A related question is: IF the numbers are randomly generated, which is the probability to have several identical numbers among a set of 1000 numbers?

The variable part in the 11-digit phone numbers consists of 5 digits. We assume that the phone numbers are equally probable.

There are 10^5 equally probable possible phone numbers. Let us consider each phone number as a letter of alphabet that contains 10^5 letters. There are (10^5)^1000 distinct 1000-letter strings that we can write in this alphabet. (To convince yourself, think that the alphabet is our normal 10-digit decimal system and we want to build a 3-letter string. How many distinct strings can we build? 1000. 000, 001, ..., 009, 010, 011, ..., 019, 020, ..., 999.)

How many strings do we have such that each string contains only distinct letters? In order to answer, think how to build such a string. You pick the first letter from the 10^5 available. Then you pick the second from the 10^5-1 available, etc. If you remember your combinatorics classes, then you'll recognize the "variations". How many variations of 1000 elements from a set of 10^5 are there? (10^5) * (10^5 - 1) * ... * (10^5 - 999).

To summarize, we have an alphabet of 10^5 letters, they are equally probable. We have (10^5)^1000 possible 1000-letter strings. These strings are also equally probable. We have (10^5) * (10^5 - 1) * ... * (10^5 - 999) equally probable strings such that each string contains only distinct letters.

Thus, the probability to pick a randomly build a 1000-letter string that contains only distinct letters (i.e. the probability that all phone numbers among the 1000 harvested phone numbers be different) is

(10^5 / 10^5) * ((10^5 - 1) / 10^5) * ((10^5 - 2) / 10^5) * ... * ((10^5 - 999) / 10^5) = 0.0066594036

The probability that at least two numbers are identical among the 1000 numbers is 1 minus the probability computed above, i.e. 0.9933405964.

As I was not very sure of my reasoning, I wrote a small simulation:

#include <stdio.h>
#include <stdlib.h>

#include <strings.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <signal.h>

#define DFLT_N 100000
#define DFLT_K 1000
#define DFLT_MAX_ITER 1000000


static
unsigned int max_iter, iter, equals, N, K;

static
unsigned int *a;

static
double
uniform_distrib(double a, double b) {

return
a + (b - a) * ((double)random() / (double)RAND_MAX);
}


static
void
onUSR1(int dummy) {
printf("%f\n", (double)equals / iter);
}


static
void
parse_cmd_line(int argc, char *argv[]) {

int
c;
char
*endp;
extern
char *optarg;

while
((c = getopt(argc, argv, "hN:K:i:")) != EOF)

switch
(c) {
case
'?':
/* error */

break
;

case
'h':
/* help message */

break
;
case
'N':

errno = 0;
if
(optarg == NULL) {

/* error */

}

N = (unsigned int)strtoul(optarg, &endp, 0);

if
(errno != 0 || *endp != '\0') {

/* error */

}

break
;
case
'K':
errno = 0;

if
(optarg == NULL) {
/* error */

}


K = (unsigned int)strtoul(optarg, &endp, 0);
if
(errno != 0 || *endp != '\0') {

/* error */

}

break
;
case
'i':
errno = 0;

if
(optarg == NULL) {
/* error */

}


max_iter = (unsigned int)strtoul(optarg, &endp, 0);
if
(errno != 0 || *endp != '\0') {

/* error */

}

break
;
}

if
(N == 0)

N = DFLT_N;
if
(K == 0)

K = DFLT_K;
if
(max_iter == 0)

max_iter = DFLT_MAX_ITER;
}


int

main
(int argc, char *argv[]) {

unsigned int
i, picked;
struct
sigaction act;

parse_cmd_line(argc, argv);

srandom(time(NULL));

if
((a = malloc(N * sizeof(unsigned int))) == NULL) {

/* error */

}


act.sa_handler = onUSR1;
act.sa_flags = 0;

sigemptyset(&act.sa_mask);
sigaction(SIGUSR1, &act, NULL);

for
(iter = 0; iter < max_iter; ++iter) {

bzero(a, N * sizeof(unsigned int));
for
(i = 0; i < K; ++i) {

picked = (unsigned int)uniform_distrib(0.0, (double)N);
if
(a[picked] == 1) {
++
equals;

break
;
}

a[picked] = 1;
}
}


onUSR1(SIGUSR1);

return
0;
}



It can be used to answer the question in the title. The simulation says 0.411308. The "theory" says 0.4114383836.

If we were to be pedantic, the fact that we make some observations coherent with the predictions of a model does not imply that the model is indeed true, especially when we can easily imagine other models that make the same predictions. Thus, we called three distinct phone numbers out of the 1000 harvested. Two of them were not allocated. But this second method looks like the ultimate solution to measuring the height of a building with a barometer.

No comments: