Birthday Paradox

Copyright © Karl Dahlke, 2022

A high school math class has 27 students. The teacher asks each student in turn for his or her birthday (just the day not the year), and writes them on the board. Soon a duplicate appears. Yesw, two students out of 27 are likely to have the same birthday, even though there are 365 days to choose from. How is this possible?

This is called the birthday paradox, though it is not as mysterious as the word paradox might suggest. 23 people gives you an even chance of finding a duplicate birthday. A few more, and the odds are definitely in your favor.

The math is easy if you turn the problem around. What are the odds that every birthday is unique - that there are no duplicates? The first one is free. The second person has a new birthday with probability 364/365. The third person has a new birthday with probability 363/365, then 362/365, and so on. Multiply these fractions together and the odds steadily decrease. By the time you get to 343/365, the ratio is 49.27%. With 23 people you have a 50 50 shot at finding a duplicate birthday. Move up to a class of 27 and there is a duplicate birthday with probability 62.7%.

If you have a roulet wheel with n different numbers on it, so that each spin of the wheel selects a number at random, you need to spin the wheel approximately sqrt(n) times to get a duplicate.

A related question: how many people do you neeed to see all the birthdays? Clearly you need at least 365, and intuition says you'll need a lot more than that. In fact you need about 365 times log(365), for an even chance of seeing all the birthdays. This is about 2,153 people.

The math is rather complicated, and beyond the scope of this book. I'll run through the first couple of cases, and perhaps we can see the pattern.

Let's illustrate with marbles. There are k marbles in a drawer, of different colors, you pull one out at random and put it back; what is the expected number of trials to see all k marbles?

One marble, it's green, one trial, you're done. k = 1: expected value = 1.

Two marbles, first one is red (let's say), 2 trials reveals the green marble with probability one half, green marble appears on the third trial (red red green) with odds 1/4, green marble on fourth trial with odds 1/8, and so on. Bring up a calculator, like bc, in fact you can paste this right into bc (don't forget to say scale=5):

2/2 + 3/4 + 4/8 + 5/16 + 6/32 + 7/64 = 2.859.

The actual answer is 3, but let's prove it. We want the sum, starting at n = 2, of n times ½ to the n-1. This is the sum, starting at 2, of nxn-1, evaluated at x = ½. Now for a beautiful calculus trick! The sum of nxn-1 is the derivative of the sum of xn. Mathematicians have died and gone to heaven so we can exchange differentiation and summation, even for an infinite sum. The sum of xn from 0 to infinity is 1/(1-x). Differentiate 1/(1-x) and get 1/(1-x)2. That will equal the sum 0 + 1 + 2x + 3x2 + 4x3 + 5x4 … Substitute x = ½ in 1/(1-x)2 and get 4. But n starts at 2, so subtract away the lead terms, 0 + 1, and get 3. There you have it, you can expect to see both colors by three trials. Sometimes 2, sometimes more than 3, but usually 3. k = 2: expected value = 3.

Move on to three marbles. The last color seen, at trial n, is, without loss of generality, green. It was all red and blue before. For shorthand let p = 2/3 and q = 1/3. Has to be blue or red up to n so pn-1. But we also didn't see just red or just blue, so subtract away twice qn-1. This is multiplied by n, then sum over n. Using the same calculus trick, sum of npn-1 is 9. First three terms trials 0 1 and 2 don't count, you have to pull out at least 3 marbles. Subtract away 1 + 2p, leaving 20/3. The sum of nqn-1 is 9/4. Subtract away 1+2q and get 7/12. This is doubled so 7/6. Subtract from 20/3 and get 33/6. That's 5 and a half trials to see all three colors. k = 3: expected value = 5.5.

Move on to k = 4, hoping for a pattern. Last marble green: red blue and yellow up to that point, sum of n times ¾n-1, equals 16. Subtract lead terms 1 + 2×¾ + 3×¾2, leaving 11 + 13/16.

We counted just red and blue, and shouldn't have, and also red and yellow, and blue and yellow. Subtract 3 times the sum of n×½n-1. The sum is 4, minus lead terms 1 + 2×½ + 3×½2 yields 5/4. Multiply by three, and subtract from 11 + 13/16, and get 8 + 1/16.

All red was counted in the first sum, then subtracted away twice, that is, overcounted, in the second sum. We want it to be subtracted just once, not twice, so add it back in, and same for yellow and for blue. Three times the sum of n×¼n-1. Sum is 16/9, minus lead terms 1 + 2×¼ + 3×¼2 yields 13/144. Multiply by 3 and add it back in and get 8 and a third. As you go past 8 trials, you can expect to see all four colors.

the pattern is k times the sum of the first k reciprocals.

k = 1: 1 times 1/1 = 1

k=2: 2 times (1/1 + 1/2) = 3

k = 3: 3 times (1/1 + 1/2 + 1/3) = 5.5

k = 4: 4 times (1/1 + 1/2 + 1/3 + 1/4) = 8 and a third

The sum of reciprocals, up to k, is approximated by log(k), and that's how we get our answer.