Wednesday, January 9, 2013

A Quick Probability Conundrum: Follow-up

Last week, I posed the question "if I have twelve cards numbered 1-12, draw three per day without replacement, record their numbers, and then replace them and shuffle, how many days will it take before I should expect to see all twelve?" I was interested in seeing either a simulated or analytical result, and I got convincing ones of each.

Analytical solution

My friend Angelo is an astrophysics/applied math double major, so it's pretty safe to say he's better at math than I am. Here's his analytical solution to the problem, slightly limited by the "three cards per day" restriction.

P(x) = (# of strings of length x that contain each digit between 1 and 12 at least once) / (number of total possible strings of length x)

 = 1 - ((# of strings missing 1 or more digit) / (12x))






Actually, as he points out, this yields the number of cards you'd need to look at, not the number of days--but that's easy enough to fix by dividing by three. Here's the plot of the analytical cdf:
















which appears to cross P(x) = 0.5 around card number 35, or roughly 11 or 12 days. Here's his analytical pdf along with a simulation (n = 10000):
















They're a little off, and the simulation isn't quite smooth at n = 10000, but the agreement is reasonably good. For the "magic number" of P = 0.5, Angelo's analytical solution gives 11.7 days, and the simulated solution gives 12.7. Here's his full write-up.

More simulations

I did a simulation of my own with the following logic: start with an array of twelve zeros, pick three of those zeros at random to become ones, and repeat the process until all the zeros are ones. Then do the whole thing over again, 50000 times. My simulation was done in MATLAB. Here's the distributions I came up with:














I get basically the same shape as Angelo's graphs, though the extra 40000 simulations helps make the pdf a lot smoother. It's most likely to take 11 or 12 days, and the mean value of the pdf is 12.7, just like with Angelo's simulation. Strangely, my cdf crosses P(x) = 0.5 at x = 11.7 days, exactly one day too soon, so it's likely I messed up the integration of the pdf.


I got a nice verification of my simulation from another friend, Forrest, who actually does numerical simulations for a living, so there's a good chance his is far more rigorous than mine. I don't have his simulation code, but after one million simulations, his cdf looks similar enough to mine (note the change in scale on the x-axis):

















Forrest concludes that the "magic number" is somewhere between 11 or 12, putting his best estimate at 11.3.

Conclusion

The analytical solution and all three simulations put the P = 0.5 threshold somewhere between 11 and 13 days, and the mode or most likely occurrence of both pdfs is essentially a tossup between days 11 and 12. If we're looking for a single numerical answer to the question, I'm putting it at day 12: you're most likely to have seen all twelve cards on day 12. (Follow-up question: is it a coincidence that the number of the "critical" day is also the number of cards in the stack?)

Other interesting facts: it is in fact possible to be done on day 4, but it appears to be exceedingly rare. Only once in 50000 trials (or 0.002% of the time) did my simulation see all twelve cards in only four days. It's equally rare to have to spend more than 50 days: one simulation in 50000 (again, 0.002%) took 57 days to complete, but no other simulations took longer than 48 days.

Thanks to everyone who helped solve this problem! I'm both pleased and amazed that I found such an abstract, academic problem that is nevertheless relevant to real-world game design.

3 comments:

  1. It should be pretty easy to calculate the probability of finishing on day four. We start with the number "1", representing the fraction of runs that haven't yet drawn the same number twice. On day 2, we multiply it by (9/12)*(8/11)*(7/10) to represent the probability to draw one of the 9 cards (out of 12) that we haven't yet drawn, times the probability to draw one of the 8 remaining cards (out of 11 now), times the probability to draw one of the 7 remaining cards (out of 10 now). Repeating this for days 3 and 4 gives:

    Probability to win on day 4=
    (9/12)*(8/11)*(7/10)*...
    (6/12)*(5/11)*(4/10)*...
    (3/12)*(2/11)*(1/10)
    =0.00015777777...
    (which is precisely 21/133100)

    That means in 50,000 runs I would expect about 8 runs to finish in fours days, not just 1 run. Did you make sure that three different cards would be chosen each day by your MATLAB code?

    ReplyDelete
  2. That may well be the problem. See what happens when experimental people try to do simulations?

    Interestingly, the central tendencies of my distribution agree pretty well with yours. Maybe that error adjusts the skew more than the mean?

    ReplyDelete
  3. That error wouldn't have very much effect I'm sure (too lazy to simulate it right now but I might get bored in the next few days), but I'm guessing it would skew the whole curve just a little bit to the right. By the way I realized my convention for "N" is off of yours by 1 (what other people might call a "mistake"). So the real answer is 10.3 from simulation I think.

    ReplyDelete