Fun with Excel #18 – The Birthday Problem

Meeting someone with the same birthday as you always seems like a happy coincidence. After all, with 365 (366 including February 29th) unique birthdays, the chances of any two people being born on the same day appear to be small. While this is indeed true for any two individuals picked at random, what happens when we add a third, a fourth, or a fifth person into the fray? At what point does it become inevitable that some pair of people will share a birthday?

Of course, the only way to guarantee a shared birthday among a group of people is to have at least 367 people in a room. Take a moment to think about that statement, and if you’re still stumped, read this. Now that we know 100% probability is reached with 367 people, how many people would it take to reach 50%, 75%, or 90% probability? If you think the answer is 184, 275, and 330, then you would be quite wrong. Here’s why:

Let’s assume that all birthdays are equally likely to occur in a given population and that leap years are ignored. To paint a more vivid picture in our minds, let’s further assume that we have a large room and that people are entering the room one at a time while announcing their birthdays for everyone to hear. The first person enters the room and announces that his/her birthday is January 1st (we can choose any date we want without loss of generality). The second person has a 364/365 probability of having a different birthday from the first person and therefore a 1 - 364/365 probability of sharing a birthday with him/her. The probability that the first three people all have different birthdays is (364/365) × (363/365), so the probability that at least two of them share a birthday is 1 - (364/365) × (363/365). Likewise, the probability that the first four people all have different birthdays is (364/365) × (363/365) × (362/365), so the probability that at least two of them share a birthday is 1 - (364/365) × (363/365) × (362/365). To generalize, the probability that at least two people in a group of n share a birthday is:

P(n) = 1- \frac{364}{365} \times \frac{363}{365} \times \frac{362}{365} \times \cdots \times \frac{365-n+1}{365}
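
To make the formula concrete, here’s a minimal Python sketch (my own, not part of the original workbook) that evaluates P(n) and finds where each probability threshold is first crossed:

```python
# A minimal sketch: P(n) is the probability that at least two of n people
# share a birthday, assuming 365 equally likely birthdays and no leap years.

def p_shared(n):
    prob_all_distinct = 1.0
    for k in range(n):
        prob_all_distinct *= (365 - k) / 365
    return 1 - prob_all_distinct

for threshold in (0.50, 0.75, 0.90, 0.999):
    first_n = next(n for n in range(1, 400) if p_shared(n) > threshold)
    print(f"P(n) first exceeds {threshold:.1%} at n = {first_n}")
# -> 23, 32, 41, and 70 people, respectively
```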

Note that the yellow series in the above graph rises much faster than a straight line, with the probability reaching 50% at just 23 people. 75% and 90% probability are reached at 32 and 41 people, respectively. By the time 70 people are in the room, there is a greater than 99.9% chance that two individuals will have the same birthday!

As the number of people increases further, the growth of P(n) tapers off, with each additional person eventually providing less incremental probability than the one before. Interestingly, it is the 20th person who provides the greatest incremental probability, as seen in the above table.

In contrast, the probability that at least one of the n people has one specific birthday (say, your own) is given by the much simpler equation:

P_1(n) = 1 - \left( \frac{364}{365} \right)^n

This relationship, which is highlighted by the green series in the graph, grows at a much slower rate than the yellow series. In comparison, it takes 253 people for P_1(n) to exceed 50%.
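
A companion sketch, under the same assumptions:

```python
# P1(n): probability that at least one of n people has one specific birthday.

def p_specific(n):
    return 1 - (364 / 365) ** n

first_n = next(n for n in range(1, 1000) if p_specific(n) > 0.5)
print(first_n, round(p_specific(first_n), 4))  # 253 people, ~0.5005
```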

Testing Our Assumptions

One key assumption we made in the above exercise was that all birthdays (aside from February 29th) occur with equal probability. But how correct is that assumption? Luckily, Roy Murphy has run the analysis based on birthdays retrieved from over 480,000 life insurance applications. I won’t repeat verbatim the contents of his short and excellent article, but I did re-create some charts showing the expected and actual distribution of birthdays. The bottom line is that the actual data show more variation (including very apparent seasonal variation by month) than what is expected through chance.

Implications for Birthday Matching

Now that we know that birthdays in reality are unevenly distributed, it follows that matches should occur more frequently than we expect. To test this hypothesis, I ran two Monte Carlo simulations with 1,000 trials each to estimate the minimum number of people required to produce a matching birthday: the first based on expected probabilities (each ordinary birthday with equal likelihood of 1/365.25 and February 29th with likelihood of 1/1461) and the second based on actual probabilities (sourced from the Murphy data set).
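
For reference, here’s a rough sketch of the “expected probabilities” simulation (the weights, seed, and function names are my own stand-ins; the actual-probability version would simply swap in the empirical frequencies from the Murphy data set):

```python
import random

# 365 ordinary days each with probability 1/365.25 (= 4/1461), plus
# February 29th with probability 1/1461.
DAYS = list(range(366))                    # 0..364 ordinary days, 365 = Feb 29
WEIGHTS = [4 / 1461] * 365 + [1 / 1461]

def people_until_match(rng):
    """Add people one at a time; return the head count at the first shared birthday."""
    seen = set()
    while True:
        day = rng.choices(DAYS, weights=WEIGHTS)[0]
        if day in seen:
            return len(seen) + 1
        seen.add(day)

rng = random.Random(42)
trials = [people_until_match(rng) for _ in range(1000)]
print(sum(trials) / len(trials))           # roughly 25, in line with the results below
```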

Note that the distributions of both simulations are skewed positively (i.e. to the right). The results appear to corroborate our hypothesis, as evidenced by the gray line growing at a faster rate than the yellow line in the above graph. Indeed, the average number of people required for a birthday match is 24.83 under the simulation using actual probabilities, slightly lower than the 25.06 using expected probabilities. However, the difference is not very significant; therefore, our assumption of uniformly distributed birthdays works just fine.

As always, you can find my work here.

Fun with Excel #17 (ft. Python!) – The Beauty of Convergence

Part I: Everything is Four

A friend recently told me that “everything is four.” No, he wasn’t talking about the Jason Derulo album.

Suppose you start with any number. As an example, I’ll pick 17, which yields the following sequence: 17 -> 9 -> 4. Here’s a slightly more convoluted one: 23 -> 11 -> 6 -> 3 -> 5 -> 4. Figured out the pattern yet?

The Answer: for any number you choose, count the number of characters in its written representation. In our first example, the word “seventeen” has 9 characters, and “nine” has 4 characters. In our second example, “twenty three” has 11 characters (not counting spaces), “eleven” has 6 characters, “six” has 3 characters, “three” has 5 characters, and “five” has 4 characters. Try this with any number (or any word, in fact) and you’ll always end up with the number 4 eventually. This is because “four” is the only number in English that has the same number of characters as the value of the number (4) itself.
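
Here’s a short Python illustration; I’m leaning on the third-party num2words package to spell out the numbers, which is a stand-in rather than anything from the original project:

```python
from num2words import num2words  # pip install num2words (a stand-in choice)

def letters(n):
    """Number of letters in the English spelling of n (spaces and hyphens ignored)."""
    return sum(c.isalpha() for c in num2words(n))

def sequence(n):
    """Repeatedly count letters until the fixed point 4 is reached."""
    seq = [n]
    while seq[-1] != 4:
        seq.append(letters(seq[-1]))
    return seq

print(sequence(17))  # [17, 9, 4]
print(sequence(23))  # [23, 11, 6, 3, 5, 4]
```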

This occurrence is not unique to the English language: Dutch, Estonian, German, and Hebrew also “converge” to 4, while Croatian, Czech, and Italian converge to 3. Other languages may end up converging to more than one number, or end up in an infinite loop involving two or more numbers.

I decided to explore this phenomenon in more detail by examining how the series converges over a large set of natural numbers (1 through 10,000). Since Excel is unable to graph large data sets efficiently, I needed to relearn some basic Python to help me with this particular project. This was going to be fun…

Me Programming in a Nutshell

Anyway, the important takeaway from all of this is that I eventually succeeded. Look at this beautiful wedge:

In this chart, the x-axis represents the number of iterations or steps in a particular series (you can think of it as “time”), while the y-axis represents the value itself. Every point from (0, 1) to (0, 10000) is on the chart, and they are the starting points for the first 10,000 natural numbers. For example, going from the point (0, 17) to the point (1, 9) is one iteration. The points (0, 17), (1, 9), and (2, 4) represent a series (for the starting integer 17), and every series terminates once 4 is reached (in this case, 2 iterations/steps are required to reach 4).

With that explanation out of the way, there are a few observations we can make about the above chart:

  • Convergence occurs in a fairly uniform manner. On average, the convergence is relatively well-behaved and progresses in a decreasing fashion. Note that I say “on average,” because we know that 1, 2, and 3 are the only natural numbers whose first iterations lead to a larger number (1 -> 3,  2 -> 3, and 3 -> 5). However, every natural number greater than 4 will lead to a smaller number, and given how the English language works…
  • Convergence occurs very quickly. English is pretty efficient when it comes to representing numbers in written form. Larger numbers will in general require more characters, but not that many more. For example, among the first 10,000 natural numbers, the “longest” ones only require 37 characters (e.g. 8,878 being one of them). This leads us to our second chart…

This shows the sum of all 10,000 series over time. At Time 0, the sum is 50,005,000, but that drops to 292,407 after just one iteration. After Time 6, each of the first 10,000 series will have converged to 4 and terminated. If we define “stopping time” as the number of iterations/steps it takes for a series to reach 4, then the stopping times of the first 10,000 natural numbers are shown below (along with a histogram of how often each stopping time occurs):

The average stopping time is 4.4, with a standard deviation of 1.2. Furthermore, the vast majority of stopping times are 3, 5, or 6. Note the interesting pattern formed in the first chart by the series that have a stopping time of 4.
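
For completeness, here’s how the stopping times could be tallied (again using num2words as a stand-in):

```python
from collections import Counter
from num2words import num2words

def letters(n):
    return sum(c.isalpha() for c in num2words(n))

def stopping_time(n):
    """Iterations of letter-counting needed for n to reach 4."""
    steps = 0
    while n != 4:
        n = letters(n)
        steps += 1
    return steps

times = [stopping_time(n) for n in range(1, 10001)]
print(sum(times) / len(times))  # average stopping time (~4.4 per the text above)
print(Counter(times))           # histogram of stopping times
```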

Part II: The Collatz Conjecture

Truth be told, “Everything is Four” feels a bit gimmicky for a mathematical rule (well, because it technically isn’t one), so here’s a rule that is better defined: start with any positive integer n. If n is even, then divide it by 2 (n / 2). If n is odd, then multiply it by 3 and add 1 (3n + 1). Continue this process indefinitely.

The Collatz Conjecture states that for any given n, the sequence you get from following the above rules will eventually converge to 1. Note that it’s called a conjecture, meaning that despite being proposed in 1937, it remains unsolved!
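
Here’s a quick sketch of the rule and the resulting stopping times (a fresh re-implementation, not the exact script behind the charts):

```python
def collatz_stopping_time(n):
    """Number of iterations for n to reach 1 under the Collatz rule."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

times = {n: collatz_stopping_time(n) for n in range(1, 10001)}
print(max(times, key=times.get))  # 6171, the slowest starter below 10,000
print(times[6171])                # 261 iterations
```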

Now, let’s see how the Collatz sequence converges over the first 10,000 natural numbers. I’ve included 3 charts to show how things look at different “zoom” levels.

A few observations:

  • Convergence does not occur in a uniform manner. The chart almost looks like many different seismographs stacked on top of each other. While each series does eventually reach 1, the path it takes appears to be somewhat random and not well-defined at all, filled with sudden spikes and collapses over time. In any event, it’s a heck of a lot more interesting than our first sequence. Moreover, relative to our first sequence…
  • Convergence takes a long time. Recall that in our first sequence, no series among the first 10,000 natural numbers lasted for more than 6 iterations before converging to 4. In the Collatz sequence, however, the number 6,171 takes 261 iterations to reach 1.

This chart shows the sum of all 10,000 series over time. At Time 0, the sum is 50,005,000, but that spikes to 87,507,496 at Time 1 before dropping to 59,381,244 at Time 3. The beginning of the chart looks a bit like a failing cardiogram, and things quickly get weird after that. Of course, the sum eventually reaches 0, but the way it decays appears random. The stopping times of the first 10,000 natural numbers are shown below:

Wow! The first chart is really something, isn’t it? It certainly doesn’t appear to be 100% random, and the fact that there seems to be some structure to the stopping times of the Collatz sequence could be a sign that a proof of the Collatz Conjecture can eventually be found. The average stopping time is 85.0, with a standard deviation of 46.6. Moreover, with a median of 73 and a mode of 52, the distribution of the stopping times appears to be right-skewed. An examination of the same chart featuring the first 100 million natural numbers confirms this.

Anyway, there wasn’t really a point to this post other than to show that there is often a lot of beauty hidden under the surface of mathematics.

Let me know if you would like to see more posts like this!

You can find my backups for both the data and the Python code here.

Fun with Excel #16 – Rigging Live Draws: The Emirates FA Cup

The Fifth Round Draw of the 2016/17 Emirates FA Cup was rigged.

Bold statement (literally), although that sentence probably meant nothing to anyone who doesn’t follow English football (read: soccer) and the FA Cup in particular.

A quick introduction to the FA Cup competition, courtesy of Wikipedia (emphasis mine):

The FA Cup, known officially as The Football Association Challenge Cup, is an annual knockout association football competition in men’s domestic English football. First played during the 1871–72 season, it is the oldest association football competition in the world. For sponsorship reasons, from 2015 through to 2018 it is also known as The Emirates FA Cup.

The competition is open to any eligible club down to Level 10 of the English football league system – all 92 professional clubs in the Premier League and the English Football League (Levels 1 to 4), and several hundred “non-league” teams in Steps 1 to 6 of the National League System (Levels 5 to 10). The tournament consists of 12 randomly drawn rounds followed by the semi-finals and the final. Entrants are not seeded, although a system of byes based on league level ensures higher ranked teams enter in later rounds – the minimum number of games needed to win the competition ranges from six to fourteen.

In the modern era, only one non-league team has ever reached the quarter finals, and teams below Level 2 have never reached the final. As a result, as well as who wins, significant focus is given to those “minnows” (smaller teams) who progress furthest, especially if they achieve an unlikely “giant-killing” victory.

It’s no secret that when it comes to the FA Cup, “giant-killing” victories are more exciting to the average viewer, and therefore better for TV ratings. As a result, the tournament organizers are incentivized to create as many “minnow-giant” match-ups as possible. Specifically, this means matching up teams from the top level of the English football league system (more commonly known as the English Premier League, or EPL) with teams from lower levels (2nd Tier = Championship, 3rd Tier = League One, 4th Tier = League Two, 5th Tier = National League, and so on). While match-ups in the first 12 rounds of the tournament are determined using “randomly drawn” balls, it has been shown that such live draw events can be effectively rigged by cooling or freezing certain balls.

This year’s FA Cup Fifth Round Draw provided an interesting case study to test the rigging hypothesis, because out of the 16 teams going into the Fifth Round, 8 of them were from the EPL (Tier 1), while the remaining 8 were all from lower divisions. Coincidentally, the 8 EPL teams just happened to get drawn against the 8 non-EPL teams, conveniently leading to the maximum number of 8 “minnow-giant” match-ups. This result should seem suspicious even if you are not familiar with probability theory, but to illustrate just how unlikely such a result is, I will walk through the math.

In order to calculate the probability of the aforementioned result, we first need to figure out the total number of ways the 16 teams can be arranged into pairs. As with most problems in mathematics, there is more than one way to get there, but perhaps the most intuitive one is this: take one of the 16 teams at random. That first team can be paired up with 15 possible other teams. After a pair is made, 14 teams will remain. Again, we take one of the 14 teams at random. This team can be paired up with 13 possible other teams. By repeating this logic, we see that there are a total of 15×13×11×9×7×5×3×1 = 2,027,025 unique sets of pairings. It turns out that mathematicians already have a function that expresses this exact result: the double factorial (written n!!). Therefore, we can generalize that for any group of n objects (with n even), the number of unique pairings is equal to (n-1)!!

To calculate the total number of ways to draw exactly 8 “minnow-giant” match-ups, we can imagine putting all 8 of the EPL teams in a line. Since we are looking to match the EPL teams one-to-one with the non-EPL teams, the question becomes: how many different ways can we line up the non-EPL teams so that they are paired up with the EPL teams? The answer to that is simply 8x7x6x5x4x3x2x1=8!=40,320. It is important to understand why we keep the order of the EPL teams unchanged while we only change the order of the non-EPL teams; otherwise, we would be grossly over-counting!

The probability of drawing exactly 8 “minnow-giant” match-ups is therefore 40,320/2,027,025=1.99%, or just a tad under 2%! To verify this, I ran a Monte Carlo simulation involving 50,000 trials, of which 961 trials ended up with exactly 8 “minnow-giant” match-ups, or 1.92%. The below table and chart also show the theoretical probabilities of drawing “minnow-giant” match-ups, for 0 ≤ n ≤ 8. (Bonus Question: Can you convince yourself why it’s impossible to draw an odd number of “minnow-giant” pairs among a group of 16 teams?)
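
To sanity-check both numbers, here’s a rough sketch of the calculation and the simulation (team labels and the seed are placeholders of my own):

```python
import math
import random

def double_factorial(n):
    return math.prod(range(n, 0, -2))

total_draws = double_factorial(15)               # (16 - 1)!! = 2,027,025
all_mixed = math.factorial(8)                    # 8! = 40,320 draws with 8 EPL/non-EPL pairs
print(all_mixed / total_draws)                   # ~0.0199, i.e. just under 2%

def mixed_pairs_in_random_draw(rng):
    """Shuffle 16 teams and count how many adjacent pairs are EPL vs non-EPL."""
    teams = ["EPL"] * 8 + ["non-EPL"] * 8
    rng.shuffle(teams)
    return sum(teams[2 * i] != teams[2 * i + 1] for i in range(8))

rng = random.Random(0)
trials = [mixed_pairs_in_random_draw(rng) for _ in range(50_000)]
print(sum(t == 8 for t in trials) / len(trials))  # also ~0.02
```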


But wait, it gets even better. Out of the 8 non-EPL teams, 4 teams were from the Championship (2nd Tier league), 2 teams were from League One (3rd Tier), and 2 teams were from the National League (5th Tier). Arsenal, which has been sponsored by Emirates since 2006, ended up drawing Sutton United, one of only two teams (the other being Lincoln City) from the National League (5th Tier). Now, what are the chances that the team that shares a sponsor with the competition itself ends up drawing one of the two easiest (in theory) match-ups available?

The number of ways for Arsenal to draw a National League (5th Tier) team (i.e. either Sutton United or Lincoln City), without any restrictions on how the other match-ups are drawn, is 270,270. We arrive at this number by first assuming Arsenal and Sutton United are already paired off, thus leaving 14 teams remaining. Those 14 teams can be paired off in 13!! = 135,135 ways without restriction. We can repeat the same reasoning for an Arsenal/Lincoln City pair. Therefore, we double 135,135 to arrive at 270,270. This yields a theoretical probability of 270,270/2,027,025 = 13.33% (Monte Carlo resulted in 6,620/50,000 = 13.24%), or exactly 2 in 15. However, this is only the probability of Arsenal drawing a 5th Tier team with no other match-up restrictions. In reality, there were already 8 “minnow-giant” match-ups drawn in the first place.

Therefore, the question becomes: what is the probability that 8 “minnow-giant” match-ups are drawn AND Arsenal draws a 5th Tier team? We already know there are 40,320 possible draws satisfying the first part of the requirement, so satisfying both parts must result in a number smaller than 40,320. Think of it like this: we start from the fact that the 8 EPL teams are matched up one-to-one with the 8 non-EPL teams. There are 2 different ways to pair Arsenal with a 5th Tier team (since there are only 2 such teams), and there are 7! = 5,040 ways to pair off the remaining teams such that the EPL and non-EPL teams are still matched one-to-one. Therefore, the total number of draws satisfying both requirements is 2 × 7! = 10,080. This yields a theoretical probability of 10,080/2,027,025 = 0.50% (Monte Carlo resulted in 250/50,000 = 0.50%).
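
And the same joint count, sketched out:

```python
import math

total_draws = math.prod(range(15, 0, -2))    # 15!! = 2,027,025 possible draws
favourable = 2 * math.factorial(7)           # 2 choices for Arsenal x 7! for the rest
print(favourable, favourable / total_draws)  # 10080, ~0.0050 (0.50%)
```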

In conclusion, there was only a 0.50% chance that the 2016/17 Emirates FA Cup Fifth Round Draw would lead to exactly 8 “minnow-giant” match-ups AND Arsenal drawing 1 of the 2 National League (5th Tier) teams. The fact that it happened anyway suggests that the drawing process may not have been 100% random.

As always, you can find my backup here. Please note, however, that I had to convert all of the Monte Carlo formulas to values and save the file as .xlsb instead of .xlsx, as the file was way too large before (71 MB).

I would also like to give credit to the Chelsea subreddit for inspiring me to explore this topic.

My Take on the Winter Olympics Medal Count

So I was browsing the web about a week ago when I saw this article: Yahoo: Fourth Place Medal Article

Like many avid fans of the Olympics, the author expresses discontent with the IOC’s current system of “ranking” countries on their performance during the games. While I do not agree with the article’s proposed alternate medal system, the author does bring up a good issue that I have thought about many times in the past: what is the best way to rank how countries perform during the Olympic Games? While most of the world seems to go with the IOC’s standard of “gold medals first”, the U.S. media sticks stubbornly with the “overall count”. Obviously, there are flaws with both schemes, and it is the goal of this article to point out those flaws and propose a fairer ranking system for the Olympic Games.

Continue reading “My Take on the Winter Olympics Medal Count”

Another Week

On Friday, I went to Tian Tong Yuan to see my aunt. Although she and her family moved to Tian Tong Yuan about two years ago, I had never been to her house. Friday also happened to be my cousin Meng Meng’s 30th birthday, so I thought it was a good opportunity to accomplish two things with one visit. I took the subway to Tian Tong Yuan and met up with Fan Wei at grandma’s house. The two of us then headed over to my aunt’s house, which was only 15 minutes away by foot. My aunt greeted us cheerfully at the door and welcomed us into a partially furnished living room. Her apartment is a lot bigger than grandma’s, she told me, so it has taken them a long time to move in and get furniture. As we talked in the dining room (there’s no couch in the living room), I couldn’t help but admire the woman sitting across from me. Although she’s almost 60, my aunt does almost all of the housework in the family. Her relationship with Meng Meng is strained at best, and her husband has been idle ever since he lost his job some time ago.

Fan Wei and I treated Meng Meng to dinner at a Korean restaurant not too far from his house, and after we said our goodbyes, I went with Fan Wei to her new apartment, where I spent the night. Although I had to sleep on the couch, I didn’t mind at all, since I was more than happy to be in a room with air conditioning again.

Looking forward, I can’t believe it’s only a week until Fan Wei’s wedding, and a week and a half until I leave for the U.S.! I’ve been watching 24 (currently on season 3), learning differential equations, and playing some old school StarCraft over the past week to keep myself busy until then. Talk about a weird combination.