Fun with Excel #4 – The Laws of Attraction II

As some of you may recall, I kicked off the Fun with Excel series with a post on attraction, where I hoped to explore the mechanics of physical attraction from a statistical perspective. Due to the amount of feedback I have received (positive, skeptical, or otherwise), I have decided to write a follow-up post.

This couple is happy, but are you?

In the first part of The Laws of Attraction, I focused solely on physical attraction, and on how bias in our perception of attractiveness affects the search for a compatible partner. In part two, I focus on the bigger picture: given a set of personal traits, what is the probability that you will find someone with those traits at the specific level that you desire?

Background: In the song One In A Million, Ne-Yo sings about a girl whom he calls “one in a million.” Of course, not content with just enjoying the music, I wondered to myself what it actually means to be “one in a million.” One way of measuring this is by breaking attraction down into a larger set of personality traits and trying to quantify our desires, which is essentially what online dating services do with their “matching formulas.” For the purposes of our exercise, let’s say you have a list of 10 distinct characteristics that you believe to be important and that you actively look for when searching for a partner. You might be pickier on some traits than others, but it isn’t too hard to quantify your objectives. As in my previous project, I quantify these objectives in terms of percentiles, which, at least from a guy’s perspective, is pretty straightforward. For example, I might say, “I’m only interested in a girl who’s in the 80th percentile for Trait 1, the 90th percentile for Trait 2, the 50th percentile for Trait 3…” and so on and so forth. Now, the question is “what are the chances that such a girl exists?” A closely related question is “how many such girls are out there?”, followed by the not-so-fun reality check of “what are the odds that I’ll actually find such a girl?”

The Model: While we won’t tackle the last question in this post, the first two are pretty straightforward to work out mathematically. For each trait, the probability of finding someone at the Xth percentile or higher for that trait is (100 − X)%. For multiple traits, all we have to do is multiply these probabilities together, but the key assumption here is that all the traits are independent. Obviously, this isn’t true in real life, but we’ll revisit this point in a little while.
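
For the concretely minded, here is what that multiplication looks like in code (a minimal Python sketch rather than the original Excel workbook; the function name and trait list are mine, purely for illustration):

```python
from functools import reduce

def match_probability(percentile_targets):
    """Probability that a random candidate clears every percentile
    threshold at once, assuming the traits are independent."""
    survival = [(100 - p) / 100 for p in percentile_targets]  # P(trait >= pth pct)
    return reduce(lambda a, b: a * b, survival, 1.0)

# The example from the Background section: 80th percentile for Trait 1,
# 90th for Trait 2, 50th for Trait 3.
p = match_probability([80, 90, 50])
print(f"{p:.4f} -> 1 in {1 / p:,.0f}")  # 0.0100 -> 1 in 100
```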

Assuming we start with a set of 10 traits, I will define a person with N “Perfect” Traits as someone who ranks at the 90th percentile or higher in N traits, and at the 50th percentile or higher in the remaining 10 − N traits. Thus, assuming a world population of 7.12 billion, a 50/50 male/female split, and that you are heterosexual, the number of potential partners with 0 “Perfect” Traits walking the planet is 3,476,563, or 1 in 1,024 (the mathematically inclined should immediately recognize that 1,024 = 2^10). At the other end of the spectrum, there are theoretically only 18 people with 9 “Perfect” Traits, or 1 in 200 million. Note that a person with 10 “Perfect” Traits effectively doesn’t exist: the probability works out to 1 in 10 billion, longer odds than the entire population of the planet can cover. At this point, the astute reader will note one possible answer to Ne-Yo’s earlier question: if you consider a smaller set of 6 traits rather than 10, a “one in a million” girl would simply be a girl who has 6 “Perfect” Traits (all 6 traits at the 90th percentile or higher) in that scenario.
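
For readers who want to check the arithmetic, the figures above are consistent with counting, for each N, the comb(10, N) ways of choosing which traits are “Perfect” (probability 0.1 each) and requiring the rest to clear the 50th percentile (probability 0.5 each). A quick sketch, again with Python standing in for the spreadsheet:

```python
from math import comb

POPULATION = 7_120_000_000
CANDIDATES = POPULATION / 2  # 50/50 male/female split, heterosexual searcher

def perfect_traits(n_traits, n_perfect):
    """Probability of ranking at the 90th percentile or higher in
    n_perfect traits and at the 50th or higher in the rest, assuming
    independent traits."""
    return comb(n_traits, n_perfect) * 0.1**n_perfect * 0.5**(n_traits - n_perfect)

# Reproduces the 1 in 1,024, 1 in 200 million, and 1 in 10 billion figures.
for n in (0, 9, 10):
    p = perfect_traits(10, n)
    print(f"N={n:2d}: 1 in {1 / p:,.0f}, roughly {CANDIDATES * p:,.1f} people")
```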

The Results: I plotted the entire spectrum of N “Perfect” Traits for the 10-trait scenario, arriving at the following graph:

It should be no surprise that our graph strongly resembles a normal curve, as we are working with a binomial distribution.

I suppose the lesson here is that it doesn’t pay to be picky, but recall the very important (and incorrect) assumption we made earlier: that all traits occur independently of one another. In the real world, this couldn’t be further from the truth. Creativity may be correlated with Curiosity, Honor may be correlated with Kindness, and Intelligence may be highly correlated with (or the cause of) all the other traits. Accounting for the dependencies among all 10 traits would require us to estimate both marginal and conditional probabilities, which would not only be difficult but would also complicate our model very quickly. Statistical mumbo jumbo aside, what this means is that the probabilities estimated by a simple binomial model are far too conservative (too low). This should be great news for all the picky daters out there.
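
To see the direction of the bias concretely, here is a small simulation sketch. It uses an equicorrelated multivariate normal as a stand-in for dependent traits; the 0.5 correlation (and the normal model itself) is an arbitrary assumption for illustration, not an estimate of anything:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_traits, n_people = 10, 1_000_000
rho = 0.5  # assumed pairwise correlation between traits

# Correlated standard-normal "trait scores" for a synthetic population.
cov = np.full((n_traits, n_traits), rho)
np.fill_diagonal(cov, 1.0)
scores = rng.multivariate_normal(np.zeros(n_traits), cov, size=n_people)

# Fraction of the population clearing the 90th percentile on all 10 traits.
threshold = norm.ppf(0.90)
correlated = (scores >= threshold).all(axis=1).mean()

print(f"independent traits: {0.1**n_traits:.1e}")  # 1.0e-10
print(f"correlated traits:  {correlated:.1e}")     # orders of magnitude higher
```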

An alternative way of tackling the dependent-traits problem is simply to consider a smaller set of traits. For example, if we created a list of 10 traits and then realized that two of them were very highly correlated with each other, we could eliminate one of them and consider a 9-trait model instead, which would more accurately reflect what the actual probabilities might look like in real life. To that point, I also plotted graphs for scenarios involving 7 traits and 5 traits:

Note that as we decrease the number of traits, the number of potential candidates increases exponentially. So if you only considered 5 main traits, and furthermore were only picky about 3 of them (3 “Perfect” Traits in the graph above), then you would be looking at a probability of 1 in 400. Not bad.
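
Plugging those numbers into the earlier counting sketch confirms the figure:

```python
p = perfect_traits(5, 3)  # 5 traits, picky about 3 of them
print(f"1 in {1 / p:,.0f}")  # 1 in 400
```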

Great Success!

Conclusion

At the end of the day, it is perhaps silly to attempt to model real-life human dynamics with 50 lines in Excel. But dismissing it on those grounds would be missing the point of the exercise. Thinking about real-world problems from a different perspective (whether psychological, statistical, or otherwise) can shed new light on an issue, or simply affirm something we already knew or suspected. Even if it is only the latter, there is still value in being able to connect the dots between a variety of different frameworks.

As for me, my dream girl in the 10-trait model is about 1 in 5,925,926, and about 1 in 53,333 in the 5-trait model. I’m not sure if I’ll ever find her, but it’s satisfying to know that she’s out there.

-J

Fun with Excel #3 – Corruption in the NBA?

My father was a big fan of the Chicago Bulls back in the ’80s and ’90s, so I had the good fortune of watching some of the best playoff basketball (i.e. Michael Jordan) that the NBA (and the world) has ever witnessed. Perhaps that is why the last decade or so of NBA basketball has seemed noticeably less exciting. It is generally agreed among basketball fans that the game as it is played today is (a lot) less physical (and perhaps less exciting) than it once was.

Officiating has also seemingly become a bigger determinant of results, and as in virtually all professional team sports, the blame often lands on the referees. “If it weren’t for that call, they would have won the game,” is a phrase we hear all too often, and one that I am guilty of uttering as well. However, have changes in officiating really been that significant over the last few decades, and if so, how would we measure such a phenomenon? The answer, of course, lies in the numbers.

Luckily, statistics for the NBA are readily available; for the purposes of my project, I decided to look at playoff statistics from the 1983-84 season through the most recent 2012-13 season. Even when the data is easily accessible, however, the most time-consuming aspect of a project is often collecting the data and organizing it in a way that makes it easy to analyze. This was no exception. Fortunately, with a little VLOOKUP and text-parsing magic (the latter is needlessly complex in Excel), I was able to largely automate the process of converting 30 years of raw playoff data into something I could process more easily.
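
For anyone repeating the exercise outside of Excel, the parsing step usually takes only a few lines in a general-purpose language. A sketch, assuming a hypothetical CSV export with one “date, visitor, visitor points, home, home points” row per game (the real layout will differ):

```python
import csv

def parse_games(path):
    """Turn raw game rows into records with a winner and a point
    differential. Assumes a hypothetical 'date, visitor, visitor_pts,
    home, home_pts' layout."""
    games = []
    with open(path, newline="") as f:
        for date, visitor, v_pts, home, h_pts in csv.reader(f):
            v, h = int(v_pts), int(h_pts)
            games.append({
                "date": date,
                "winner": home if h > v else visitor,
                "point_diff": abs(h - v),
            })
    return games
```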

My first goal was to see if there were any high-level trends in the NBA playoffs over time, in particular the number of games played and the point differential in each game. Moreover, I wanted to analyze these metrics by playoff round (i.e. first round (1R), conference semifinals (2R), conference finals (3R), and finals (4R)); a sketch of this per-round aggregation follows the list below. If officiating actually had a measurable impact on playoff results, we might expect to find the following:

  • An overall longer playoff campaign
  • Smaller average point differentials, to convey the appearance of “closer” games
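
Here is the per-round aggregation sketch mentioned above (pandas standing in for Excel pivot tables; the file name and column names are assumptions about the cleaned data, not the actual workbook layout):

```python
import pandas as pd

# Assumed columns in the cleaned data: season, round (1-4), home_pts, away_pts.
games = pd.read_csv("playoff_games.csv")
games["point_diff"] = (games["home_pts"] - games["away_pts"]).abs()

by_round = games.groupby(["season", "round"]).agg(
    games_played=("point_diff", "size"),
    avg_point_diff=("point_diff", "mean"),
)
print(by_round)
```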

Why would the NBA want any of these things to happen? The answer is simple: profits. More games played/closer games = more tickets sold/higher TV ratings. In fact, the NBA switched from a best-of-five format to a best-of-seven format in the first round starting in the 2003 playoffs.

The Results (and the data)

I’ll make a few observations, but the data really speaks for itself here. In the first chart, we see that after adjusting for the NBA’s change in playoff format starting with the 2002-03 campaign, neither the average number of games played in the playoffs nor the average number of games played per round has shown any noticeable shift over time. The average point differential chart tells the same story, and in fact both charts seem to suggest some cyclical trends over time. Lastly, the average free throw attempts and fouls chart actually displays a noticeable decrease over time on a per-game basis. Perhaps this is a testament to just how physical the game was back in the 1980s and ’90s, as MJ himself has suggested on many an occasion.
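
On the format adjustment: one simple way to put pre- and post-2003 first rounds on the same footing (a sketch of the idea, not necessarily the exact adjustment behind the charts) is to compare games played against the maximum possible for each series:

```python
def max_series_games(season_end_year, rnd):
    """Maximum series length: best-of-5 in the first round before the
    2003 playoffs, best-of-7 otherwise."""
    return 5 if rnd == 1 and season_end_year < 2003 else 7

def series_utilization(games_played, season_end_year, rnd):
    """Games played as a share of the maximum possible for that series."""
    return games_played / max_series_games(season_end_year, rnd)
```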


Conclusion

The data doesn’t seem to indicate any obvious playoff trends that may have been caused by officiating. However, more granular foul data (which may not be available) might help clarify the story. In particular, even if the average number of fouls per game has trended down over the last 30 years, have the types of fouls called changed in any significant way? Perhaps more calls come during particularly tight stretches of games, or conversely, during blowouts, to ensure that the losing team is “still in it.” Of course, all of this is pure speculation, and without hard evidence, it is difficult to move forward. As Sir Arthur Conan Doyle once said through his most famous character, Sherlock Holmes, “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” Until such facts are found, our theories will remain theories.