All Things Data Science: June 2012

Forget Paul the Octopus, Chanakya the Fish, and all other football predicting animals. It appears that the stock market can be used to predict the outcome of football matches! Now that the stock markets are not doing so well, at least we can use them to make a few bucks on football betting sites.
Take last weekend's game, for instance, when Germany played Greece. It turns out that the score evolution of this game followed a pattern very similar to that of the German-Greek spread from mid-March to mid-June 2012. The pattern is obvious in the graph below. The correlation between the two series is an impressive 93%.
The attentive reader will notice that there is a gap between the 45- and 60-minute marks. Indeed, the stock markets predicted that the third and fourth German goals would be scored earlier. Specialists are investigating whether this has to do with the break after the first half. But, other than that small gap, the fit between the two lines is very close.
These results don't come as a surprise. Earlier it was reported that Twitter can be used to predict the stock market (see Twitter Can Predict the Stock Market in Wired). And now it turns out that the stock market, in turn, can be used to predict the results of football games. Whether Twitter itself can predict football results directly remains an open question.

And now what really happened:

A couple of months ago I read an interesting article entitled "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" by Daniel Gayo-Avello, in which he explained that, despite some claims in the popular press, the predictive power of Twitter is far from proven. The following quote is from that paper, and to me it summarizes well what the problem with a lot of these claims is:

"It’s not prediction at all! I have not found a single paper predicting a future result. All of them claim that a prediction could have been made; i.e. they are post-hoc analysis and, needless to say, negative results are rare to find."

Some other interesting reading in this area is "The junk science behind the ‘Twitter Hedge Fund’", and "Sour Grapes: Seven Reasons Why That Twitter Prediction Model is Cooked." by Ben Gimpert. One of the recurring themes is that, by being selective about which data you show, you can easily relate many events that logically should be independent.

Currently there is a debate in Belgium about the construction of Uplace, a huge shopping mall on the outskirts of Brussels. Gino Van Ossel, a marketing professor at the Vlerick School, did a survey which showed that, contrary to popular belief, quite a big group was actually in favor of the plans rather than opposed to them. So far so good, but the professor's methodology was questioned in the media. The arguments used against the survey findings were not very strong, I believe, and I will not discuss them here.

However, part of the reasoning used by Gino Van Ossel looked rather odd to me and made me think about a more general problem that I would like to discuss here. Those of you who understand Dutch can find all the details on www.marketingblog.vlerick.com. In short, he found that in a sample of 654, representing Belgium as a whole, 33% were in favor, while in the region where the shopping mall would be located, with a sample of 182, 46% were in favor. This went against the popular belief that people in the neighborhood were strongly against the shopping mall. As a consequence, all kinds of arguments were used to undermine the study. Most of those arguments were not very convincing, in my opinion. One of them was that, based on about 80 people in favor of the shopping mall (46% of 182), you can't make general statements about that part of Belgium. Of course, from a statistical perspective you can, as long as you accept a certain precision at a certain confidence level.
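To put a number on "a certain precision", here is a minimal sketch of the usual normal-approximation interval for the regional subsample. The 95% confidence level and the helper name `wald_interval` are assumptions made for illustration; the survey itself may have reported its precision differently.

```python
import math

def wald_interval(p, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    z=1.96 corresponds to a 95% confidence level (an assumption here).
    Returns (lower bound, upper bound, standard error).
    """
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se, se

# Regional subsample from the survey: 46% in favor out of 182 respondents.
low, high, se = wald_interval(0.46, 182)
print(f"SE = {se:.4f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the estimate is 46% give or take roughly 7 percentage points, which is not very precise but is a perfectly legitimate statistical statement.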

The part that caught my attention, however, was when Gino Van Ossel referred to an election study with a sample of 1024 in which statements were made about one particular electorate (the Green Party), which holds about 9% of the votes, or 92 respondents in that sample. His argument was that if you accept statements based on such a small number of people in that study, you should also accept statements from other studies that rely on a similarly small number of people.

And that's the point where I no longer agree. For a given confidence level, the precision depends not only on the sample size but also on the proportion of successes. Let's formulate that a bit more formally. Assume we have a small sample $n_1$ and a large sample $n_2$ from the same population ($n_1 < n_2$), with an equal number of successes ($S_1=S_2=S$). The question is: what happens to the standard error in both cases? Obviously $p_1={S \over n_1}>p_2={S\over n_2}$. For simplicity's sake we will, for the moment, assume that both $p_1$ and $p_2$ are below $0.50$. On the one hand, as $n_1<n_2$, the standard error in the first case will be larger than in the second case ($SE_1>SE_2$). On the other hand, $p(1-p)$ increases with $p$ as long as $p<0.50$, so $p_1>p_2$ also makes the standard error in the first case larger ($SE_1>SE_2$). Both effects work in the same direction.

Formally, we can say that:
$$SE_1=\sqrt{p_1(1-p_1)\over n_1}$$ and
$$SE_2=\sqrt{p_2(1-p_2)\over n_2},$$
with $SE_1$ and $SE_2$ representing the standard errors of the two cases. Let's now consider the ratio of these two standard errors:
$$
{SE_2\over SE_1}={\sqrt{p_2(1-p_2)\over n_2}\over\sqrt{p_1(1-p_1)\over n_1}}
$$
For convenience's sake we'll square both sides and rearrange:
$$
{SE_2^2\over SE_1^2}={ n_1 p_2(1-p_2)\over n_2 p_1(1-p_1)}
$$
Let's call the ratio of $n_2$ and $n_1$, $k$, so that we can express $p_2={S\over k n_1}={p_1 \over k}$. This yields:
$${SE_2^2\over SE_1^2}={ {p_1 \over k}(1-{p_1 \over k}) \over k p_1(1-p_1)} $$
$${SE_2^2\over SE_1^2}={ 1-{p_1 \over k} \over k^2 (1-p_1)} $$
$${SE_2^2\over SE_1^2}={ k-p_1 \over k^3 (1-p_1)} $$
$${SE_2\over SE_1}=\sqrt{ k-p_1 \over k ^3(1-p_1)} $$
Thus in this case we can express the gain or loss in precision in terms of the initial proportion $p_1$ and the relative sample size $k$. In words: the ratio of the standard errors equals the square root of the difference between the sample ratio and the initial probability of a success, divided by the product of the third power of the sample ratio and the initial probability of a failure.
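The closed form can be checked numerically against the two standard error formulas it was derived from. The sketch below is only a sanity check; the function names and the check values ($n_1=400$, $S=100$, $k=2.5$) are hypothetical and chosen purely for illustration.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

def se_ratio(p1, k):
    """SE_2 / SE_1 when the sample grows by a factor k with successes fixed."""
    return math.sqrt((k - p1) / (k**3 * (1 - p1)))

# Hypothetical check values: n1 observations, S successes, sample ratio k.
n1, S, k = 400, 100, 2.5
n2 = k * n1

# Ratio computed directly from the two standard errors...
direct = standard_error(S / n2, n2) / standard_error(S / n1, n1)
# ...and from the closed-form expression in p1 and k only.
closed_form = se_ratio(S / n1, k)

print(direct, closed_form)
assert math.isclose(direct, closed_form)
```

Both routes give the same ratio, which is what the derivation promises: once $p_1$ and $k$ are known, the absolute sample sizes no longer matter.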

We'll use an example that is relatively close to the one discussed by Gino Van Ossel: $n_1=200$, $n_2=1000$, $S_1=S_2=S=80$, and thus $p_1=0.40$, $p_2=0.08$ and $k=5$. Plugging those numbers into the formula yields 0.2477. So a statement about the proportion of respondents opting for the Green Party will be about 4 times more precise than a statement about the proportion of respondents in favor of Uplace, even though the numbers of 'successes' are close to each other.
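Spelled out, the arithmetic behind the 0.2477 is as follows, with a direct computation of both standard errors as a cross-check (all figures rounded):
$$
{SE_2\over SE_1}=\sqrt{5-0.40 \over 5^3(1-0.40)}=\sqrt{4.6\over 75}\approx 0.2477
$$
$$
SE_1=\sqrt{0.40\times 0.60\over 200}\approx 0.0346,\qquad SE_2=\sqrt{0.08\times 0.92\over 1000}\approx 0.0086
$$
The inverse of the ratio, $1/0.2477\approx 4.04$, is where the "about 4 times" comes from.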

Since in this case we want to consider two samples with the same number of successes, we can, from a binomial distribution perspective, reformulate the problem as follows: what is the change in standard error if we increase the sample size but keep the number of successes constant? In other words, what happens if we only add failures, not successes? Likewise, we can consider the case where, relative to the original situation, we decrease the sample by taking away failures (and thus leave the number of successes constant).
We can also see that if $p_1>k$ the ratio is not defined, because the expression under the square root becomes negative. As $p_1\le1$ this will only happen when $k<1$, i.e. when we decrease the sample size rather than increase it. Since we are keeping the number of successes constant, we can't decrease the sample any further once $k$ has dropped to $p_1$: at that point only successes are left.
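The boundary is easier to see in terms of counts. The following sketch uses illustrative numbers (80 successes in an original sample of 200) and a hypothetical helper that returns `None` whenever the resized sample would contain fewer observations than successes.

```python
import math

def se_fixed_successes(successes, n2):
    """Standard error for a sample of size n2 with the successes held fixed.

    Returns None when n2 is smaller than the number of successes, which is
    the case p1 > k where the ratio of standard errors is undefined.
    """
    if n2 < successes:
        return None
    p2 = successes / n2
    return math.sqrt(p2 * (1 - p2) / n2)

# Illustrative numbers: 80 successes in 200 observations, so p1 = 0.40.
successes, n1 = 80, 200
for n2 in (60, 80, 100, 200):
    k = n2 / n1
    print(f"k = {k:.2f}: SE_2 = {se_fixed_successes(successes, n2)}")
```

With 80 successes the sample cannot shrink below 80 observations, which corresponds to $k=0.40=p_1$. At that point every observation is a success, so the estimated standard error is zero, and for any smaller $k$ there is nothing left to compute.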

Of course, we're not suggesting that anyone follow this procedure in practice, as it would introduce bias, but it helps explain why the comparison made by Prof. Van Ossel is not warranted.

That said, we can also think of what happens if we let the value of $p_1$ vary between 0 and 1. Similarly we can inspect what happens as $k$ goes from 0 to 1, i.e. the case of sample sizes smaller than the original, followed by what happens as $k$ increases to plus infinity, i.e. the case of an increasing sample size.

The picture above illustrates this graphically. In this case we let $k$ vary only from 0 to 4. So, left of the line where $k=1$, we effectively look at cases where the sample is decreased by taking out failures (and thus leaving the number of successes constant). It will come as no surprise that the standard error decreases as we increase the sample. As the initial probability $p_1$ becomes higher, the ratio becomes undefined over a larger part of that region (wherever $p_1>k$), and hence it is not drawn there.
It might be difficult to see, but where $k=1$ the ratio of the standard errors is obviously always 1. As we move to values of $k$ higher than 1, the ratio falls below 1, indicating that increasing the sample size generally decreases the standard error. But not always: at very high levels of $p_1$, adding failures can actually increase the standard error.
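A small numerical sketch makes the exception visible. It evaluates the ratio derived earlier for a few starting proportions; the choice of $k=1.5$ and the listed values of $p_1$ are hypothetical and only serve as an illustration.

```python
import math

def se_ratio(p1, k):
    """SE_2 / SE_1 for sample ratio k with the number of successes fixed."""
    return math.sqrt((k - p1) / (k**3 * (1 - p1)))

# Hypothetical scenario: the sample grows by 50% and every added case is a failure.
k = 1.5
for p1 in (0.40, 0.60, 0.80, 0.95):
    ratio = se_ratio(p1, k)
    direction = "increases" if ratio > 1 else "decreases"
    print(f"p1 = {p1:.2f}: ratio = {ratio:.3f}, standard error {direction}")
```

For the two lower proportions the ratio stays below 1, while for $p_1=0.80$ and $p_1=0.95$ it exceeds 1. Setting $k-p_1=k^3(1-p_1)$ and solving for $p_1$ suggests that, for $k>1$, the turning point lies at $p_1=k(k+1)/(k^2+k+1)$, which is about 0.79 for $k=1.5$.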

That said, to come back to the initial problem regarding the Uplace shopping mall, the above illustrates that when you compare a sample of about 1000 with about 80 successes to a sample of 200 with 80 successes, you should take into consideration that the precision of the former is about 4 times higher than that of the latter.


Monday, June 25, 2012

Eerie similarity: Stockmarket predicts Germany-Greece soccer game result!

Sunday, June 10, 2012

Uplace and the binomial distribution
