Tuesday, July 28, 2009

Best Domestiques (Podium Edition)

I recently posted a statistical analysis that identified domestiques who are associated with better team results. For example, I found that when Quick Step started Kevin Hulsmans in a race last year, their best finish was an average of 10 places higher than when they did not. So you might say Hulsmans was worth 10 places to his best Quick Step teammate. I also calculated how significant the effects were in terms of the statistical likelihood that such an effect might be a random fluctuation. In doing this, I used log-transformed results to put more weight on better placings. This basically made the difference between 1st and 10th as important as the difference between 10th and 100th. Although this approach is fine for some purposes, I think it still underestimates the importance of a top finish.

This post will propose an alternate method that focuses on podium placings. In a bike race, top 10 results are satisfying only in that they suggest the potential for a 1st, 2nd, or 3rd place finish down the line. So here I will ask if certain riders increase the frequency of their team achieving a podium position. As with the previous analysis I will do this on both a year-by-year and career basis, including races between 2002 and 2009.

As an example, consider Marco Velo. From 2002 to 2008 Velo was a leadout man for Alessandro Petacchi, one of the era's dominant field sprinters, and he now performs similar duties at Quick Step. Over that time, Velo has appeared in the results of 281 races, in 69 of which a teammate (not Velo himself) finished on the podium. His teams also raced 461 times without him, with 57 podiums. So Velo's teams achieved more podiums in the far fewer races he contested: 69/281 versus 57/461. This corresponds to an odds ratio of 2.3, meaning that the odds of Velo's team making the podium were 2.3 times higher when he was in the race. That sounds pretty good, right? But, of course, you also want to know if this is a significant difference given these sample sizes. We can use a statistical test to determine that the likelihood of an effect this large arising in random data is P = 2e-5, or 0.002%. Quite significant, suggesting that Marco Velo is an excellent domestique. Good for him.
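For anyone who wants to check the Velo arithmetic, here is a minimal Python sketch using only the standard library. The function names are my own for illustration. I am also assuming a two-sided Fisher's exact test, so the P value it prints may differ somewhat from a one-sided calculation.

```python
import math
from math import comb


def odds_ratio(pod_with, races_with, pod_without, races_without):
    """Odds of a team podium with the rider divided by the odds without him."""
    a, b = pod_with, races_with - pod_with            # podium / no podium, rider present
    c, d = pod_without, races_without - pod_without   # podium / no podium, rider absent
    if b * c == 0:
        # No podiums without the rider (or no misses with him): report INF,
        # or NaN when the ratio would be 0/0.
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def fisher_exact_two_sided(pod_with, races_with, pod_without, races_without):
    """Two-sided Fisher's exact test on the 2x2 presence/podium table.

    Under the null hypothesis the number of podiums that fall in the
    rider's races follows a hypergeometric distribution. The P value sums
    the probabilities of every table at least as unlikely as the observed one.
    """
    total_races = races_with + races_without
    total_pod = pod_with + pod_without
    denom = comb(total_races, races_with)

    def prob(k):
        # Probability that exactly k of the podiums land in the rider's races.
        return comb(total_pod, k) * comb(total_races - total_pod, races_with - k) / denom

    lo = max(0, races_with - (total_races - total_pod))
    hi = min(total_pod, races_with)
    p_obs = prob(pod_with)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Marco Velo, 2002-2009: 69 podiums in 281 races with him, 57 in 461 without.
velo_or = odds_ratio(69, 281, 57, 461)
velo_p = fisher_exact_two_sided(69, 281, 57, 461)
print(f"odds ratio = {velo_or:.2f}, P = {velo_p:.1e}")
```

The odds ratio comes out near 2.3, matching the figure above. The exact counts are kept as Python integers until the final division, which avoids overflow in the large binomial coefficients.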

Using CQ data for all riders (see the riveting technical notes below and on previous posts for more details), I went searching for other extraordinarily valuable domestiques. I identified every rider/year combination with a P less than 0.01. Here are the rider, year, team, odds ratio, P value, and most common teammate on the podium for each significant finding:

The odds ratio is how many times higher the odds of a teammate reaching the podium are when the listed rider is racing (larger is better). Infinite results (INF) occur when the team never placed on the podium without the rider present. The P value is the chance that this result might have arisen from random noise (lower is better). I also did the same calculation for each rider's career, at least using the results I have from 2002-2009:

We can compare these two tables with the previous results and see that there is a fair amount of overlap. For instance, the 2008 season for Kevin Hulsmans is still significant, but now instead of saying he's worth 10 placings we can credit him with a three-fold increase in podium spots. Notably missing is the 2003 incarnation of Andrea Tonti, whom I previously declared to be the best domestique ever. Although his presence corresponded to an astounding gain of 33 placings, he wasn't around for enough teammates' podiums to make this list. So he might be an example of moving teammates into the top 10, but not all the way to the big money.

As before, I'm not implying that a domestique whose specific presence doesn't yield enhanced podium returns isn't doing his job well. He might be on a team that is always putting riders on the podium, or a team with second-rate team leaders who rarely crack the top three. Basically all I'm doing here is identifying domestiques who have shown a pattern of association with good team results. Determining whether the domestique is actually causing the better results is a judgment call that the statistics cannot make.

I prefer this method to my previous one, primarily because it's easier to understand and focuses better on top results. However, it's bedeviled by some of the same issues. A couple of major ones are:
  • False positives. The P values appear to be quite low, but since I've done thousands of tests there may be many false positives here: at a threshold of 0.01, roughly one test in a hundred would pass by chance alone if the tests were independent. However, I'm not sure how independent these tests are, so I can't easily compute a correction. I would have to do a large number of permutation tests to get an empirical idea of the precise false positive rate.

  • Disregarded cofactors. As we all know, correlation does not necessarily imply causation. An analysis like this may be fraught with causal variables that it ignores. For example, it is difficult to separate the contribution of one domestique from another, and from that of the team leader. Many of the riders on the list are from Alessandro Petacchi's leadout train (Velo, Ongarato, Tosatto). Were these guys extraordinarily suited to leading out their man, did one of them carry the weight for them all, or were they just lucky to be working for the fastest guy around? It might be impossible to separate the contributions of Petacchi and his leadout train with the results I have, but it's worth thinking about. This analysis doesn't really try. Another cofactor is the nature of the specific event. Since pack finishes are so common, domestiques who aid in sprints will have more significant results due to the greater sample size of sprints.
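The permutation idea mentioned under false positives could be sketched roughly as follows. This is not something I ran on the CQ data. The toy start lists and podium outcomes below are randomly generated, all names are hypothetical, and the setup is simplified to a single team, with a two-sided Fisher's exact test assumed. The sketch shuffles the podium outcomes across races while keeping every rider's start list fixed, so any dependence between riders who race together is preserved, then counts how many riders still fall under P < 0.01 by chance.

```python
import random
from math import comb


def fisher_p(a, n_with, c, n_without):
    """Two-sided Fisher exact P: a podiums in n_with races vs c in n_without."""
    total, pods = n_with + n_without, a + c
    denom = comb(total, n_with)

    def prob(k):
        return comb(pods, k) * comb(total - pods, n_with - k) / denom

    lo, hi = max(0, n_with - (total - pods)), min(pods, n_with)
    p_obs = prob(a)
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


def count_hits(starts, podium, alpha=0.01):
    """Number of riders whose presence/podium table gives P < alpha."""
    hits = 0
    for races in starts.values():          # races: set of race indices the rider started
        n_with = len(races)
        a = sum(podium[i] for i in races)  # team podiums with the rider present
        c = sum(podium) - a                # team podiums without him
        if fisher_p(a, n_with, c, len(podium) - n_with) < alpha:
            hits += 1
    return hits


def expected_false_positives(starts, podium, n_perm=200, seed=0):
    """Average number of 'significant' riders when podiums are shuffled at random."""
    rng = random.Random(seed)
    shuffled = list(podium)
    total = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)              # break any real rider/podium association
        total += count_hits(starts, shuffled)
    return total / n_perm


# Toy data (made up): 200 races, 30 riders who each start 80 of them.
toy_rng = random.Random(1)
n_races = 200
podium = [1 if toy_rng.random() < 0.15 else 0 for _ in range(n_races)]
starts = {f"rider_{r}": set(toy_rng.sample(range(n_races), 80)) for r in range(30)}

observed = count_hits(starts, podium)
expected = expected_false_positives(starts, podium)
print(f"observed hits: {observed}, expected by chance: {expected:.2f}")
```

Comparing the observed count of significant riders with the count expected under shuffling gives an empirical feel for how many findings could be noise, without assuming the tests are independent.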

Technical Notes: Data source is Cycling Quotient. To avoid partial result listings, I considered approximately 900 races in which more than 100 riders are listed in the results. Roughly half of the races were stages from grand tours, and the remaining results are mostly the major one-day races and lesser stage races. Individual time trials and national team events were excluded from the analysis. To avoid small sample sizes, odds ratios and P values were only computed if there were at least five results in every test set. The odds ratio is defined as [p_r/(1-p_r)] / [p_nr/(1-p_nr)], where p_r and p_nr are the frequencies of a team podium place when the rider is and is not in the race, respectively. P values are calculated using Fisher's exact test, which assumes a hypergeometric distribution for the null hypothesis.
