Wednesday, July 22, 2009

The Tour de France has 3.4 Stages

With a time trial and a mountain stage remaining in this year’s Tour de France, there is no lack of assurances that the Tour is not over. But once the GC has been sorted out by the early stages, how much do the results actually change? Is there any major variance in the results among the different types of stage, or can the Tour be reduced to one representative mountain stage, a representative time trial, and a typical sprint stage?

To determine how complex the Tour is, I calculated the singular value decomposition (SVD) of stage finish times. SVD is a linear algebra technique that rearranges data into a series of features, called modes, ordered by weight from the most important to the least important. For example, it is used in image compression to reduce a matrix of pixels to an approximate matrix that is reconstructed from a handful of modes. Furthermore, by looking at how quickly the mode weights decrease, one can estimate the effective number of independent components in the data. A famous example was a study of correlated voting records in the US Supreme Court, which concluded that the nine justices could be approximated well by 4.7 independent justices with uncorrelated voting patterns. Another way of stating this is that the information content of US Supreme Court decisions is the same as that of a court of 4.7 justices rather than nine.
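As a rough illustration of the compression idea described above, here is a minimal Python sketch that rebuilds a matrix from only its first few modes. It assumes NumPy is available; the function name `low_rank_approx` and the parameter `k` are made up for this example and are not part of the analysis in this post.

```python
import numpy as np

def low_rank_approx(matrix, k):
    """Rebuild `matrix` using only its k largest SVD modes (hypothetical helper)."""
    # u holds the left modes, s the mode weights (largest first), vt the right modes.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale each kept left mode by its weight, then project back with the right modes.
    return (u[:, :k] * s[:k]) @ vt[:k, :]
```

If the weights in `s` drop off quickly, a small `k` already reproduces the matrix closely, which is the sense in which the data has only a few effective components.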

Looking at Tour results, we can ask the same question. The Tour is usually 21 stages, but we can perform SVD analysis and determine the effective number of stages. I did this for a number of recent grand tours using results for time lost on each stage from Cycling Quotient. Here are the results:

The average Tour effectively has 3.4 stages, the average Giro 3.3, and the average Vuelta 4.3. The most complex tour here is the 2003 edition, which involved multiple long breaks and a closely contested GC. The Vuelta is consistently more complex than the Tour and Giro, perhaps because breaks are allowed freer rein or because riders in the lead are more likely to crack after a long racing season.

So does the SVD analysis reduce the race to a single climbing, time trial, and sprinting stage, with perhaps a “half stage” for hilly transition stages? To fully answer this, we need to look at each stage race on a case-by-case basis. I will look at a few races in forthcoming posts. Briefly, since the consistently good climbers also tend to be the better time trialists, time trials and climbing stages are often combined into a single representative “GC mode”. Additional modes encode variations around this main trend, such as a difference for some riders between the Alps and Pyrenees, and breakaway results. It should be kept in mind that this analysis is entirely data driven: the modes correspond to how the results panned out rather than to any preconceived notion of which stages are important in a grand tour. Since I do not have a handy source for complete results from the Indurain years, I could not do the analysis for them. I imagine, however, that the balance of time trialing and climbing might have been different then.


Technical notes: SVD is a linear algebra routine that produces a unique set of singular values for a given data matrix. The results matrix was laid out as stages x riders, such that each matrix element is the time lost in seconds by a given rider on a given stage. Time bonuses were not factored in because of personal laziness. Riders who did not finish all stages were excluded since SVD chokes on missing data. The effective number of stages is computed from the Shannon entropy of the squared singular values, each expressed as a fraction of their sum. Missing grand tours are due to incomplete results in the CQ database.
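For anyone who wants to try this on their own results, the calculation in the notes above might look roughly like the Python sketch below. It assumes NumPy, and it assumes the effective number is the exponential of the entropy (a common convention that gives 1 for a single mode and N for N equal modes); the notes do not spell that step out, so treat it as my reading rather than the exact recipe. The function name `effective_stages` is made up for illustration.

```python
import numpy as np

def effective_stages(results):
    """Estimate the effective number of stages from a stages x riders matrix.

    `results` holds the time lost in seconds per rider per stage, with no
    missing entries (riders who did not finish are assumed already removed).
    """
    # Only the singular values (mode weights) are needed here.
    s = np.linalg.svd(np.asarray(results, dtype=float), compute_uv=False)
    # Fractional squared singular values: each mode's share of the total.
    p = s**2 / np.sum(s**2)
    # Drop exact zeros so log() stays finite; they contribute nothing anyway.
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    # Assumption: report exp(entropy) so the answer reads as a count of stages.
    return float(np.exp(entropy))
```

A race in which every stage produced the same time gaps would come out at one effective stage, while a race whose stages were all unrelated and equally decisive would come out at the full stage count.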
