Wednesday, July 22, 2009

SVD Analysis of the 2007 Tour

I previously posted about using singular value decomposition (SVD) to analyze stage race results and estimated that, based on the patterns in the results, grand tours could be reduced to about 3 or 4 stages without losing much information. The character of these “composite stages” depends on the results of each race and has to be worked out case by case. Here I consider the 2007 Tour de France, which was the “simplest” tour, effectively having only 2.8 stages.

SVD basically rearranges all of the results from all of the stages into a series of composite stages, the modes, that appear in the data with decreasing weights. The first mode is the most prominent pattern in the data, the second mode the second most prominent, etc., and the weight (aka the singular value) of each mode is that composite stage’s importance in the global data set. The rearrangement can be represented as a raster plot. Here are the modes and weights for the 2007 Tour:

The SVD modes are read as columns in the raster plot, with red and green corresponding to greater and lesser times for each stage.
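For readers who want to try this, the decomposition itself takes only a few lines of NumPy. This is a minimal sketch, not the code behind the plots in this post: the matrix `time_lost` holds made-up numbers, and using time lost to the stage winner as the input (with no centering or scaling) is an assumption.

```python
import numpy as np

# Hypothetical toy data: rows are riders, columns are stages,
# entries are seconds lost to the stage winner (made-up numbers).
time_lost = np.array([
    [ 0.0,   5.0, 0.0,  10.0],   # GC contender
    [ 0.0,  40.0, 0.0,  75.0],   # second-tier climber
    [ 0.0, 600.0, 0.0, 900.0],   # sprinter in the autobus
    [ 0.0, 650.0, 0.0, 950.0],   # another autobus rider
    [12.0, 300.0, 0.0, 500.0],   # domestique
])

# Thin SVD: time_lost == U @ diag(weights) @ modes
# - each row of `modes` is one composite stage (a pattern across stages)
# - `weights` are the singular values, returned in decreasing order
# - each row of `U` says how strongly one rider exhibits each mode
U, weights, modes = np.linalg.svd(time_lost, full_matrices=False)

# The stage that contributes most to the first mode (0-based column index).
dominant_stage = int(np.argmax(np.abs(modes[0])))
```

In this toy table the first mode is dominated by the stages with the biggest gaps, which is the same effect that puts the mountain stages on top in a real Tour. Note that the overall sign of each mode is arbitrary: flipping a mode and the matching column of `U` together gives the same data.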

Looking up the first mode, it is not surprising that Stages 8, 14, 15, and 16 were mountain stages in the 2007 Tour. In most grand tours, the mountain stages dominate the major modes because the time gaps in the mountains are so much greater than those in other stages. Mode 1 in particular has a much larger weight than any other mode, and it describes 93% of the results in the Tour. It can be thought of as the base pattern for any rider’s results. This first mode alone correctly determines the three riders on the final podium (Contador, Evans, Leipheimer), although it interchanges Evans and Leipheimer’s final placings. It primarily encodes the climbing stages, but also includes some information on the ITTs in Stages 13 and 19. The second mode encodes a correction to the first, which is a time lost in the Alps relative to the Pyrenees, presumably representing riders who grow stronger in the last week. Mode 3 is mostly gains on Stage 9 paired with losses on Stage 15. Modes 4-8 contain further corrections to the mountain stages. Modes 9-14 generally account for time gaps in stages where breaks played a role and details of time trials. Modes 15-21 are slight differences in sprint stage and prologue performances – these weights are extremely small since most of the field finished sprint stages with the same time.
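A figure like "Mode 1 describes 93% of the results" comes from comparing the weights with each other, and there is more than one common way to do that. The sketch below shows two of them; which one, if either, matches the 93% quoted above is not stated here, and the helper name `mode_shares` and the example weights are hypothetical.

```python
import numpy as np

def mode_shares(weights):
    """Return two common 'share of the data' measures for each mode.

    linear: each singular value divided by the sum of singular values
    energy: each squared singular value divided by the sum of squares
            (the fraction of the total sum of squared entries captured)
    """
    weights = np.asarray(weights, dtype=float)
    linear = weights / weights.sum()
    energy = weights**2 / np.sum(weights**2)
    return linear, energy

# Made-up weights for a three-mode example, not the 2007 Tour values.
example_weights = np.array([9.0, 2.0, 1.0])
linear, energy = mode_shares(example_weights)
```

The squared version always gives the leading mode a larger share than the linear one, so the choice of measure matters when quoting a single percentage.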

Each rider’s individual results can be recomputed by summing up these patterns, with each pattern separately weighted according to the individual rider. A GC rider, therefore, will have a small weight for Mode 1 since they didn’t lose much time on the decisive stages, whereas a consistent member of the autobus will have a large weight for that mode corresponding to their large time losses in the mountains. We can, for instance, look at the first two modes for each rider. This plot shows the extent to which each rider’s results (dot) exhibited Mode 1 (x-axis) and Mode 2 (y-axis), with final GC placing running from red to blue:

GC riders have lower values for Mode 1 since they lost the least time on important stages. The blob of blue on the far right is the autobus riders who consistently lost a lot of time. Zooming in on the top 10 we get an idea of how the GC is scattered:

Contador has the smallest Mode 1 component – he lost the least time in the big stages. Along with his teammates Leipheimer and Popovych, he also did relatively well later in the race, as signified by his large Mode 2 component. Although Leipheimer did well overall (Mode 1), his early losses in the Alps (Mode 2) were not great enough to overcome Evans. We could continue to add modes and represent the data in higher dimensions, but I think two is enough for a blog post.
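The per-rider mode components and the "sum up the patterns" reconstruction can be sketched as follows. This is an illustration under assumptions, not the code used for the plots: the data are made up, the input is assumed to be time lost to each stage winner, and the helper names `rider_components` and `reconstruct` are hypothetical. Because the sign of each SVD mode is arbitrary, the sketch flips each mode so its largest entry is positive, which makes "small Mode 1 component" mean "little time lost".

```python
import numpy as np

def rider_components(time_lost):
    """Return (components, modes) with a fixed sign convention.

    components[i, k] is how strongly rider i exhibits mode k
    (the rider's coordinate on that mode, singular value included).
    """
    U, weights, modes = np.linalg.svd(time_lost, full_matrices=False)
    components = U * weights
    # Flip each mode so its largest-magnitude entry is positive.
    rows = np.arange(modes.shape[0])
    signs = np.sign(modes[rows, np.argmax(np.abs(modes), axis=1)])
    signs[signs == 0] = 1.0
    return components * signs, modes * signs[:, None]

def reconstruct(components, modes, k):
    """Rebuild every rider's stage results from the first k modes only."""
    return components[:, :k] @ modes[:k, :]

# Hypothetical toy data: riders x stages, seconds lost to the stage winner.
time_lost = np.array([
    [ 0.0,   5.0, 0.0,  10.0],   # GC contender
    [ 0.0,  40.0, 0.0,  75.0],   # second-tier climber
    [ 0.0, 600.0, 0.0, 900.0],   # sprinter in the autobus
    [ 0.0, 650.0, 0.0, 950.0],   # another autobus rider
    [12.0, 300.0, 0.0, 500.0],   # domestique
])

components, modes = rider_components(time_lost)

# Coordinates for a Mode 1 vs Mode 2 scatter plot, one point per rider.
mode1, mode2 = components[:, 0], components[:, 1]

# Overall time lost according to the full data and to Mode 1 alone.
gc_full = time_lost.sum(axis=1)
gc_mode1 = reconstruct(components, modes, 1).sum(axis=1)
```

Comparing `np.argsort(gc_mode1)` with `np.argsort(gc_full)` shows how much of the final order the first mode gets right on its own; adding modes one at a time (`k = 2, 3, ...`) shows where the remaining placings get sorted out.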

The 2007 Tour is a fairly typical case for SVD. Most Tours show a similar dominant mode that combines climbing and time trial stages, encompassing most of the GC. It usually takes the addition of one or two more modes to work out the precise order of the podium and a couple more to fill in the details of the top 10, but the majority of the activity encoded in the smaller modes is mid-GC reassortment due to breaks.

I think this generally means Tour results can be summarized by a few patterns in the data, and therefore major changes in riders’ performances over the course of a grand tour are very rare. This is, however, a global and quantitative view of the data and should be taken as such; the difference between fourth and seventh place is still quite important to an individual rider and their supporters! Vinokourov’s minuscule time gains on the Champs Élysées in 2005 do not appear until Mode 19 in a 2005 SVD analysis, but they certainly mattered to him and Levi Leipheimer.

Next I’ll consider the 2006 Tour de France. How does Oscar Pereiro’s unorthodox victory look through SVD glasses?
