Wednesday, March 16, 2011

Dipsea Data (Initial Data)

In response to a thread I've been participating in on the Tamalpa Runners Message Board, I've begun a statistical analysis of Dipsea records. The thread started with the question of whether the handicap for 8-year-old boys puts them at a disadvantage relative to other runners. I don't know anything about the methods actually used and regrettably have not had a chance to buy and read Dipsea expert Barry Spitz's book. That may turn out to be a benefit for this small project, since it lets me consider a fresh approach.

I promised an analysis which would determine the ideal handicaps for every age/gender combination. As everyone in Marin County, CA and many people outside of Marin know, the Dipsea is a 7.1 mile trail race, famous for allowing alternate routes and for giving runners head starts based on age and gender. I proposed that handicaps should be based on projected times for each age/gender combination, as determined by a best-fit line calculated from historical Dipsea times.

Barry provided me with all of the single age/gender records. There are 162 total records - women aged 5-77 and men aged 6-95, excluding men aged 91 as the record holder Jack Kirk's official time is unavailable. Using this data I've developed 3 graphs which plot each record. Each graph also has a best-fit line. A best-fit line is the one that most closely approximates the Y-value (time) at every X-value (age). For this line I used a polynomial formula with an order of 6. The best-fit line is determined by minimizing the total squared difference between the line and the data points.
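For anyone who wants to try this at home, here is a minimal sketch of an order-6 polynomial fit in Python. I'm assuming an ordinary least-squares fit via NumPy's Polynomial.fit, which may not match the graphing tool's calculation exactly. The function name fit_best_fit_line and the data are made up for illustration; these are not the actual Dipsea records.

```python
import numpy as np

def fit_best_fit_line(ages, times, degree=6):
    """Least-squares polynomial fit of record time (minutes) against age."""
    # Polynomial.fit rescales the ages internally, which keeps a
    # degree-6 fit numerically well behaved.
    return np.polynomial.Polynomial.fit(ages, times, deg=degree)

# Made-up, roughly U-shaped placeholder data; NOT the real records.
ages = np.arange(6, 96)
rng = np.random.default_rng(0)
times = 50 + 0.02 * (ages - 30) ** 2 + rng.normal(0, 2, ages.size)

line = fit_best_fit_line(ages, times)
projected = line(ages)  # projected time at each age

# The quantity the fit minimizes: the sum of squared differences
# between the line and the data points.
sse = float(np.sum((times - projected) ** 2))
```

The projected value at each age is what a handicap would be built from under this proposal.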

Here are the graphs with brief explanation:

This is the master graph. All of the data is reflected here. Women's data (red) and men's data (blue) are plotted. The women's best-fit line is in green and the men's in orange. One thing I noticed is that the women's best-fit line does not appear to be a U. The men's is a U - starting at a young age, times get faster up to a certain point and then get slower. But the women's seems to bottom out around age 19, then rise a little, then bottom out again in the mid 40's. The data does this too - as you can see on Barry's blog, women in their teens can run under an hour, women in their early 20's seem unable to, and women starting at age 26 seem to be able to again. There is possibly a confounding variable here, as most of the sub 1 hour times were from the 1980's and the best times at other ages tend more toward recent years. I will talk to Barry about this, as I speculate the gap is due to course differences.

You can also see which records are the most "impressive" and which records are "soft". The impressive records are ones where the plotted point is far below the best-fit line, and soft records are ones considerably above the best-fit line. The women seem to be less consistent - their points are generally farther from their best-fit line than the men's. One possible explanation is that more men have run the race, which would mean the best men's times have been whittled down more.

A couple soft records seem to be:
Women age 65
Men age 84-86, 89
A few impressive records seem to be:
Women age 66 and 68
Men age 87, 88, 90
It also appears that there are some kind-of-soft records:
Women - around age 20. As I wrote above, women in their teens and late 20's ran faster. Maybe the faster women around age 20 are on college racing/training schedules?
Men - mid-30's, because men aged 38-41 ran faster and there is no reason to think that extra speed comes with the 38th birthday.
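The soft/impressive comparison above can be sketched in code as the difference between each record and the best-fit line. This is only an illustration: the function names, the 5-minute threshold, and the data (a smooth curve with one slow record planted at age 40 and one fast record planted at age 60) are all my own assumptions, not the real records.

```python
import numpy as np

def residuals_from_fit(ages, times, degree=6):
    """Record time minus best-fit time at each age (positive = above the line)."""
    line = np.polynomial.Polynomial.fit(ages, times, deg=degree)
    return times - line(ages)

def flag_records(ages, times, threshold, degree=6):
    """Return (soft, impressive) ages using a hypothetical cutoff in minutes."""
    resid = residuals_from_fit(ages, times, degree)
    soft = [int(a) for a in ages[resid > threshold]]         # well above the line
    impressive = [int(a) for a in ages[resid < -threshold]]  # well below the line
    return soft, impressive

# Made-up smooth curve; NOT the real records.
ages = np.arange(6, 96)
times = 50 + 0.02 * (ages - 30) ** 2
times[ages == 40] += 10.0  # planted "soft" record
times[ages == 60] -= 10.0  # planted "impressive" record

soft, impressive = flag_records(ages, times, threshold=5.0)
```

With real data the cutoff would need some thought, since the women's points sit farther from their line than the men's do.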


This 2nd graph works the same as the first except I cut off men at age 83. I did this because the slow times at the oldest ages push the Y-axis so high that it's difficult to see the meat of the graph. I picked age 83 somewhat arbitrarily: 84 is the earliest age at which the men's record is slower than the slowest women's record (which is women age 77). If the graph has to go up to the women's age 77 time, we might as well include all of the men's times up to that point.

Also, on the 2nd graph the men's best-fit line is not exactly the same as on the 1st graph. This is because the men's age 84+ times are excluded from the calculation. It's close but not the same. For handicapping purposes, which I'll discuss in a post (hopefully next week), using the best-fit line on the 1st graph would be a better choice.
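Here is a small sketch of why the line moves when the oldest ages are dropped: the fit only "sees" the points it is given. The data is invented (a curve that rises steeply at the oldest ages), and the age-83 cutoff is the only thing taken from the graphs above.

```python
import numpy as np

# Made-up times that climb steeply at the oldest ages; NOT the real records.
ages = np.arange(6, 96)
times = 50 + 0.02 * (ages - 30) ** 2 + np.exp((ages - 70) / 6)

# Fit once on every age, and once on ages up to 83 only.
full_fit = np.polynomial.Polynomial.fit(ages, times, deg=6)
keep = ages <= 83
truncated_fit = np.polynomial.Polynomial.fit(ages[keep], times[keep], deg=6)

# Compare the two lines over the ages they share.
shared_ages = ages[keep]
gap = np.abs(full_fit(shared_ages) - truncated_fit(shared_ages))
max_gap = float(gap.max())  # largest disagreement, in minutes
```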


I further truncated the 2nd graph to develop the 3rd, in the same way I truncated the 1st to get the 2nd. Again, the best-fit line will not exactly match the above graphs because it considers only that graph's data. This graph shows the ages where the most competitive times are posted. I chose ages 16 and 55 as endpoints arbitrarily because that's where it appears the times start and stop being "real fast".
