Understanding Rock Climbers With Data (Part 2)

In part 1 of this series, I described my goal of understanding the rock climbing community with data. I also summarized my process of collecting and cleaning about 4 million climbing records (ticks). After cleaning my tick dataset and converting it into a table of climbers, I attempted clustering on the climbers, only to find that climbers are generally sorted by how much they climb (or possibly how much they use Mountain Project).

I think it’s possible there are more interesting clusters to be found in the data, but the results mostly depend on how I convert from ticks to users. And my initial climbers table lacked a critical component: geospatial awareness. While each tick is labeled with a route and climbing area, the geographic locations of those areas are not considered. Instead, the areas are simply represented as strings. For example, the popular High E area is represented as “New York > Gunks > Trapps > High E” in the dataset. The fact that a crag in California and a crag in New York are physically far apart is invisible in this representation.

Geospatial awareness could have a lot of implications. For example, I suspect distance and frequency of travel could contribute to clustering climbers by income. Additionally, the remoteness of climbing destinations could be a defining feature. Many climbers in my own circle have only traveled to easy-access areas like Rumney, Red Rocks, and The Gunks. Few venture to less traveled destinations like Ten Sleep and The Devil’s Tower.

In this post, I’ll describe how I added geotags to my tick dataset and visualized that data.

Adding Geotags To The Dataset

As mentioned above, each climbing route belongs to an area. Fortunately, most climbing areas on Mountain Project are annotated with GPS coordinates. Since both routes and areas have unique URLs, the route URL can be used to resolve the area URL. And since my ticks table contains route URLs, I was able to write a second Scrapy crawler that finds the GPS coordinates for each route. You can force your crawler to examine a specific set of URLs by using the start_requests method.
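
A rough sketch of what that second crawler can look like (the file name, column names, and CSS selectors here are placeholders rather than my actual code):

import pandas as pd
import scrapy

class AreaCoordinateSpider(scrapy.Spider):
    name = 'area_coordinates'

    def start_requests(self):
        # Visit exactly the route URLs from the ticks table instead of crawling the whole site
        route_urls = pd.read_csv('ticks.csv')['route_url'].unique()
        for url in route_urls:
            yield scrapy.Request(url, callback=self.parse_route)

    def parse_route(self, response):
        # Hypothetical selector: the route's parent area link in the page breadcrumb
        area_url = response.css('.breadcrumb a::attr(href)').getall()[-1]
        yield response.follow(area_url, callback=self.parse_area,
                              meta={'route_url': response.url})

    def parse_area(self, response):
        # Hypothetical selector: GPS coordinates rendered on the area page as "lat, lon"
        coords = response.css('.gps::text').get()
        lat, lon = (c.strip() for c in coords.split(','))
        yield {'route_url': response.meta['route_url'],
               'area_url': response.url,
               'latitude': lat,
               'longitude': lon}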

This crawler ran on my laptop for about four hours and eventually produced a CSV where each row was a route URL, area URL, and GPS coordinates. Areas for which GPS coordinates were missing were simply skipped.

Finally, I joined this areas CSV with my ticks CSV using the Pandas DataFrame merge function (an inner join, to make sure all remaining rows had GPS coordinates). This process, combined with the cleaning described in Part 1, left me with almost 3 million ticks. While this is substantially fewer than 4 million, it is still plenty to play with.
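
The join itself is a one-liner. A rough sketch, with hypothetical file and column names:

import pandas as pd

ticks = pd.read_csv('ticks.csv')               # one row per tick, including a route_url column
areas = pd.read_csv('area_coordinates.csv')    # route_url, area_url, latitude, longitude

# An inner join drops any tick whose route has no GPS coordinates
geotagged = ticks.merge(areas, on='route_url', how='inner')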

Exploring Geotagged Data

After achieving my first goal of geotagging ticks, I quickly realized that exploring geospatial data is hard without data visualization. Even simple questions, like whether a tick resides in a certain state, require a lot of effort. The data becomes much more useful when displayed on a map. After exploring a few other mapping options like Matplotlib’s Basemap and Leaflet, I discovered gmaps, a Jupyter plugin that uses the Google Maps API.

gmaps comes with some sophisticated tools for visualizing geotagged data. The heatmap feature was most relevant for my use case, and within a few minutes, I was able to visualize my ticks on a map.
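
Getting a first heatmap up takes only a few lines. A sketch, reusing the geotagged table from above (you supply your own Google Maps API key):

import gmaps

gmaps.configure(api_key='YOUR_GOOGLE_MAPS_KEY')

# One (latitude, longitude) pair per tick
locations = geotagged[['latitude', 'longitude']]

fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations))
fig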

A heatmap of ticks globally. It is unsurprising that most ticks are in the United States, since that is Mountain Project’s target demographic. Small spots can also be found in Thailand and Greece, likely due to US tourism.
A closer look at the hotspot in Thailand. The climbers are visiting a popular peninsula called Phra Nang/Railay Beach.
A closer look at the hotspot in Greece. Climbers flock to the Island of Kalymnos, a world-class destination according to Mountain Project.

Early on, I noticed that without any adjustments, Colorado burned so bright that the rest of the world barely registered. Gmaps recommends solving this problem by setting a ceiling on the intensity of the maximum peak. This cap allows other destinations to show.
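
In gmaps, that cap is a single attribute on the heatmap layer. A sketch; the values below are illustrative, not the ones I settled on:

fig = gmaps.figure()
heatmap = gmaps.heatmap_layer(locations)
heatmap.max_intensity = 100   # anything hotter than this renders at the same intensity
heatmap.point_radius = 5      # optionally shrink each point's footprint
fig.add_layer(heatmap)
fig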

A comparison of heatmaps before and after implementing a maximum heat. In the top map, Colorado and a few other areas dominate. In the bottom map, other popular but not ridiculously popular areas show.

I decided to focus my exploration on North America (and the US in particular), since that’s where the vast majority of ticks are located. When the heatmap is placed alongside a map of the most popular climbing destinations, the correlation is obvious.

Exploring Geo Data Over Time

After creating an initial heatmap, I wanted to see what the data looked like over time. I was especially interested in seeing if migratory patterns could be visualized over the course of a year. I decided to create a heatmap for every week in 2019 and combine those images into a video.

In Python, it’s easy to create a list of dates using datetime:

from datetime import datetime, timedelta
dates = []
datetime_object = datetime.strptime('Jan 1 2019', '%b %d %Y')
for i in range(52):
  dates.append(datetime_object)
  datetime_object += timedelta(days=7)

With that list of dates, the data can be sliced by applying a mask to the dataframe. For example:

mask = (df['Date'] > dates[0]) & (df['Date'] <= dates[1])
df_for_the_week = df.loc[mask]

For each of the 52 weeks, I created an image. Unfortunately this part was pretty laborious, since it is actually not possible to automatically download a gmaps map as an image (there is a GitHub feature request for this). 52 is in the sweet spot of possible but painful, so I decided to button mash instead of automating.
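
For reference, the per-week figures came out of a loop shaped roughly like this (reusing the dates list and the column names from the sketches above):

from datetime import timedelta
from IPython.display import display

for week_number, start in enumerate(dates, start=1):
    end = start + timedelta(days=7)
    week = df.loc[(df['Date'] > start) & (df['Date'] <= end)]
    fig = gmaps.figure()
    fig.add_layer(gmaps.heatmap_layer(week[['latitude', 'longitude']]))
    # Display the figure, then screenshot it by hand as e.g. week_01.png;
    # gmaps cannot export the map as an image programmatically.
    display(fig)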

Next, I had to combine the images into a video. This was an interesting topic to research, since there are so many options. Many folks do this using paid software or half-baked, free tools online. Others use complicated video-editing applications. When I switched my searches to be more programmer friendly, I realized that this could be handled by ffmpeg, a powerful audio and video processing tool.

As it turns out, ffmpeg even has documentation dedicated to this problem, which they call slideshow. Add in some magical incantations to make the resulting slideshow compatible with QuickTime (see StackOverflow post), and you get the following command:

ffmpeg -framerate 1 -pattern_type glob -i '*.png' \
    -r 30 -pix_fmt yuv420p -vcodec libx264 \
    -s 984x384 output.mp4

This command synthesized my 52 images into a video, giving each image one second of playtime. I manually verified the order of the images, since I was a little skeptical of the glob pattern. As it turns out, I had a minor bug at first because the files were numbered, but ffmpeg orders them lexicographically, which makes 10.png come before 2.png (zero-padding the filenames, e.g. week_01.png, avoids this).

For some added drama, I did a little bit of editing in iMovie to add labels for each week of the year. This allows you to more easily skip around in the video to look at particular weeks. Some of the images get cut off at the bottom, which is an ffmpeg quirk that I have not solved yet.

A slideshow of tick heatmaps week-by-week for 2019.

The video is pretty fun to watch, and I noticed a few interesting things about climbing patterns:

  • While areas like Red Rocks and Colorado have climbing year-round, cold areas like Squamish and Rumney as well as hot areas like El Potrero Chico have distinct seasons.
  • Michigan, where I have yet to visit, seems to have an especially short climbing season.
  • Colorado and Yosemite have vastly more climbers than other areas. They are consistently the hottest parts of the map.

Conclusion

Ultimately, this exercise taught me a lot about data processing, web scraping, and data visualization. It also made me appreciate the difference between software engineering and data science in terms of deliverables.

After spending a few hours on this project, I quickly amassed a mess of PNGs, unfinished videos, and Jupyter blocks. The deliverable is essentially this post, which will become harder and harder to correlate back to my materials as they get lost over time. Meanwhile, in software engineering the deliverable usually is the software (with documentation). The materials that made the project can’t really be lost, since they are baked in.

The project also made me feel a little more skeptical about data visualizations in general. Bugs can easily creep in during all the data chopping, and the product cannot be rigorously tested like a software project. The author needs to compare the visualization with their expectations of what it ought to look like. There may be more thorough verification methods, which I plan on investigating.

I still haven’t actually used my tick dataset to predict anything, which was one of my original goals. I also haven’t deeply explored my dataset for interesting correlations (for example between max redpoint grade and years climbing), which could reveal truths about climbing more easily than a predictive model. When I have time again, I’ll dive back in and try another exploration.

Understanding Rock Climbers With Data (Part 1)

By now, being a software engineer and a rock climber is somewhat cliché. This archetype is usually explained by the cerebral similarities between tricky climbing sequences and programming problems. My more cynical theory is that many software engineers avoided traditional sports in their youth, and climbing offers a friendly path to adult fitness. Whatever the cause, climber-engineers are in luck because climbing produces a lot of interesting data, allowing these folks to combine hobbies.

The climbing community creates a lot of technology. Prominent sites like Mountain Project, SuperTopo, and Gunks Apps immediately come to mind, but app stores are also littered with less polished projects. Add in the online technical commentary about gear, and you have quite a collection. To add to the pile, I decided to see if I could understand the climbing community better with data.

Framing The Data Problem

As mentioned above, Mountain Project is probably the most-recognizable climbing app. It allows users to find climbing routes, review them, and contribute content like photos or helpful descriptions. It’s common to see folks at the base of a cliff scratching their heads and looking between their phone screen and the wall. These are climbers attempting to correlate a particular skinny tree on their phone with the real thing.

When climbers complete a climb, they can create a Mountain Project “tick”. The tick serves as a record of the climb, and it includes metadata like the location, difficulty, how the climber performed, and freeform notes. Ticks are available publicly (see mine), and climbers often use tick histories to search for partners.

I decided to capture some tick data and see what kind of questions I could answer about the climbing community. The analysis is ongoing, but here are some goals I have in mind:

Predicting Climbing Popularity

Climbing is exploding as a sport, with new gyms cropping up all over the United States. Many of these newcomers eventually explore outdoor crags. Predicting this outdoor traffic could help parks prepare for future demand. If I could predict popularity based on location, then I could also possibly avoid the crowds! Since ticks are associated with dates, this can be framed as a time series forecasting problem.

Recommending Climbing Partners

As I mentioned above, many Mountain Project users scour the app looking for partners based on ticks. This mundane task could potentially be automated with a recommender system. Recommender systems typically compare users by their choices on a platform (Spotify songs, Netflix shows, etc.). We can model ticks as these choices, which enables us to recommend either climbing routes or climbing partners. Note that this assumes two very similar climbers would make good climbing partners, an assumption I am making based on experience.
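
I haven’t built this yet, but a toy sketch shows the shape of the idea: pivot ticks into a user-by-route matrix and compare climbers with cosine similarity (the climbers and routes below are made up, and this is only the similarity step, not a full recommender):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ticks table: one row per (user, route) tick
ticks = pd.DataFrame({
    'user': ['alice', 'alice', 'bob', 'bob', 'carol'],
    'route': ['High E', 'Madame G', 'High E', 'Madame G', 'Moonlight Buttress'],
})

# Users as rows, routes as columns, tick counts as the "choices"
choices = pd.crosstab(ticks['user'], ticks['route'])

# Pairwise similarity between climbers; a climber's nearest neighbors are candidate partners
similarity = pd.DataFrame(cosine_similarity(choices),
                          index=choices.index, columns=choices.index)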

Understanding Climbing Archetypes

If we can compare climbers by their tick histories, then we can also try to segment them. Businesses often perform cluster analysis on their customers to try and understand the different personas they attract. In my case, I just think it would be fun to see if I can find hard evidence of climbing myths like the Gumby, the Trad Dad, the Rope Gun, or the Solemn Boulderer.

Getting the Tick Data

So of course, first I had to get some tick data. Note that to respect Mountain Project’s terms of service, I do not post the complete code I used to do this or the data itself.

Perusing the site, I noticed that each user’s ticks page has a handy “Export CSV” button, which downloads a CSV of their ticks! Using a powerful Python web scraping tool called Scrapy, I was able to cobble together a crawler that looks for ticks pages and downloads a CSV for each one. If you want to try this out, remember two things:

  • Be nice to the site you are scraping and use Scrapy’s AutoThrottle
  • Use Scrapy Jobs so that your crawler can start and stop
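
Both of those boil down to a little configuration. A sketch; the spider name and job directory are placeholders:

# settings.py -- throttle requests politely (these values are just reasonable starting points)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Run the spider with a job directory so the crawl can be paused and resumed:
#   scrapy crawl ticks -s JOBDIR=crawls/ticks-run-1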

My crawler ran for about 4 days straight on my laptop, eventually completing with about 83K CSV files! Using my expert StackOverflow search skills, I found this answer to help me combine them into a single CSV file.
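
I won’t reproduce the answer here, but the pandas version of the idea is only a few lines (assuming all of the exports share the same header; 'tick_exports/' stands in for wherever the crawler saved its files):

import glob
import pandas as pd

# Stack every per-user export into one big ticks table
files = glob.glob('tick_exports/*.csv')
ticks = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
ticks.to_csv('all_ticks.csv', index=False)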

With a taste of the knowledge ahead, I rushed to get this CSV into a queryable format. My first idea was to stage the file on Google Cloud Storage and import it into BigQuery for exploration. BigQuery supports importing a CSV directly from Cloud Storage, so I thought this would be trivial, but I was naive to a major peril: data cleaning.

Cleaning the Data For Import

When BigQuery tries to ingest a CSV, it fails upon encountering errors (you can configure how many errors to allow before failure). These errors often refer to a specific position in the file. When I hit one of these errors, I opened the file in vim (slow if the file is large) and jumped to the reported position with the goto command.

In my case, I learned that some of the ticks contained a carriage return character. This character is actually somewhat difficult to create on a Mac, but ultimately I was able to simply remove it from the file using vim regex commands (a substitution along the lines of :%s/\r//g). I got lucky: this was all that I needed for BigQuery to accept the data.

Exploring the Data

At long last, tick data was at my fingertips! I started by querying some fast facts to understand what I was working with:

  • 3,849,902 ticks
  • 114,992 routes
  • 85,143 users
  • 27,433 crags

Next, I wanted to answer some questions along a variety of dimensions. Check out the captions for an assessment of each image.

Most popular crags. Unsurprisingly, Colorado dominates the list. The Gunks has likely made its way there due to proximity to New York City. Little Rock City is a popular bouldering crag, so there are probably many more ticks per-capita. Lastly, Horseshoe Canyon Ranch likely gets a boost from the Horseshoe Hell event.
Ticks and users over time. Clearly the Mountain Project community has gotten a lot more active! This probably reflects overall trends in the climbing community.
How climbers split their time between the three main disciplines. Though the relative increases at the tails are not huge, they do show a tendency for sport climbers to stick to sport climbing and for trad climbers to be the most diversified. The bouldering extremes may be distorted by the fact that Mountain Project does not cater as well to bouldering.

Tick distribution across users. The vast majority have not made many ticks, while a few outliers have created a few thousand. It’s not clear whether this means many climbs go un-ticked, or the vast majority of climbs are completed by a small group.

Just for fun: a word cloud of text used in tick notes. Note that “OS” and “O” are referring to onsight (when the climber sends the climb on the first try, with no prior information).

Attempting Climber Segmentation

I decided that first I would see if I could discover climber archetypes. I chose this one first because it was fun and because BigQuery supports k-means clustering out of the box. While k-means isn’t the only clustering method or necessarily the best one for this task, I figured it was low hanging fruit.

The first problem I encountered was that I had a table of ticks but I wanted to cluster users. I needed a way to map ticks to users. Based on some research and advice from friends, I found there was actually no standard procedure for this.

In an example where companies are clustering based on stock data, a column is created for every day, where the values are the changes in stock price for each company. When I looked at the RFM technique, commonly used for user segmentation, I found that “categories may be derived from business rules or using data mining techniques to find meaningful breaks.” In this assessment of Instacart users, Dimitre Linde describes the features he builds from the purchase data. It seems like the real art of clustering comes from the feature extraction.

I decided that I wanted to understand climbers by both how often they do different activities and which activities they do. I also thought about the personalities I suspected and tried to tailor the columns to them. Ultimately I settled on the following categories:

  • months climbing
  • number of ticks
  • number of ticks on a weekday
  • number of ticks on a weekend
  • number of trad climbs
  • number of sport climbs
  • number of boulder routes
  • number of multipitch climbs
  • number of bigwall climbs
  • hardest grade climbed
  • mean grade climbed
  • number of locations they ticked
  • number of states they climbed in

Note that to make climbing grades comparable, I converted from the US system (which has numbers and letters) to the Australian system (which has ordered numbers) using this chart. Bouldering grades can also be converted to this system.
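
As a rough sketch of this ticks-to-climbers conversion, covering a subset of the features (shown in pandas for brevity; column names like 'User', 'Route Type', and 'Rating Numeric' are stand-ins for my actual schema):

import pandas as pd

# Assumes a ticks DataFrame with one row per tick and a numeric grade column
climbers = ticks.groupby('User').agg(
    num_ticks=('Route', 'count'),
    hardest_grade=('Rating Numeric', 'max'),
    mean_grade=('Rating Numeric', 'mean'),
    num_areas=('Area', 'nunique'),
    num_states=('State', 'nunique'),
)

# One count column per discipline: num_trad, num_sport, num_boulder, ...
discipline_counts = (pd.crosstab(ticks['User'], ticks['Route Type'])
                     .add_prefix('num_')
                     .rename(columns=str.lower))
climbers = climbers.join(discipline_counts)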

Unfortunately, my results were somewhat disappointing. I performed k-means clustering for 2, 3, and 4 clusters, but in all cases, the clusters clearly broke down by climbing time. BigQuery shows the centroid value for each feature, allowing me to get a feel for the meaning of the clusters.

Centroids for clustering with three clusters. The centroids break down by “small”, “medium”, and “large” for every feature.

It’s still not clear whether this was due to poor feature development or whether this is genuinely the best way to segment climbers. After all, anecdotally differences in volume do seem very meaningful among my climbing friends.

I tried a second experiment where I used percentages for the columns instead of absolute numbers to eliminate the differences from simply climbing more. This time, users seemed to segment mostly by time multipitching and hardest grade.
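
Concretely, that just means dividing each count column by the climber’s total number of ticks before clustering. A sketch, reusing the hypothetical climbers table from the earlier snippet:

# Replace absolute counts with fractions of each climber's total ticks
count_columns = ['num_trad', 'num_sport', 'num_boulder']
climbers_pct = climbers.copy()
climbers_pct[count_columns] = climbers_pct[count_columns].div(climbers_pct['num_ticks'], axis=0)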

Using percentages, we can see that the biggest discrepancies between centroids appear for percent bigwall, percent multipitch, and max rating. This could be a glimmer of “the sport climber”, “the trad dad”, and “the all-around climber”, but it is hard to say definitively.

Overall, I wouldn’t say clustering has yielded anything very meaningful yet. From my reading it seems notoriously fickle, since it is totally unsupervised. Next steps would be to attempt PCA so that I can visualize these clusters and see how logical they look. I may also try to derive more complex features like “how far they travel” and “how often they go on climbing trips”.