
Understanding Rock Climbers With Data (Part 2)

In part 1 of this series, I described my goal of understanding the rock climbing community with data. I also summarized my process of collecting and cleaning about 4 million climbing records (ticks). After converting the cleaned ticks into a table of climbers, I attempted clustering, only to find that climbers generally sort by how much they climb (or possibly by how much they use Mountain Project).

I think it’s possible there are more interesting clusters to be found in the data, but the results mostly depend on how I convert from ticks to climbers. And my initial climbers table lacked a critical component: geospatial awareness. While each tick is labeled with a route and climbing area, the geographic locations of those areas are not considered. Instead, the areas are simply represented as strings. For example, the popular High E area is represented as “New York > Gunks > Trapps > High E” in the dataset. The fact that, say, California and New York are physically far apart is not captured anywhere.

Geospatial awareness could have a lot of implications. For example, I suspect distance and frequency of travel could contribute to clustering climbers by income. Additionally, the remoteness of climbing destinations could be a defining feature. Many climbers in my own circle have only traveled to easy-access areas like Rumney, Red Rocks, and The Gunks. Few venture to less traveled destinations like Ten Sleep and The Devil’s Tower.

In this post, I’ll describe how I added geotags to my tick dataset and visualized that data.

Adding Geotags To The Dataset

As mentioned above, each climbing route belongs to an area. Fortunately, most climbing areas on Mountain Project are annotated with GPS coordinates. Since both routes and areas have unique URLs, the route URL can be used to resolve the area URL. And since my ticks table contains route URLs, I was able to write a second Scrapy crawler that finds the GPS coordinates for each route. You can force your crawler to examine a specific set of URLs by overriding the start_requests method.
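
Here’s a rough sketch of that pattern (the spider name, file name, column names, and CSS selectors are placeholders for illustration, not my actual crawler):

import pandas as pd
import scrapy

class RouteAreaSpider(scrapy.Spider):
    name = 'route_areas'

    def start_requests(self):
        # Visit exactly the routes we care about, loaded from the ticks table
        route_urls = pd.read_csv('ticks.csv')['route_url'].unique()
        for url in route_urls:
            yield scrapy.Request(url, callback=self.parse_route)

    def parse_route(self, response):
        # Selectors are hypothetical; the real page structure must be inspected
        area_url = response.css('a.area-link::attr(href)').get()
        yield response.follow(area_url, callback=self.parse_area,
                              cb_kwargs={'route_url': response.url})

    def parse_area(self, response, route_url):
        coords = response.css('td.gps::text').get()
        yield {'route_url': route_url, 'area_url': response.url, 'gps': coords}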

This crawler ran on my laptop for about four hours and eventually produced a CSV where each row was a route URL, area URL, and GPS coordinates. Areas for which GPS coordinates were missing were simply skipped.

Finally, I joined this areas CSV with my ticks CSV using the Pandas DataFrame merge function (an inner join, to ensure every remaining row had GPS coordinates). This process, combined with the cleaning described in Part 1, left me with almost 3 million ticks. While this is substantially fewer than 4 million, it is still plenty to play with.
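
In code, the join looked roughly like this (file and column names are illustrative):

import pandas as pd

ticks = pd.read_csv('ticks.csv')
areas = pd.read_csv('areas.csv')  # route_url, area_url, latitude, longitude

# An inner join drops any tick whose route lacks GPS coordinates
geotagged = ticks.merge(areas, on='route_url', how='inner')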

Exploring Geotagged Data

After achieving my first goal of geotagging ticks, I quickly realized that exploring geospatial data is hard without data visualization. Even simple questions, like whether a tick resides in a certain state, require a lot of effort. The data becomes much more useful when displayed on a map. After exploring a few other mapping options like Matplotlib’s Basemap and Leaflet, I discovered gmaps, a Jupyter plugin that uses the Google Maps API.

gmaps comes with some sophisticated tools for visualizing geotagged data. The heatmap feature was most relevant for my use case, and within a few minutes, I was able to visualize my ticks on a map.
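
The basic recipe is only a few lines (the API key is a placeholder, and the column names carry over from the join sketch above):

import gmaps

gmaps.configure(api_key='YOUR_API_KEY')

# locations is a table of (latitude, longitude) pairs, one row per tick
locations = geotagged[['latitude', 'longitude']]
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations))
fig  # renders the map inline in Jupyter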

A heatmap of ticks globally. It is unsurprising that most ticks are in the United States, since that is Mountain Project’s target demographic. Small spots can also be found in Thailand and Greece, likely due to US tourism.
A closer look at the hotspot in Thailand. The climbers are visiting a popular peninsula called Phra Nang/Railay Beach.
A closer look at the hotspot in Greece. Climbers flock to the Island of Kalymnos, a world-class destination according to Mountain Project.

Early on, I noticed that without any adjustments, Colorado burned so bright that the rest of the world barely registered. The gmaps documentation recommends solving this problem by setting a ceiling on the intensity of the maximum peak. This cap allows other destinations to show through.
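
Continuing the sketch above, the cap is a single attribute on the heatmap layer (the values here are illustrative and tuned by eye):

heatmap_layer = gmaps.heatmap_layer(locations)
heatmap_layer.max_intensity = 100  # clamp the hottest peaks
heatmap_layer.point_radius = 5
fig.add_layer(heatmap_layer)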

A comparison of heatmaps before and after implementing a maximum heat. In the top map, Colorado and a few other areas dominate. In the bottom map, other popular but not ridiculously popular areas show.

I decided to focus my exploration on North America (and the US in particular), since that’s where the vast majority of ticks are located. When the heatmap is placed alongside a map of the most popular climbing destinations, the correlation is obvious.

Exploring Geo Data Over Time

After creating an initial heatmap, I wanted to see what the data looked like over time. I was especially interested in seeing if migratory patterns could be visualized over the course of a year. I decided to create a heatmap for every week in 2019 and combine those images into a video.

In Python, it’s easy to create a list of dates using datetime:

from datetime import datetime, timedelta

# 53 boundary dates define the 52 one-week intervals of 2019
dates = []
datetime_object = datetime.strptime('Jan 1 2019', '%b %d %Y')
for i in range(53):
    dates.append(datetime_object)
    datetime_object += timedelta(days=7)

With that list of dates, the data can be sliced by applying a mask to the dataframe. For example:

# Keep ticks from the first week: dates[0] < Date <= dates[1]
mask = (df['Date'] > dates[0]) & (df['Date'] <= dates[1])
df_for_the_week = df.loc[mask]
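
Looping over consecutive pairs of boundary dates then produces one slice per week:

weekly_slices = []
for start, end in zip(dates[:-1], dates[1:]):
    mask = (df['Date'] > start) & (df['Date'] <= end)
    weekly_slices.append(df.loc[mask])  # 52 dataframes, one per week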

For each of the 52 weeks, I created an image. Unfortunately, this part was pretty laborious, since it is actually not possible to automatically download a gmaps map as an image (there is a GitHub feature request for this). 52 is in the sweet spot of possible but painful, so I decided to button-mash instead of automating.

Next, I had to combine the images into a video. This was an interesting topic to research, since there are so many options. Many folks do this with paid software or half-baked free tools online. Others use complicated video-editing applications. When I made my searches more programmer-friendly, I realized that this could be handled by ffmpeg, a powerful audio and video processing tool.

As it turns out, ffmpeg even has documentation dedicated to this problem, which it calls a slideshow. Add in some magical incantations to make the resulting slideshow compatible with QuickTime (see the StackOverflow post on this), and you get the following command:

ffmpeg -framerate 1 -pattern_type glob -i '*.png' \
    -r 30 -pix_fmt yuv420p -vcodec libx264 \
    -s 984x384 output.mp4

This command synthesized my 52 images into a video, giving each image one second of playtime. I manually verified the order of the images, since I was a little skeptical of the glob pattern. As it turns out, I had a minor bug at first: the files were numbered, but ffmpeg orders glob matches lexicographically, which makes 10.png come before 2.png.
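
One way to sidestep that bug is to zero-pad the file names so lexicographic and numeric order agree; a small sketch, assuming the frames live in a frames/ directory:

import os

for name in os.listdir('frames'):
    stem, ext = os.path.splitext(name)
    if ext == '.png' and stem.isdigit():
        # 2.png becomes 02.png, which sorts before 10.png
        os.rename(os.path.join('frames', name),
                  os.path.join('frames', f'{int(stem):02d}{ext}'))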

For some added drama, I did a little bit of editing in iMovie to add labels for each week of the year. This allows you to more easily skip around in the video to look at particular weeks. Some of the images get cut off at the bottom, which is an ffmpeg quirk that I have not solved yet.

A slideshow of tick heatmaps week-by-week for 2019.

The video is pretty fun to watch, and I noticed a few interesting things about climbing patterns:

  • While areas like Red Rocks and Colorado have climbing year-round, cold areas like Squamish and Rumney as well as hot areas like El Potrero Chico have distinct seasons.
  • Michigan, which I have yet to visit, seems to have an especially short climbing season.
  • Colorado and Yosemite have vastly more climbers than other areas. They are consistently the hottest parts of the map.

Conclusion

Ultimately, this exercise taught me a lot about data processing, web scraping, and data visualization. It also made me appreciate the difference between software engineering and data science in terms of deliverables.

After spending a few hours on this project, I quickly amassed a mess of PNGs, unfinished videos, and Jupyter blocks. The deliverable is essentially this post, which will become harder to trace back to those working materials as they get lost over time. Meanwhile, in software engineering, the deliverable usually is the software itself (with documentation). The materials that made the project can’t really be lost, since they are baked in.

The project also made me feel a little more skeptical about data visualizations in general. Bugs can easily creep in during all the data chopping, and the product cannot be rigorously tested like a software project. The author needs to compare the visualization with their expectations of what it ought to look like. There may be more thorough verification methods, which I plan on investigating.

I still haven’t actually used my tick dataset to predict anything, which was one of my original goals. I also haven’t deeply explored my dataset for interesting correlations (for example between max redpoint grade and years climbing), which could reveal truths about climbing more easily than a predictive model. When I have time again, I’ll dive back in and try another exploration.
