How Open Messaging Improves Engineering Quality

I recently started a new job at a company that communicates on Slack instead of email. This may seem like a small difference, but in fact it is a paradigm shift. Put simply: email encourages private conversations, and Slack encourages public ones.

Most Slack conversations happen in public channels, while most emails happen in limited groups. Even when a Slack channel has just a few members, it is still public to the entire company. Emails, on the other hand, are usually totally private or restricted to groups that are technically public but not easily discoverable. (Note that making Slack channels private and emailing everyone all the time are both possible but atypical workflows.)

This openness actually improves engineering quality because it steers discussions towards convergence on the correct or most reasonable outcome. In any engineering organization, folks regularly examine and question past decisions. Discussions sometimes invalidate existing beliefs or designs, and that often means that someone had a misunderstanding. Even though good engineering cultures do not punish people for honest mistakes, it still isn’t fun to be wrong. So when the truth is ambiguous, even the best of us sometimes cling to our bogus theories to save face.

With email, resisting a changing situation or new facts can create a long and painful thread. The engineer who discovers a discrepancy and the original owner will go in circles without hearing each other, especially if the inquirer has less organizational power than the owner. The thread will continue looping in the shadows unless someone escalates. And since escalation can feel uncomfortable or confrontational, it often doesn’t happen. The conversation dissolves into the digital ether.

With an open messaging tool (usually Slack), engineers communicate in a town square, which incentivizes correctness over defensiveness. Critically, the town square contains the company’s most distinguished members. Folks with organizational power peruse the channels and take a moment to lend their privilege when they notice something important. While privilege-lending is usually discussed in the context of diversity, it also applies when those with organizational power (title, subject matter expertise, or tenure) put their influence behind someone who lacks those credentials.

This example (which is fabricated, please follow established migration patterns!) shows how a nudge from an organizational leader lends credibility to a newcomer. If leaders do this a few times, then the focus on merit persists even without explicit input, since their presence in the public square is enough.

Lastly, I’ll say that some readers may feel that this is more about big vs small tech rather than public vs private messaging. My hunch is that size and openness are inversely correlated, since companies face legal and compliance pressure to reduce their transparency as they grow. So, the exciting startups of today will need to navigate this tradeoff as they grow into the establishment of tomorrow. Hopefully they will consider the subtle benefits of embracing openness.

Consequences Create Complexity

As an engineering manager of a small team, I sometimes find myself trying to articulate why a solution is too “hacky”. Hackiness is a nebulous concept in software engineering, but it presents itself with some tell-tale signs:

  • Hacky solutions often have a much smaller time estimate than the original gut estimate from leadership
  • Hacky solutions are often praised by hackers but strike vague fear into the hearts of architects
  • Hacky solutions often provoke watery, quasi-religious counterarguments about doing the right thing in engineering

First, let me clarify what I mean by “hacker” and “architect”. In software engineering there is a spectrum between hacker and architect. Hackers derive solutions from experimentation with technology and a deep understanding of its properties. Their solutions are less concerned with best practices than with subtle affordances provided by the underlying technology. Architects derive solutions from software engineering best practices. They will find or create technology to implement these best practices, ignoring any shortcuts or hidden boons from technological details. This short post offers a bit more detail, but the point is that they have a yin-yang relationship. Most engineers skew to one side but do not like to be labeled as either.

In big tech companies, hacking is often associated with being junior while architecture is associated with being senior. This creates a pattern where a junior engineer presents a design with some amount of hackiness, and they receive endless pushback from senior teammates. The interesting part is that the arguments are usually not over whether to hack. Instead, they are about whether the proposal is even hacky. The challenge is coming to agreement on a hackiness appraisal.

Describing hackiness is not easy. Engineers sense hackiness with an intuition that has developed over years of experience. Distilling that feeling into words can manifest as clumsy objections about why the solution won’t scale. These points often feel too hypothetical and distant from the problem at hand. Hacker types respond to these criticisms with explanations of why the solution can be adjusted for high scale later on. They discount the future suffering from rebuilding a live system. And because software engineering has a very collaborative culture, the discussion can continue for a long time.

Similar to how “hacker” can be used as an insult, architects have their own kryptonite: the accusation of over-engineering. Over-engineering is a company killer, and it can be downright sinful. In its worst form, over-engineering is an indulgence by engineers who use company dollars to educate themselves in complex systems. As legendary technical manager Andy Grove points out, the tuition for learning from unguided mistakes is paid by the customer. Even without malice, over-engineering implies that an engineer cannot actually solve business challenges, relegating them to a career of ticket-wrangling without agency. A hacker can often deploy the over-engineering label to rebut criticisms of a quick-and-dirty solution.

Another complicating factor is the bias of each side. The architect and the hacker share a desire to succeed, but they have different success criteria. The hacker often evaluates themselves based on shipping prodigious amounts of code and getting features into production quickly. The architect often prizes behemoth design documents and execution of long-term plans. These desires can cloud their judgment of which approach is best for the problem at hand.

If only we could find some criteria to assess hackiness and spare everyone from this cycle! Unfortunately, the haziness of the term is baked into its definition:

Hacky: 2. (of a piece of computer code) providing a clumsy or inelegant solution to a particular problem.

If we squint at this definition, we can see that the wiggle room lies in the particular problem. This suggests that having a better way to define a problem can result in a shared understanding of its acceptable hackiness. Too often, engineers point to a problem’s complexity as justification for a commensurately complex solution. Instead, we should focus on its consequences.

The complexity of a solution is proportional to the problem’s consequences.

For a given problem, there is often a set of solutions with varying complexity. All of the solutions technically solve the problem, but only one is most appropriate for the consequences. Focusing on these consequences shifts attention from what is best in a general sense to what is best in this instance.

Consider parallel parking. Parallel parking is one of the most difficult parts of driving. Unskilled drivers commonly bump the adjacent cars as they maneuver into a spot. But based on the number of scuffed up cars in my Brooklyn neighborhood, people don’t seem to mind the bumps too much. If they did, then more drivers would either pay for a garage or equip their car with a bumper protector. Given the low consequence of a bump here and there, it seems reasonable that we solve parallel parking with some basic training and a brief exam.

Now, imagine that a parked car could easily explode when bumped. All else equal (bear with me on ignoring secondary effects), parallel parking would be a lot scarier. Giving teenagers a quick test and sending them onto the roads would be egregiously careless. Instead, we would pursue a more complex solution. Some that come to mind are automatic parallel parking, parallel parking as a service, or subsidizing construction of more parking garages.

Second, consider securing an asset from theft. If the asset is my favorite hacky sack, I might hide it in a shoe box under my bed. Nobody really wants the hacky sack, and the chance of a home invasion is pretty low. Now suppose that the hacky sack is actually a sack full of diamonds. I would have to upgrade to a defense-in-depth strategy that only an Ocean’s Eleven-style heist could compromise. Even then, I probably would not go as far as geosharding the diamonds, so the stash would still be vulnerable to natural disasters or an evil sovereign state.

When going through this exercise with your own examples, it’s important to remember two caveats:

  • This idea only applies to comparing solutions for a specific problem. What is hacky for one problem is over-engineering for another.
  • The problem needs to have at least two solutions to compare anything. Unsolved problems like time travel or problems with only one known solution do not apply.

Assessing Consequence

So if we can assess a problem’s consequence, then we can use that as a proxy for its acceptable hackiness. Broadly, we can consider a few different axes for assessing a problem’s consequences: scale, cost, permanence, and risk to life. In my experience, engineers tend to focus on scale while discounting the other three.

Scale: The expected usage of the solution, usually related to user engagement over a time interval. For most engineering projects, this is the most obvious factor. The most common problem I have seen is underestimating scale, especially because the business usually cannot tolerate halting operations when the anticipated scale is exceeded.

Cost: The amount of time, money, or human resources to be invested in a solution. Engineers are often very distanced from a project’s costs, the bulk of which may actually be their own employment!

Permanence: The degree to which the solution cannot be modified after its initial delivery. Permanence can be subtle and overlooked. The solution may need to support old versions of itself for many years, especially if dictated by business contracts. Additionally, deploying a new version into the wild can take a very long time (e.g. patching client-side software).

More cynically, note that permanence may be underestimated because the time horizon is longer than the engineer’s tenure at the company! I believe that this does happen, but blame lies more on the incentive structure of the organization than on the employee. This tangent probably deserves its own post.

Risk To Life: The likelihood that the solution’s failure could damage life (human, animal, or even plant). Many engineering projects have very low risk to life, but tail risk must be carefully considered. For many solutions, even a single fatality can be disastrous for the business (see Peloton treadmill death or Tide Pod deaths).

These four factors can be used to anchor a discussion about acceptable hackiness. Instead of devolving into a holy war about the right way to be an engineer, arguments can be framed in terms of the problem’s consequence profile. I suspect that this line of argument may be especially effective for convincing junior engineers because they usually have very little experience working on high-consequence problems. Pointing out the vast difference in consequence between at-home or academic projects and the solution at hand can illustrate the need for more rigor.

By the same token, these factors provide justification for a small startup’s fast-paced decisions. If the solution has few users, low cost, can be easily rewritten, and does not endanger lives, then by all means hack away!

Conclusion

The lens of consequentialism offers an informative way to assess solutions. In particular, the amount of redundancy in a system is a textbook tradeoff between cost and reliability. If a crypto enthusiast says you should use a Coinbase wallet and Vitalik Buterin says you should have multiple wallets, they can actually both be right. It really depends on what is at stake.

Next time you need to walk the tightrope between hackiness and over-engineering, consider the consequences! Hopefully you’re working on something impactful enough that failure will affect someone somewhere, but you probably aren’t playing Squid Game.

Thanks to Stephen Malina and Corinne Hardy for a careful review of this post.

DIY Coinbase Portfolio Metrics with AWS

TLDR: To track my Coinbase portfolio for free, I Dockerized an existing script that generates portfolio metrics and ran it continuously on AWS. Read on to learn about my motivations and the process. Code is located on Github. Thanks to Matias B for providing the original tool.

I generally have a good experience using Coinbase to HODL cryptocurrency, but it is completely missing portfolio metrics. Portfolio metrics may not seem important when you buy your first batch of crypto. But after compulsively refreshing the Coinbase app to watch crypto prices yo-yo, you will ask yourself a simple question.

Am I up or down?

At first, you may be able to answer this question before that bead of sweat fully forms on your forehead. Maybe you simply remember the price at which you bought the crypto. Maybe you find Coinbase’s transaction email notification. Maybe you learn about getting your transaction history via Coinbase Reports. One way or another, you’re able to determine whether you’re up or down, and life goes on.

That is, life goes on until your portfolio becomes more complicated. As you buy and sell various cryptocurrencies at various prices, this question becomes more elusive. Eventually, it will become impossible without some tooling.

Understanding your portfolio’s performance is really important for two reasons:

  1. You need to understand your unrealized gains/losses to profit from trading crypto assets.
  2. You need to include your crypto earnings in your taxes.

When I failed to find Coinbase portfolio metrics, I was convinced that I was missing a hidden tab in the Coinbase interface, stumped by a clunky user experience. Then I learned that I was not alone. I found multiple discussions about this hole in the crypto trading journey. It was pandemonium. Reddit power users lamented that noobs did not have enough braincells to instrument an auto-updating performance spreadsheet, while official Coinbase documentation shoved users towards a paid service called CoinTracker.

I discovered that CoinTracker offered a shiny interface out of the box, but most features would cost at least $15/mo, and the tax report would be at least $48 for anyone with over 25 transactions. Meanwhile, a Coinbase competitor called Voyager seemed to offer better metrics, but I did not feel like moving all of my crypto holdings just yet. Instead, I wanted to find a free, DIY solution to get some high-level portfolio metrics.

Eventually, I found a very detailed blog post offering a Python script that could read information from my Coinbase account and dump the results in a Google Spreadsheet. I was delighted to find that the whole thing worked as expected.

But those metrics were frozen in time, and I wanted to continuously update the data without entering a daily terminal command. While there are lots of ways to do this, I decided to use AWS to continuously update my portfolio metrics. This approach would not require my own computer to be online, and it would dovetail nicely with my separate goal of gaining more AWS experience (see prior post).

Continuously Updating Portfolio Metrics on AWS

To continuously update my Coinbase portfolio metrics, I would require just two components:

  • A scheduler that continuously triggered a portfolio update
  • An AWS function or module that would run the original script

For the scheduler, I chose to use AWS EventBridge. For running the script, I decided to package the code into a Docker container and run it with AWS Fargate on AWS Elastic Container Service. The whole setup looks like this:

This setup is pretty straightforward, but I will describe two more interesting pieces of the instrumentation in detail: improving the security of the script and debugging EventBridge failures.

Improving the Security of the Script

While the original script works, it has some security vulnerabilities that make it undesirable for the cloud:

  • It puts the Coinbase API Key and Secret directly into the Python code. You should generally never put sensitive information like passwords or keys directly in code. There are a lot of good reasons for this, and the link explains them better than I can.
  • It passes the Google Service Account credential to the script via a local file. Whoever possesses this credential can act as the service account. A file sitting on my local machine is reasonably well protected, but once it is baked into a Docker container, anyone who can pull the image can extract it. So, passing the credential via a file works locally but not on AWS.
  • The Google Service Account is given Project Owner permissions in the original instructions. This is a very powerful permission that allows the Service Account to do anything possible in Google Cloud. In the wrong hands, the credential could be used to spin up virtual machines and mine crypto, leaving you with a staggering credit card bill. Some lucky bandits would eventually benefit from this post, but that is beside the point.

The first two problems can be solved with environment variables and AWS Parameter Store. Environment variables allow a developer to place secret information outside of the code. The code can still dynamically fetch this information via an interface. In Python, the language used for this script, it looks like this:

import os

# Read the secrets from environment variables instead of hard-coding them.
key = os.environ.get('COINBASE_KEY')
scrt = os.environ.get('COINBASE_SECRET')
google_creds = os.environ.get('GOOGLE_COINBASE_CREDS')

When running the update script on your local machine, which is useful for development, you can simply make sure these variables are exported.

export COINBASE_KEY="XXXXXX"
export COINBASE_SECRET="YYYYYY"

On AWS, we can put the secret information into the Parameter Store. Then, when we set up our ECS Task Definition, we can use Parameter Store to pass the environment variables to the Docker container.
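
Roughly, loading the secrets into Parameter Store looks like the following with boto3 (a sketch; the parameter names and values are illustrative). The ECS Task Definition can then reference each parameter by ARN in its "secrets" section so the value shows up in the container as an environment variable.

import boto3

ssm = boto3.client('ssm')

# Store each secret as an encrypted SecureString (names and values are placeholders).
for name, value in [
    ('/coinbase-metrics/COINBASE_KEY', 'XXXXXX'),
    ('/coinbase-metrics/COINBASE_SECRET', 'YYYYYY'),
    ('/coinbase-metrics/GOOGLE_COINBASE_CREDS', '{"type": "service_account", ...}'),
]:
    ssm.put_parameter(Name=name, Value=value, Type='SecureString', Overwrite=True)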

To make passing these variables as secure as possible, we can restrict the System Parameter permissions given to the Task Execution role (which executes the Docker Container on ECS) to include only these specific parameters. This granularity is nice because it allows us to follow the principle of least privilege when setting up the ECS task role.

Note that to securely pass the Service Account credential, I dumped the entire JSON blob into a System Parameter value. This string can be used instead of passing a file to the code. Because I found this change somewhat tricky to get right, I think it is worth pasting the code sample:

import json
import os
from google.oauth2.service_account import Credentials

# The Service Account credential arrives as a JSON string in an environment
# variable rather than as a file on disk.
google_creds = os.environ.get('GOOGLE_COINBASE_CREDS')
scopes = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
service_account_info = json.loads(google_creds)
credentials = Credentials.from_service_account_info(service_account_info, scopes=scopes)

Remember, the final issue was the sweeping Project Owner role that was granted to the Service Account. It turns out that this role is totally unnecessary. It is a confusing story, but the permissions model used by Google Cloud (Google Cloud IAM) is completely separate from the permissions model used by Google Apps (Drive Sharing). The Service Account is actually given permissions to write to the Google Sheet via Drive Sharing (the little share button in the Docs UI). It will successfully write to the spreadsheet even with no IAM permissions at all! So, this issue is easily solved by giving the Service Account zero IAM permissions. This ensures that even if a hacker got ahold of the credential, the worst they could do is overwrite your Coinbase Portfolio spreadsheet. More details on these two permission models can be found here.

Debugging EventBridge Failures

After setting up my ECS Task Definition and EventBridge Rule, I sadly discovered that my spreadsheet was not being updated. I poked around in the EventBridge metrics, but all I could find was a graph of failed invocations over time.

This graph let me know that something was wrong, but it did not answer the obvious next question of why. As I dug around AWS and searched the web, I found that many users shared this consternation. Some claimed that scheduling a Fargate task from EventBridge was simply not possible. To make matters even more confusing, it seems that scheduling Fargate on EventBridge only got support in 2018, so it actually was impossible in the past.

Fortunately, I learned that AWS CloudTrail retains an event history, and I was able to find instances of the RunTask event that is triggered by the EventBridge scheduler.

Clicking one of the RunTask entries revealed a lot more detail about the event, including a very suspicious error message.

"errorCode": "InvalidParameterException",
"errorMessage": "Network Configuration must be provided when networkMode 'awsvpc' is specified.",

I learned that Fargate always runs with awsvpc network mode, and this means a corresponding network configuration must be provided in the EventBridge Rule (more details in this helpful post from Jimena Mora). It turns out that the EventBridge UI subtly hints that you must specify network configuration when using awsvpc, but it does not spell out that this always applies to Fargate, and it certainly does not use this information for form validation.
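
For reference, here is roughly what the rule and target look like when created with boto3 (a sketch: the schedule, ARNs, subnet, and security group are placeholders). The NetworkConfiguration block at the bottom is the piece that is easy to omit and that Fargate always requires.

import boto3

events = boto3.client('events')

# Placeholder ARNs -- substitute your own cluster, task definition, and role.
CLUSTER_ARN = 'arn:aws:ecs:us-east-1:123456789012:cluster/portfolio-metrics'
TASK_DEF_ARN = 'arn:aws:ecs:us-east-1:123456789012:task-definition/coinbase-metrics:1'
EVENTS_ROLE_ARN = 'arn:aws:iam::123456789012:role/ecsEventsRole'

# A rule that fires on a schedule (the interval is illustrative).
events.put_rule(Name='coinbase-metrics-schedule', ScheduleExpression='rate(1 hour)')

events.put_targets(
    Rule='coinbase-metrics-schedule',
    Targets=[{
        'Id': 'coinbase-metrics-task',
        'Arn': CLUSTER_ARN,
        'RoleArn': EVENTS_ROLE_ARN,
        'EcsParameters': {
            'TaskDefinitionArn': TASK_DEF_ARN,
            'LaunchType': 'FARGATE',
            # Fargate always uses the awsvpc network mode, so this is required.
            'NetworkConfiguration': {
                'awsvpcConfiguration': {
                    'Subnets': ['subnet-0123456789abcdef0'],
                    'SecurityGroups': ['sg-0123456789abcdef0'],
                    'AssignPublicIp': 'ENABLED',
                },
            },
        },
    }],
)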

And so finally, with a correct network configuration and securely passed variables, I saw my Coinbase portfolio metrics updating continuously! Mission accomplished.

Conclusion

This exercise taught me a lot about AWS debugging and Docker best practices. Some takeaways I had:

  • The AWS UI is pretty clunky, but it usually has what you want, somewhere…
  • Crypto taxes actually seem pretty complicated, so I would still consider paying for the CoinTracker tax report
  • The best way to avoid all of these shenanigans is to simply keep HODLing!

Finally, I should note that while my goal was to make a tool that was completely free, I have observed that the continuous updates seem to cost a few cents a day, so I did not entirely meet my goal. But, I think this is a good compromise for the learning I got and the decoupling of the job from my personal computer.

Postgres, Kafka, and the Mysterious 100 GB

Lately I’ve been trying to learn more about Amazon Web Services (AWS) and some other popular technologies. Since I work at a large tech company, I spend a lot of my professional time using proprietary software that is analogous to but very different from what most folks would use for the same task.

In this post, I’ll talk about a surprising challenge I encountered setting up an event-driven architecture proof-of-concept on AWS and how I debugged it. I think the experience shows that PoCs using “production infrastructure” can expose pitfalls that might appear in a real implementation. Please read on if you’re interested in hearing about how I accidentally paid AWS too much money and desperately tried to expose a nonexistent hacker.

In particular, I got interested in learning more about Apache Kafka, since it seemed like an effective way to communicate between micro-services. It turns out that Apache Kafka is offered on AWS as a managed service called MSK. And, since I learned that Kafka can siphon events from changes to a database via Kafka Connect, I decided to try the following setup:

  • Deploy an AWS RDS Postgres instance
  • Deploy an AWS MSK instance
  • Deploy a Kafka Connector that would send database change events to Kafka
  • Deploy a Kafka consumer ready to read database change events
  • Finally, observe that inserts to a database table result in events for the consumer

This seems like a lot of machinery, but an architecture diagram shows that it is relatively simple:

Note that MSK actually comes with Managed Connectors, but I had a hard time getting this to work (and it is significantly more expensive than running your own connector). So, I decided to use an open-source connector called Debezium, which supports Postgres out of the box.

After slapping together some tutorials, I had a local version of this architecture running (using Docker Compose to manage each of the containerized components). Then, I was able to create the same environment on AWS, as described in the diagram. Finally, in preparation for the Thanksgiving holiday, I turned down my ECS Services so that I could reduce charges. I left MSK and RDS up, since they took some time to configure properly.

The Problem Starts

I returned a few days later, located my saved CLI commands for turning up the ECS services, and started to pick up where I left off. Unfortunately, I noticed that the service tasks were failing because they could not connect to the database. I also observed that I could no longer connect to the database locally, so I went to the RDS dashboard. There, I discovered that my instance was in a non-responsive mode called “STORAGE_FULL”.

STORAGE_FULL meant that I had somehow filled up (and was currently paying for) 100 GB of disk space. I started to sweat, thinking that I had probably been hacked by hooligans who were turning my instance into some crypto mining machine. These hackers must have been pretty good, since my instance was in a Security Group that only allowed traffic from my IP address. But despite this precaution, I did find evidence of write traffic to the database during my break!

The next step was to inspect the instance itself and try to see if anything weird was there. It’s actually not even possible to check the instance until it has enough storage to get out of STORAGE_FULL mode. An AWS Support doc helped me give my database enough storage to log on and pointed me toward the culprit: the transaction logs. The corresponding graph clearly showed that my storage had been steadily eaten up by growing transaction logs.

The AWS docs go on to say that such an increase can be caused by replication slots, which “retain the WAL files until the files are externally consumed by a consumer” (more on the WAL soon). And their suggested query to reveal the source of the growing logs showed something very suspicious:
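
The query is roughly the following (the exact statement is in the AWS documentation):

psql=>select slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replication_slot_lag from pg_replication_slots;

In my case, the suspicious part of the output was a replication slot created for Debezium that was retaining all of that WAL.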

At this point, I could reasonably conclude that rather than hackers, my misuse of Debezium led to this issue. And, since the problem was ongoing, it would probably even fill the new storage I had allocated to the database. So, I had to move from discovering the issue to understanding it.

Why Was This Happening?

To understand why something related to Debezium was using up all of my storage for a replication slot, we need to take a step back and learn a little more about how the connection between Postgres and Debezium works.

Debezium gets Postgres change events by reading the Write-Ahead Log (WAL), a log of all recent changes to the data. The WAL allows the database to recover from failure, since changes are only applied to the data after they have been successfully captured in the WAL. By default, the WAL does not grow forever. Instead, Postgres intelligently recycles WAL segments after they have been included in a checkpoint (along with a number of other configurable factors described in WAL Configuration).

To guarantee that it processes every change event, the Kafka Connector requires WAL entries to persist until it has consumed them. Otherwise, the connector could fall so far behind that it would permanently lose certain events. Postgres tracks the connector’s progress with a replication slot specific to the connector, retaining any WAL that has not yet been consumed. So, doing something like turning off your Kafka Connector forever could result in your WAL events sadly piling up in the replication slot, hoping to one day be consumed…

And yet, why were new events getting added to the WAL at all, since the database was unused? And what was causing that mysterious write traffic? Fortunately, I found the answer to this final question after I carefully read Debezium’s section on WAL Disk Consumption. There, at the bottom of the section, a “tip” box explains that Amazon RDS itself performs small writes to the database on a regular schedule, even when the database is otherwise idle.

Just for confirmation, looking at my database traffic with one-minute granularity does reveal constant, small traffic every 5 minutes.

At last, we have complete information:

  • AWS RDS periodically writes to the database at all times
  • By default, a replication slot continues to grow until its contents are consumed by a client
  • Debezium requires a replication slot for getting events into Kafka
  • When Debezium goes down, the replication slot endlessly grows in size

The Fix

So now that we know the problem, how can we fix it? Clearly, we do not want to live in a world where a failed Kafka Connector can result in a complete outage of our database. This could be a significant risk for a production application.

Stopping the Bleeding

In the short term, I needed to get that disk space back. This is as straightforward as simply dropping the huge replication slot:

psql=>select pg_drop_replication_slot('debezium');

Note that turning Debezium back on and letting it catch up could also reduce the size of the slot.

Preventing A Recurrence

At a certain point, we have to be comfortable with the fact that if Debezium is down for long enough, then it will permanently miss some events. This is certainly better than bringing the whole database down. To enforce that tradeoff, we can put a ceiling on the WAL retained for a replication slot using max_slot_wal_keep_size. This was the solution I came up with after searching through all RDS params with “wal” in them, but it is also endorsed by a few other sources.
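
RDS does not allow ALTER SYSTEM, so the ceiling has to be set through the instance’s DB parameter group. A rough boto3 sketch (the parameter group name and the 5 GB value are illustrative; the setting itself requires Postgres 13+):

import boto3

rds = boto3.client('rds')

# Cap the WAL retained by any replication slot at ~5 GB (the value is in MB).
rds.modify_db_parameter_group(
    DBParameterGroupName='my-postgres-params',   # placeholder parameter group name
    Parameters=[{
        'ParameterName': 'max_slot_wal_keep_size',
        'ParameterValue': '5120',
        'ApplyMethod': 'pending-reboot',
    }],
)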

Finally, we could also bolster our defenses by setting an alert for whenever the transaction logs grow too large. This is described in more detail in this post.

Concluding Thoughts

I’m glad I went through this exercise because it helped me scratch the surface of maintaining production infrastructure for Postgres and Kafka. These systems are definitely not meant to be intermittently used during development. Instead, they expect to be hyperactive and online, as they should!

I was also reminded of the fact that any cloud provider always introduces an extra layer of complexity between the developer and the underlying technology. In this case, a special RDS property broke some assumptions I had about the database’s usage. This is often the case with managed services, where some management layer does not behave as you would expect.

Nonetheless, I would still use these technologies in a real application. They’re certainly robust, the debugging was not so hard, and neither was the proof-of-concept setup.

Generating Rock Climbing Route Descriptions with ML (GPT-2)

TL;DR

In this post I describe fine-tuning GPT-2, a powerful natural language model, for the task of generating fake rock climbing route descriptions. A climbing route description uses technical climbing jargon to describe a specific pathway up a rock face, and each one is unique. As seems to be common with GPT-2, I found that the model accurately captures the tone and specific language of the domain, but it sometimes generates climbing routes that would be physically impossible. It is also prone to repeating itself. At the bottom, I provide a Colab where anyone can download the fine-tuned model and try it out.

The Project

Since I created a rock climbing dataset a few months ago (described in this post), I’ve been thinking about ways to use it for home projects. Lately, I’ve been exploring concepts in NLP (Natural Language Processing) and trying to understand the state-of-the-art capabilities. Since the dataset contains a lot of textual information about climbing routes and logged climbs from the community, I figured there was an NLP opportunity.

At first, I considered trying to find climbing routes similar to an input route using BERT, a powerful natural language model that can be applied to a variety of applications. In a nutshell, BERT would transform rock climbing route descriptions into embeddings that can be mathematically compared. Then, a vector similarity search tool like ScaNN could use these embeddings to find the closest matches for a given route.

But, I realized that the route comparison problem was not a great fit for NLP for a few reasons:

  • Humans are already good at finding climbing routes using existing tools.
  • Comparing routes depends on many more factors than the description, like difficulty, location, and discipline (trad, sport, boulder, etc). Accommodating these factors would turn the project from an NLP exploration into a feature-engineering grind on tabular data.
  • The use case didn’t feel real. I have never actually had the urge to find similar routes to a specific climb.

Then, I remembered a joke I have with my climbing friends where we describe fake routes. This is funny because route descriptions are esoteric and full of quirks. For one thing, they generally have a matter-of-fact tone, even when they describe scary or dangerous sections. Sometimes the description notes something exciting or intimidating, but the written form simply cannot prepare a climber for the real thing. For example, consider the beginning of the description of a famous climb called Moonlight, which a friend of mine has recalled as “absolutely terrifying”:

“This is a delightful 5.6, almost as nice as Madame G’s or High E. The crux is as exciting as any 5.6 in the Gunks.

Besides the understated mentions of adrenaline-pumping sequences, route descriptions are also filled with climbing jargon. The technical language stands out starkly from everyday prose, which we can see by continuing on with Moonlight’s description.

The first pitch is easy to locate: Use the same access trail as for CCK, just past the Andrew boulder. At the top of the access trail you’ll see an obvious crack/corner system leading straight up. This is about 35′ left of the huge Erect Direction corner.

P1: Climb up the face just right of an overhang and head for an obvious anchor on some small trees. Then follow the corner up to a bolted belay at the GT Ledge. 5.5, 130′. Alternatively, you can start by climbing the crack behind the flake to the left, at 5.7+R, then continuing up the left-facing corner.

The text continues onward in a similar fashion. It’s mostly gibberish to non-climbers, but it’s trivial for my friends and me to synthesize it. Over time, we have become attuned to the subtleties of the structure, tone, and rhetoric of the descriptions. When we make one up, we are pattern matching on-the-fly to produce something that seems like the real thing but with a comical twist. As a concrete example, I have often described Ledgetopia, a fantasy route with significant injury risk on numerous ledges:

A classic at the grade, this route starts with delicate, unprotected face climbing off the deck. Access a large ledge and take a rest before a tricky sequence with marginal protection. With gear well below the feet, grab the lip of a sizable shelf and mantel up. Save some gas for a short but burly roof, eventually reaching another series of ledges before the chains.

It dawned on me that this practice is very similar to natural language text generation. In text generation, a model generates realistic, long-form text from a prompt. The field has attracted buzz in recent years since large, generic models like GPT-2 have achieved state-of-the-art performance on many canonical generative tasks (GPT-2 has since been superseded by the even larger and more powerful GPT-3). Most importantly for me, GPT-2 can be repurposed for any specific task using a process called fine-tuning. This means it can be taught to generate text in the style of a route description, thus making my friend group obsolete.

The Process

I decided to see if I could get GPT-2 to talk like a climber and generate fake route descriptions. First, I had to verify my expectation that GPT-2 would not create climbing routes out of the box. GPT-2 can be easily downloaded from huggingface.co, which includes simple sample code for trying it out. Here are some examples using the default model:

  • “A classic at the grade” —-> “A classic at the grade, an A in history, a G in history, and an A in writing. And yet those two things just don’t”
  • “Start on two crimps” —-> ”Start on two crimps to tighten it from its very point on. A small screwdriver that will stick it down.\n\n3) Use one”
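
For reference, the sample code amounts to just a few lines with the transformers library (a minimal sketch; the prompt and max_length are illustrative):

from transformers import pipeline

# The stock GPT-2 model, with no climbing knowledge yet.
generator = pipeline('text-generation', model='gpt2')
print(generator('A classic at the grade', max_length=30)[0]['generated_text'])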

So the initial hypothesis is correct, and we can safely conclude that GPT-2 does not speak like a climber by default. This makes perfect sense, since the climbing community is a small sliver of the internet. Next, I had to fine-tune GPT-2 to learn the ways of the climbing community. To do this, I collected 150K rock climbing route descriptions from a popular site called Mountain Project. I got rid of short descriptions, which are often boring, and split the rest into train and test data (90/10). Both consisted simply of text files where each route description had its own line.
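
The preprocessing was simple enough to sketch in a few lines (the file names and the length cutoff below are illustrative):

import random

# Hypothetical input file with one raw route description per line.
with open('descriptions.txt') as f:
    descriptions = [line.strip() for line in f]

# Drop short (usually boring) descriptions; the cutoff is illustrative.
descriptions = [d for d in descriptions if len(d) > 200]

# 90/10 train/test split.
random.shuffle(descriptions)
split = int(0.9 * len(descriptions))
with open('train.txt', 'w') as f:
    f.write('\n'.join(descriptions[:split]))
with open('test.txt', 'w') as f:
    f.write('\n'.join(descriptions[split:]))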

Now unfortunately, I do not have a GPU. Virtual machines with GPUs can be rented from many cloud providers, but they can cost a few hundred dollars a month! After digging around a bit, I found out that Google Colab actually offers free GPU access. So, as long as I could fit the training onto a single GPU and download my outputs, I could fine-tune for free.

For the actual fine-tuning code, I found a small example from huggingface using an open-source script and a much more complete example from Joey S on medium. This post basically walked through the fine-tuning procedure on Google Colab using an example dataset. With relative ease, I was able to reuse that code on my own data. Along the way I ran into a few interesting quirks/limitations:

  • The Colab GPU is sometimes shared with other users, which can lead to memory issues during training. You can ensure that the GPU has no existing utilization by running the code described in this post by Ori Bar-El.
  • The fine-tuning script allows the caller to train assuming the data has one example per-line or train assuming the data is a single blob. However, the line-by-line option does not work with GPT-2 (see Github thread). So instead of seeing each line as a discrete example, GPT-2 trains on blocks of text, often covering multiple descriptions in one example. As we will see, this causes GPT-2 to often output a few route descriptions given a prompt, where only the first one contains the prompt.

After about 2.5 hours of training time, GPT-2 finished fine-tuning with a perplexity score of 19.02. I’m no expert in the perplexity metric, but the score seemed decent because the tutorial achieved a perplexity of ~20, and in OpenAI’s GPT-2 post they describe achieving state-of-the-art on seemingly similar tasks with a perplexity of ~18. From my amateur perspective, the point was that the number seemed within a reasonable threshold such that some amount of fine-tuning probably took place. Later, I should probably go back and assess the perplexity score of the GPT-2 model on my test set before fine-tuning.
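
For context, the script derives perplexity directly from the evaluation loss, so the two numbers are different views of the same quantity (the loss value below is illustrative):

import math

eval_loss = 2.9455                  # average cross-entropy loss on the test set (illustrative)
perplexity = math.exp(eval_loss)    # ~19.02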

Results

Finally, I was able to get some generated text out of this thing, and the results were very surprising! Going back to the initial tests above:

  • “A classic at the grade” —-> ‘A classic at the grade!  The crux moves are the start of the climb.  It starts with some thin crimps and then eases up to the top. The route name is a reference to the famous ‘Bolt Line’ on the wall above.”’
  • “Start on two crimps” —> “Start on two crimps to the left of the first bolt. Move up to the sloping shelf, then up to another sloping shelf. Top out on good holds.”

The anecdotes suggest pretty good understanding of rock climbing lingo! As I explored some more anecdotes, I even found that the model would generally keep descriptions within certain archetypes. Some examples include (prompts are underlined):

Boulder Problem

Start on two heinous crimps. From here, move up to the jug on the right and then up to the topout.”

Multi-pitch Climb

This route follows a wandering line so bring long runners.  There are some loose rocks in the rock, so bring some small gear.  There is a large ledge at the top to belay, so bring long runners.  There are two bolts on top. Pitch 1:  Follow the bolt line up and left on easy ground.  The crux is at the second bolt.  There are some loose flakes at the top.  There is another bolt on top.  Pitch 2:  Climb straight up from the belay to the top of the pillar.  The route starts on the left of the pillar, then moves to the right on easy ground to reach the top of the pillar.  Belay from gear in a crack in a left-facing corner.  Pitch 3:  Follow the crack up to the top of the pillar.  It’s easy to get lost here.  There is a fixed line up and right on the pillar at the top.  The route is about 20 feet to the left of the crack.  The crux is right before you reach the top.  The rock is good.  The top is a bit dirty, but not a problem.”

Old School Reminiscing Climber

Bob started climbing in the late ’80s with a friend who was a local. He had climbed the first half of the route, but it was only about 5 pitches, and it was the first time I ever climbed a 5.10a.  I thought it was a great route.  I think this was the first route I ever did at Red Rocks.  It is a fun, well- protected line that is not too difficult, but not very sustained or long.  I think it is a good warmup.”

Dangerous Route

A terrifying route, and a good warmup if you’re looking for a route that will keep you on your toes.  It’s a little runout on the upper half, but you’re not out of the woods either.”

These anecdotes demonstrate a pretty decent understanding of rock climbing, albeit with some mistakes.

Limitations

While the model seems to capture the tone of the climbing descriptions and roughly use the jargon correctly, it still has some limitations that clearly mark the descriptions as synthetic.

Repetition

Sometimes, the model starts repeating short phrases that are grammatically correct but odd. For example:

A lousy route with a bad landing, so don’t fall! The route climbs a thin crack system that goes straight up to a ledge with a large pine. The route starts with a short crack that leads into an off-width that leads to a ledge. The crux is a thin hand crack that leads into the corner and a ledge above. The crack then widens and becomes a chimney.  The chimney is the crux and the off-width is the crux and the off-width is the crux and the crack is offwidth.”

Despite the somewhat nonsensical repetition, the description does remarkably group off-width climbing (traditionally feared by the climbing community), multiple cruxes, and a bad landing under the label “lousy”. A further exploration would be to deviate from the training script with my own fine-tuning to attempt to reduce the repetition, possibly using techniques described in this post from Patrick von Platen.
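
Many of the techniques in that post are decoding-time settings that transformers exposes directly on generate(), so they can be tried without retraining. A sketch (the model path and parameter values are illustrative):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical local path to the fine-tuned model.
tokenizer = GPT2Tokenizer.from_pretrained('./climbing-gpt2')
model = GPT2LMHeadModel.from_pretrained('./climbing-gpt2')

inputs = tokenizer('A lousy route with a bad landing', return_tensors='pt')
outputs = model.generate(
    inputs['input_ids'],
    max_length=120,
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.92,              # nucleus sampling
    no_repeat_ngram_size=3,  # forbid any 3-gram from repeating
    repetition_penalty=1.2,  # penalize tokens that were already generated
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))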

Logical Fallacy

I noticed that in many cases, the model describes grammatically correct phrases that don’t make sense for climbing. These mistakes are often subtle enough that a non-climber might not notice, but a climber can tell immediately. Some quotes include:

  • “…it was only about 5 pitches…but not very sustained or long.  I think it is a good warmup.” 5 pitches would generally be considered much too long for a warmup.
  • “Start up a short, steep corner with good handjams for pro.” A hand jam, though often described as secure, is not actually protection.
  • “I first climbed this 20 years ago when a large block was chopped off by a tree.” It seems unlikely that a tree would chop off a large block and more likely that a large block would potentially break a tree.

Try It Out

Initially, I had grand plans of hosting my model in a live application so that folks could try it out. But, I discovered that predictions were extremely slow on a CPU and still pretty slow on the Colab GPU. Hosting the server with one or more GPUs would be expensive, but I still created a Colab that you can try! The model is available via a publicly accessible Google Cloud Storage bucket, and the Colab uses it for inference.

Training Colab

If you want to dig more into my actual fine-tuning code, please see this Colab. Note that the data is not accessible, since I could not publish Mountain Project data off of their site per their terms of service. I have also left out the exact scraping code because I’m not sure about the rules there, but I describe the process in a prior post.

Understanding Rock Climbers With Data (Part 2)

In part 1 of this series, I described my goal of understanding the rock climbing community with data. I also summarized my process of collecting and cleaning about 4 million climbing records (ticks). After cleaning my tick dataset and converting it into a table of climbers, I attempted clustering on the climbers, only to find that climbers are generally sorted by how much they climb (or possibly how much they use Mountain Project).

I think it’s possible there are more interesting clusters to be found in the data, but the results mostly depend on how I convert from ticks to users. And my initial climbers table lacked a critical component: geospatial awareness. While each tick is labeled with a route and climbing area, the geographic locations of those areas are not considered. Instead, the areas are simply represented as strings. For example, the popular High E area is represented as “New York > Gunks > Trapps > High E” in the dataset. The fact that California and New York are physically far apart is not considered.

Geospatial awareness could have a lot of implications. For example, I suspect distance and frequency of travel could contribute to clustering climbers by income. Additionally, the remoteness of climbing destinations could be a defining feature. Many climbers in my own circle have only traveled to easy-access areas like Rumney, Red Rocks, and The Gunks. Few venture to less traveled destinations like Ten Sleep and The Devil’s Tower.

In this post, I’ll describe how I added geotags to my tick dataset and visualized that data.

Adding Geotags To The Dataset

As mentioned above, each climbing route belongs to an area. Fortunately, most climbing areas on Mountain Project are annotated with GPS coordinates. Since both routes and areas have unique URLs, the route URL can be used to resolve the area URL. And since my ticks table contains route URLs, I was able to write a second Scrapy crawler that finds the GPS coordinates for each route. You can force your crawler to examine a specific set of URLs by using the start_requests method.
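
A stripped-down sketch of that second crawler (the spider name, URLs, and parsing logic are placeholders):

import scrapy

class AreaCoordinatesSpider(scrapy.Spider):
    name = 'area_coordinates'

    # Route URLs already present in the ticks table (placeholder value here).
    route_urls = ['https://www.mountainproject.com/route/<route-id>/<route-name>']

    def start_requests(self):
        # Visit exactly the URLs we already have instead of crawling broadly.
        for url in self.route_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Resolve the parent area page and scrape its GPS coordinates here.
        ...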

This crawler ran on my laptop for about four hours and eventually produced a CSV where each row was a route URL, area URL, and GPS coordinates. Areas for which GPS coordinates were missing were simply skipped.

Finally, I joined this areas CSV with my ticks CSV using the Pandas Dataframe merge function (inner join to make sure all remaining rows had GPS coordinates). This process, combined with the cleaning described in Part 1, left me with almost 3 million ticks. While this is substantially less than 4 million, it is still plenty to play with.
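
The join itself is a one-liner (the file and column names here are illustrative):

import pandas as pd

ticks = pd.read_csv('ticks.csv')
areas = pd.read_csv('areas.csv')  # route URL, area URL, latitude, longitude

# An inner join drops any tick whose route has no GPS coordinates.
geo_ticks = ticks.merge(areas, on='route_url', how='inner')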

Exploring Geotagged Data

After achieving my first goal of geotagging ticks, I quickly realized that exploring geospatial data is hard without data visualization. Even simple questions, like whether a tick resides in a certain state, require a lot of effort. The data becomes much more useful when displayed on a map. After exploring a few other mapping options like Matplotlib’s Basemap and Leaflet, I discovered gmaps, a Jupyter plugin that uses the Google Maps API.

gmaps comes with some sophisticated tools for visualizing geotagged data. The heatmap feature was most relevant for my use case, and within a few minutes, I was able to visualize my ticks on a map.
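
The core of the notebook code looks roughly like this (the API key and column names are placeholders):

import gmaps

gmaps.configure(api_key='YOUR_GOOGLE_MAPS_API_KEY')

# geo_ticks is the joined dataframe of ticks with GPS coordinates.
fig = gmaps.figure()
heatmap = gmaps.heatmap_layer(geo_ticks[['latitude', 'longitude']])
fig.add_layer(heatmap)
fig  # displaying the figure renders the map inline in Jupyter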

A heatmap of ticks globally. It is unsurprising that most ticks are in the United States, since that is Mountain Project’s target demographic. Small spots can also be found in Thailand and Greece, likely due to US tourism.
A closer look at the hotspot in Thailand. The climbers are visiting a popular peninsula called Phra Nang/Railay Beach.
A closer look at the hotspot in Greece. Climbers flock to the Island of Kalymnos, a world-class destination according to Mountain Project.

Early on, I noticed that without any adjustments, Colorado burned far too bright for the rest of the world. Gmaps recommends solving this problem by setting a ceiling on the intensity of the maximum peak. This cap allows other destinations to show.
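
In gmaps, that ceiling is a single attribute on the heatmap layer from the sketch above (the values here are illustrative):

# Cap the heat so Colorado doesn't wash out every other destination.
heatmap.max_intensity = 300
heatmap.point_radius = 5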

A comparison of heatmaps before and after implementing a maximum heat. In the top map, Colorado and a few other areas dominate. In the bottom map, other popular but not ridiculously popular areas show.

I decided to focus my exploration on North America (and the US in particular), since that’s where the vast majority of ticks are located. When the heatmap is placed alongside a map of the most popular climbing destinations, the correlation is obvious.

Exploring Geo Data Over Time

After creating an initial heatmap, I wanted to see what the data looked like over time. I was especially interested in seeing if migratory patterns could be visualized over the course of a year. I decided to create a heatmap for every week in 2019 and combine those images into a video.

In python, it’s easy to create a list of dates using datetime:

from datetime import datetime, timedelta
dates = []
datetime_object = datetime.strptime('Jan 1 2019', '%b %d %Y')
for i in range(52):
  dates.append(datetime_object)
  datetime_object += timedelta(days=7)

With that list of dates, the data can be sliced by applying a mask to the dataframe. For example:

mask = (df['Date'] > dates[0]) & (df['Date'] <= dates[1])
df_for_the_week = df.loc[mask]

For each of the 52 weeks, I created an image. Unfortunately this part was pretty laborious, since it is actually not possible to automatically download a gmaps map as an image (there is a Github FR for this). 52 is in the sweet-spot of possible but painful, so I decided to button mash instead of automating.

Next, I had to combine the images into a video. This was an interesting topic to research, since there are so many options. Many folks do this using paid software or half-baked, free tools online. Others use complicated video-editing applications. When I switched my searches to be more programmer friendly, I realized that this could be handled by ffmpeg, a powerful audio and video processing tool.

As it turns out, ffmpeg even has documentation dedicated to this problem, which they call slideshow. Add in some magical incantations to make the resulting slideshow compatible with QuickTime (see StackOverflow post), and you get the following command:

ffmpeg -framerate 1 -pattern_type glob -i '*.png' \
    -r 30 -pix_fmt yuv420p -vcodec libx264 \
    -s 984x384 output.mp4

This command synthesized my 52 images into a video, giving each image one second of playtime. I manually verified the order of the images, since I was a little skeptical of the glob keyword. As it turns out, I had a minor bug at first because the files were numbered, but ffmpeg orders them alphabetically (this will make 10.png come before 2.png).
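
A small rename script avoids the ordering problem by zero-padding the file names (a sketch; the directory name is illustrative):

import os
import re

# Zero-pad frame numbers so alphabetical and numeric order agree (02.png sorts before 10.png).
for name in os.listdir('frames'):
    match = re.fullmatch(r'(\d+)\.png', name)
    if match:
        padded = f'{int(match.group(1)):02d}.png'
        os.rename(os.path.join('frames', name), os.path.join('frames', padded))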

For some added drama, I did a little bit of editing in iMovie to add labels for each week of the year. This allows you to more easily skip around in the video to look at particular weeks. Some of the images get cut off at the bottom, which is an ffmpeg quirk that I have not solved yet.

A slideshow of tick heatmaps week-by-week for 2019.

The video is pretty fun to watch, and I noticed a few interesting things about climbing patterns:

  • While areas like Red Rocks and Colorado have climbing year-round, cold areas like Squamish and Rumney as well as hot areas like El Portrero Chico have distinct seasons.
  • Michigan, where I have yet to visit, seems to have an especially short climbing season.
  • Colorado and Yosemite have vastly more climbers than other areas. They are consistently the hottest parts of the map.

Conclusion

Ultimately, this exercise taught me a lot about data processing, web scraping, and data visualization. It also made me appreciate the difference between software engineering and data science in terms of deliverables.

After spending a few hours on this project, I quickly amassed a mess of PNGs, unfinished videos, and Jupyter blocks. The deliverable is essentially this post, which will eventually be hard to correlate back to my materials as they get lost over time. Meanwhile, in software engineering the deliverable usually is the software (with documentation). The materials that made the project can’t really be lost, since they are baked in.

The project also made me feel a little more skeptical about data visualizations in general. Bugs can easily creep in during all the data chopping, and the product cannot be rigorously tested like a software project. The author needs to compare the visualization with their expectations of what it ought to look like. There may be more thorough verification methods, which I plan on investigating.

I still haven’t actually used my tick dataset to predict anything, which was one of my original goals. I also haven’t deeply explored my dataset for interesting correlations (for example between max redpoint grade and years climbing), which could reveal truths about climbing more easily than a predictive model. When I have time again, I’ll dive back in and try another exploration.

Understanding Rock Climbers With Data (Part 1)

By now, being a software engineer and a rock climber is somewhat cliché. This archetype is usually explained by the cerebral similarities between tricky climbing sequences and programming problems. My more cynical theory is that many software engineers avoided traditional sports in their youth, and climbing offers a friendly path to adult fitness. Whatever the cause, climber-engineers are in luck because climbing produces a lot of interesting data, allowing these folks to combine hobbies.

The climbing community creates a lot of technology. Prominent sites like Mountain Project, SuperTopo, and Gunks Apps immediately come to mind, but app stores are also littered with less polished projects. Add in the online technical commentary about gear, and you have quite a collection. To add to the pile, I decided to see if I could understand the climbing community better with data.

Framing The Data Problem

As mentioned above, Mountain Project is probably the most-recognizable climbing app. It allows users to find climbing routes, review them, and contribute content like photos or helpful descriptions. It’s common to see folks at the base of a cliff scratching their heads and looking between their phone screen and the wall. These are climbers attempting to correlate a particular skinny tree on their phone with the real thing.

When climbers complete a climb, they can create a Mountain Project “tick”. The tick serves as a record of the climb, and it includes metadata like the location, difficulty, how the climber performed, and freeform notes. Ticks are available publicly (see mine), and climbers often use tick histories to search for partners.

I decided to capture some tick data and see what kind of questions I could answer about the climbing community. The analysis is ongoing, but here are some goals I have in mind:

Predicting Climbing Popularity

Climbing is exploding as a sport, with new gyms cropping up all over the United States. Many of these newcomers eventually explore outdoor crags. Predicting this outdoor traffic could help parks prepare for future demand. If I could predict popularity based on location, then I could also possibly avoid the crowds! Since ticks are associated with dates, this can be framed as a time series forecasting problem.

Recommending Climbing Partners

As I mentioned above, many Mountain Project users scour the app looking for partners based on ticks. This mundane task could potentially be automated with a recommender system. Recommender systems typically compare users using their choices on a platform (Spotify songs, Netflix shows, etc). We can model ticks as these choices, which enables us to recommend either climbing routes or climbing partners. Note that this assumes two very similar climbers would make good climbing partners, an assumption I am making based on experience.
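
To make the modeling concrete, here is a minimal sketch of the idea in Python, with hypothetical column names (user_id, route_id) standing in for whatever the scraped data actually contains. It builds a user-by-route matrix from ticks and uses cosine similarity to surface look-alike climbers; it is illustrative, not a pipeline I have run.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# One row per tick; user_id and route_id are assumed column names.
ticks = pd.read_csv("all_ticks.csv")

# User x route matrix: each cell counts a user's ticks on a route.
matrix = pd.crosstab(ticks["user_id"], ticks["route_id"])

# Compare users by the routes they tick.
similarity = pd.DataFrame(
    cosine_similarity(matrix), index=matrix.index, columns=matrix.index
)

def suggest_partners(user_id, n=5):
    """Return the n users whose tick histories look most like user_id's."""
    return similarity[user_id].drop(user_id).nlargest(n)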

Understanding Climbing Archetypes

If we can compare climbers by their tick histories, then we can also try to segment them. Businesses often perform cluster analysis on their customers to try and understand the different personas they attract. In my case, I just think it would be fun to see if I can find hard evidence of climbing myths like the Gumby, the Trad Dad, the Rope Gun, or the Solemn Boulderer.

Getting the Tick Data

So of course, first I had to get some tick data. Note that to respect Mountain Project’s terms of service, I am not posting the complete code I used to do this or the data itself.

Perusing the site, I noticed that each user’s ticks page has a handy “Export CSV” button, which downloads a CSV of their ticks! Using a powerful Python web scraping tool called Scrapy, I was able to cobble together a crawler that looks for ticks pages and downloads a CSV for each one. If you want to try this out, remember two things:

  • Be nice to the site you are scraping and use Scrapy’s AutoThrottle
  • Use Scrapy Jobs so that your crawler can pause and resume
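
Both tips map onto ordinary Scrapy configuration. A minimal sketch of the relevant settings (the spider name and paths are made up, and this is not my actual crawler):

# settings.py -- throttle politely based on observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # seconds; start conservatively
DOWNLOAD_DELAY = 1             # baseline delay between requests

# Run with a job directory so the crawl can pause and resume:
#   scrapy crawl ticks -s JOBDIR=crawls/ticks-run-1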

My crawler ran for about 4 days straight on my laptop, eventually completing with about 83K CSV files! Using my expert StackOverflow search skills, I found this answer to help me combine them into a single CSV file.
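
The linked answer isn’t reproduced here, but the gist is simple: keep the header from the first file and append the rows from the rest. A rough Python equivalent (file layout is an assumption, and in practice you would also want to tag each row with the user it came from):

import csv
import glob

paths = sorted(glob.glob("ticks/*.csv"))
with open("all_ticks.csv", "w", newline="") as out:
    writer = csv.writer(out)
    wrote_header = False
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                continue  # skip empty exports
            if not wrote_header:
                writer.writerow(header)
                wrote_header = True
            writer.writerows(reader)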

Tasting the knowledge ahead, I rushed to get this CSV into a queryable format. My first idea was to stage the file on Google Cloud Storage and import it into BigQuery for exploration. BigQuery supports this feature, so I thought this would be trivial, but I was naive about a major peril: data cleaning.

Cleaning the Data For Import

When BigQuery tries to ingest a CSV, it fails upon encountering errors (you can configure how many errors to allow before failure). These errors often point to a specific byte position in the file. When I hit one of these, I opened the file in vim (slow if the file is large) and jumped to that position with the goto command.

In my case, I learned that some of the ticks contained a carriage return character. This character is actually somewhat difficult to create on a Mac, but ultimately I was able to simply remove it from the file using vim regex commands. I got lucky: this was all that I needed for BigQuery to accept the data.
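
The vim approach worked, but for a file this size a tiny script is less painful. A sketch of the same fix in Python, assuming the only problem is stray carriage returns (streamed in chunks so the whole file never sits in memory):

# Strip carriage returns so BigQuery's CSV parser stops choking.
with open("all_ticks.csv", "rb") as src, open("all_ticks_clean.csv", "wb") as dst:
    for chunk in iter(lambda: src.read(1 << 20), b""):
        dst.write(chunk.replace(b"\r", b""))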

Exploring the Data

At long last, tick data was at my fingertips! I started by querying some fast facts to understand what I was working with:

  • 3,849,902 ticks
  • 114,992 routes
  • 85,143 users
  • 27,433 crags
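
For reference, counts like these are a single query once the table is loaded. A sketch using the BigQuery Python client (the project, dataset, and column names are assumptions):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      COUNT(*) AS ticks,
      COUNT(DISTINCT route_id) AS routes,
      COUNT(DISTINCT user_id) AS users,
      COUNT(DISTINCT crag) AS crags
    FROM `my_project.climbing.ticks`
"""
for row in client.query(query).result():
    print(dict(row))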

Next, I wanted to answer some questions along a variety of dimensions. Check out the captions for an assessment of each image.

Most popular crags. Unsurprisingly, Colorado dominates the list. The Gunks has likely made its way there due to proximity to New York City. Little Rock City is a popular bouldering crag, so there are probably many more ticks per capita. Lastly, Horseshoe Canyon Ranch likely gets a boost from the Horseshoe Hell event.

Ticks and users over time. Clearly the Mountain Project community has gotten a lot more active! This probably reflects overall trends in the climbing community.

How climbers split their time between the three main disciplines. Though the relative increases at the tails are not huge, they do show a tendency for sport climbers to stick to sport climbing and for trad climbers to be the most diversified. The bouldering extremes may be distorted by the fact that Mountain Project does not cater as well to bouldering.

Tick distribution across users. The vast majority have not made many ticks, while a few outliers have created a few thousand. It’s not clear whether this means many climbs go un-ticked, or the vast majority of climbs are completed by a small group.

Just for fun: a word cloud of text used in tick notes. Note that “OS” and “O” refer to onsighting (when the climber sends the climb on the first try, with no prior information).

Attempting Climber Segmentation

I decided that first I would see if I could discover climber archetypes. I chose this one first because it was fun and because BigQuery supports k-means clustering out of the box. While k-means isn’t the only clustering method or necessarily the best one for this task, I figured it was low hanging fruit.
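
For context, the “out of the box” part is BigQuery ML: you point a CREATE MODEL statement at a feature table and the clustering happens server-side. A rough sketch through the Python client, assuming a per-climber feature table like the one described below (the names are placeholders, not my actual query):

from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE MODEL `my_project.climbing.climber_clusters`
    OPTIONS (model_type = 'kmeans', num_clusters = 3) AS
    SELECT num_ticks, num_trad, num_sport, num_boulder, max_grade
    FROM `my_project.climbing.climber_features`
""").result()

# Inspect the per-feature centroid values that BigQuery reports.
centroids = client.query("""
    SELECT * FROM ML.CENTROIDS(MODEL `my_project.climbing.climber_clusters`)
""").result()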

The first problem I encountered was that I had a table of ticks but I wanted to cluster users. I needed a way to map ticks to users. Based on some research and advice from friends, there is actually no standard procedure for this.

In one example, companies are clustered based on stock data by creating a column for every day, where the values are each company’s change in stock price. When I looked at the RFM technique, commonly used for user segmentation, I found that “categories may be derived from business rules or using data mining techniques to find meaningful breaks.” In this assessment of Instacart users, Dimitre Linde describes the features he builds from the purchase data. It seems like the real art of clustering comes from the feature extraction.

I decided that I wanted to understand climbers by both how often they do different activities and which activities they do. I also thought about the personalities I suspected and tried to tailor the columns to them. Ultimately I settled on the following categories:

  • months climbing
  • number of ticks
  • number of ticks on a weekday
  • number of ticks on a weekend
  • number of trad climbs
  • number of sport climbs
  • number of boulder routes
  • number of multipitch climbs
  • number of bigwall climbs
  • hardest grade climbed
  • mean grade climbed
  • number of locations they ticked
  • number of states they climbed in
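
A hedged sketch of how a few of those columns could be computed from the combined tick CSV with pandas. The column names (“Date”, “Route Type”, user_id) are assumptions about the export format, only a subset of features is shown, and real route types need fuzzier matching than exact equality:

import pandas as pd

ticks = pd.read_csv("all_ticks.csv", parse_dates=["Date"])

def climber_features(group):
    """Collapse one user's ticks into a single feature row (subset shown)."""
    weekday = group["Date"].dt.dayofweek < 5
    return pd.Series({
        "months_climbing": (group["Date"].max() - group["Date"].min()).days / 30.0,
        "num_ticks": len(group),
        "num_weekday": int(weekday.sum()),
        "num_weekend": int((~weekday).sum()),
        "num_trad": int((group["Route Type"] == "Trad").sum()),
        "num_sport": int((group["Route Type"] == "Sport").sum()),
        "num_boulder": int((group["Route Type"] == "Boulder").sum()),
    })

features = ticks.groupby("user_id").apply(climber_features)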

Note that to make climbing grades comparable, I converted from the US system (which has numbers and letters) to the Australian system (which has ordered numbers) using this chart. Bouldering grades can also be converted to this system.
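
The conversion itself is just a lookup table. A truncated sketch, with approximate values from memory (charts disagree by a grade here and there, so use a real one for the full mapping):

# Approximate YDS -> Australian (Ewbank) conversions; truncated for illustration.
YDS_TO_EWBANK = {
    "5.9": 17,
    "5.10a": 18,
    "5.11a": 21,
    "5.12a": 24,
    # ...and so on, plus V-grades for boulder problems
}

def to_ewbank(yds_grade, default=None):
    """Map a US grade string onto the ordered Australian scale."""
    return YDS_TO_EWBANK.get(yds_grade, default)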

Unfortunately, my results were somewhat disappointing. I performed k-means clustering for 2, 3, and 4 clusters, but in all cases, the clusters clearly broke down by climbing time. BigQuery shows the centroid value for each feature, allowing me to get a feel for the meaning of the clusters.

Centroids for clustering with three clusters. The centroids break down by “small”, “medium”, and “large” for every feature.

It’s still not clear whether this was due to poor feature development or whether this is genuinely the best way to segment climbers. After all, anecdotally, differences in volume do seem very meaningful among my climbing friends.

I tried a second experiment where I used percentages for the columns instead of absolute numbers, to eliminate the differences that come from simply climbing more. This time, users seemed to segment mostly by time multipitching and hardest grade.
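
Mechanically this just means dividing the count columns through by each climber’s total ticks. Continuing from the hypothetical features table above:

# Turn raw counts into shares of each climber's total ticks, so volume
# alone no longer separates the clusters.
count_cols = ["num_weekday", "num_weekend", "num_trad", "num_sport", "num_boulder"]
percentages = features[count_cols].div(features["num_ticks"], axis=0)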

Using percentages, we can see that the biggest discrepancies between centroids appear for percent bigwall, percent multipitch, and max rating. This could be a glimmer of “the sport climber”, “the trad dad”, and “the all-around climber”, but it is hard to say definitively.

Overall, I wouldn’t say clustering has yielded anything very meaningful yet. From my reading, clustering seems notoriously fickle, since it is totally unsupervised. Next steps would be to attempt PCA so that I can visualize these clusters and see how logical they look. I may also try to derive more complex features like “how far they travel” and “how often they go on climbing trips”.

Why I Care About Sending Rock Climbs

Whether on plastic or granite, I am a stickler about sending. If you hang around the gym, you’ll hear a lot of talk about sending this or that route. At first glance, some of these tales will surprise you. (ex: Johnny only started climbing a few months ago, and he just sent his first 5.12!) If this hearsay arouses your suspicion, you can rest assured that all is still right with the world. Johnny is using an incorrect definition of sending.

I don’t mean to diminish Johnny’s excitement, pride, or love for climbing. I’m psyched that he’s psyched! And yet, despite my efforts to ignore it, Johnny’s story bothers me like a tilted frame. It’s great that he feels accomplished, but his statement is incorrect.

Clearly, we need to establish a definition for “sending”. It’s more formal than other jargon like “shredding gnar”, “thuggy”, or “run out”. To borrow from Wikipedia, “to send” means “to cleanly complete a route. i.e. on-sight, flash, redpoint”. If those terms are unfamiliar to you, these definitions are helpful. For even more context, this thread offers a thorough discussion of the jargon.

After pondering the standard definition, a careful reader may notice that there is some wiggle room: “sending” can technically be accomplished on top-rope or on lead. In practice, sending is generally considered possible only on lead, except in exceptional circumstances like team ascents of mammoth big walls [1]. In the sport climbing community, which cares the most about these distinctions, sending assumes lead climbing. If you ask around the crag, most climbers will echo this opinion.

I’ve talked with some folks who don’t agree with attempting to define the terms at all. Why not let everyone climb however they want and leave them alone? To be clear, defining terms is totally disjoint from people’s enjoyment. You can climb however you want, and you can label those climbs however you want. But climbing, like any sport, uses established rules to make sense of achievements. Everyone is allowed to play soccer by picking up the ball, but most people derive meaning from it by playing within the rules [2].

And playing by some rules is harder than playing by others. Leading is harder mentally and physically than top-roping. Finishing a route without breaks is harder than with them, and doing that on the first attempt is even harder. These objective truths are the seed from which terminology is born. As climbers aim to capture and compare their achievements, they invent jargon to distinguish one climb from another.

I subscribe to the rules of this game and play it to the best of my ability (which is relatively mediocre) to keep climbing interesting. Otherwise, it starts to resemble my past athletic hobbies like running and weight lifting. The goal becomes maintaining fitness, and that isn’t incentive for me to commute to the gym. Sending instills purpose into my workouts. Projecting (repeating a route many times until finally sending) creates the illusion that my climbing is actually going somewhere.

I call this an illusion because anyone who asks “why” enough times will realize that their climbing is not actually going anywhere. A talented few climbers will learn that “why” even erodes explanations like money or fame. These “whys” try to bog climbers down in a nihilist depression, so everyone needs their own armor against them. For me, giving mystical significance to sending compels me to get out and touch some rocks! I’m open to other methods, but empirically this one appears the most successful for climbers over time.

Induction into the cult of the send isn’t a guarantee for happy climbing. Weighing a project too heavily can actually make climbing less fun and more stressful. I use most of my time off for climbing trips, and I sometimes reflect on the absurdity of calling a few days of fatigue, nervousness, and wild excitement a vacation. These trips require a delicate balance of two juxtaposed ideas: intense desire for the goal and the journey as an end in itself. I enjoy the exercise of managing my expectations, and I don’t really have a choice. Removing the pressure from climbs morphs it into a beach vacation, which quickly leaves me bored.

My point is that if Johnny ever starts to lose interest in climbing by his own rules, sending can be a tool to reinvigorate him. Partaking in the sport as defined by the community can fuel focused training that would be foolish for a casual climber. If the carrot is personal accomplishment and the stick is getting to the gym, then sending is the biggest carrot I know. At your next session, if you notice Johnny acting aloof, mumbling about his skin, and miming crimp movements in a trance, then you’ll know that he has joined the cult. Never mind his newly acquired symptoms of enlightenment.

[1] For example, professional climber Kevin Jorgeson top-roped some of the easier pitches of The Dawn Wall, which he and fellow professional Tommy Caldwell sent together. Whether this style is truly a send is contested, and this (somewhat immature) thread debates that question.

[2] The soccer analogy is not my own; it was described by Chris Kalous, author of The Enormocast, in a contentious thread on the topic.

Rule #1 At The Gunks and Probability

If you climb, especially at The Gunks, you may have heard Rule #1 tossed around the crag. If you climb often, you may have even heard Rule #2. Combined, this counsel is meant to guide your choices on the rock:

Rule #1: Don’t fuck up and die — the rest are details.
Rule #2: If you do violate Rule #1, don’t take anyone else with you.

The Rules are generally attributed to Peter Darmi, an old-school Gunks local who established bold routes like Into Thin Hair 1; this quote comes from an article he wrote in 2005. Although I have only met Darmi twice, he has actually told me Rule #1 in conversation at the cliff top.

When I first heard The Rules as a budding trad climber, I often considered them on the wall. If I got too runout, would I violate Rule #1? If my anchor wasn’t exactly right, would I even violate Rule #2? Fortunately I did not violate either, but I never knew whether that was thanks to my skills or dumb luck.

And that is exactly the problem: The Rules validate outcomes and not decisions. Climbers spend a lifetime making tough decisions, but they cannot control outcomes. Sometimes bad decisions still lead to good outcomes and vice versa. If a climber repeatedly survives excessive runouts, is she a good climber? Using The Rules as a framework for success can reinforce negative habits.

All decisions are probabilistic by nature. This truth was crystallized for me when I read Nassim Taleb’s Fooled By Randomness, a book about probability in the markets. Though Taleb describes the world of financial trading, his lessons are extremely relevant to climbers.

Traders buy and sell products based on their market hypotheses. Some consistently make risky trades, winning big payoffs in the short term. They appear to be successful traders, and others attempt to model their behavior. But eventually, these traders lose a monumental amount of money on a single bad bet, referred to in the industry as “blowing up”. This dynamic mirrors climbing, which features a series of risk-based decisions and the threat of a catastrophic “blow up”.

Taleb’s strategy is more precise than “don’t blow up”. Instead, he maintains a portfolio where he does not believe a “blow up” is possible. This makes sense for any career trader, since the odds are high that at least some trades will go south. Good traders hedge their bets so that they cannot be ruined by any single failure. Good climbers do the same thing with redundant anchors, going in direct on two points, etc. If Taleb’s advice were translated for climbers, it would be something like:

Don’t enter a situation where you could fuck up and die.

But there is a caveat. The advice above prohibits lots of dangerous climbing sub-activities like free soloing, leading ice routes, and X-rated routes. Partakers in these pursuits out-maneuver this implication by explaining that their competence makes the activity as risky as something commonplace like driving a car. They maintain that the same decision has different risk for different folks, making it good for some and bad for others (e.g. choosing to free solo El Capitan is good for Alex Honnold and bad for me).

I think we can rationalize these activities but not by pretending that the potential for “blow up” is negligible. Stripping away redundancy makes these activities simply more dangerous than their better-protected counterparts. Even cars come with seatbelts and collapsible frames. Instead of downplaying the risk, folks need to ask themselves why they accept it.

The author hamming it up on a classic Gunks route called CCK (worth the risk).

Each of us has to decide what reward is worth the risk. I’ve decided that driving is important enough that I will take the chance. Honnold has decided that he would risk death for free soloing. For him, abandoning free soloing can actually be considered another form of “blow up”. The key insight is that “blow up” covers a broader set of outcomes than just death, and it’s different for everyone. Both of us can be rational people if we listen to our own interpretations.

I don’t want to dictate anyone’s decisions. I do want people to consciously make their risky choices. I love trad climbing, and I’m not stopping any time soon. Every time I rope up, there is a higher chance of “blow up” than if I stayed on the ground. When I contemplate a route from the base, I don’t delude myself into thinking that catastrophic failure is impossible. Instead, I dip my hands into my chalk bag, look up towards the sky, and remember that this is worth it.

Backend In the Frontend: Implementing Raft in JS

Background

Lately I’ve been studying consensus algorithms to bolster my understanding of distributed systems. Consensus algorithms achieve agreement on data that is replicated across many nodes.

Consider an online store. At any time, the store has a state that is defined by all of its transactions. The store keeps a transaction log that can be used to recreate that state. The log is replicated across multiple servers so that the store is robust against server failures. A single machine may go down at any time. But when it comes back online, it can use the log to recreate the store’s state.

Before new transactions can be added to the log, the machines must reach consensus on them. If each machine didn’t check with the others before adding something to the log, then their copies of the log would drift apart. Eventually, this would result in adverse behavior like the same widget being sold twice or a successful sale getting forgotten.

There are many consensus algorithms out there. I started with Paxos, since I knew that it powered Spanner. After reading through a whitepaper on implementing Paxos, I found it to be powerful but also relatively enigmatic for a distributed systems layperson.

As it turns out, I was not alone in my confusion. Researchers at Stanford who shared my frustration with Paxos’s complexity created a simpler consensus algorithm called Raft. Raft breaks consensus down into more discrete steps that can be easily understood. I decided that it would be fun and helpful for my understanding to implement Raft in JavaScript using their whitepaper. Note that the Raft page has a different implementation made by the paper’s author.

This post doesn’t go into depth on how Raft works, but if you’re interested, I recommend checking out the Raft site. Instead, it displays the simulation I made and describes the underlying code.

TL;DR: The Product

Three nodes sit inside of an SVG graphic. Even though Raft would usually be implemented with five nodes, I chose three to optimize simplicity over robustness (remember folks, it’s not real). Raft can operate with just two nodes, but that assumes neither will ever go down.

At initialization, the nodes have no leader. When a leader is finally chosen via a timer-based election, it will remain in power unless manually turned down (i.e. clicked). In the absence of any new data, the leader will reaffirm its reign by periodically sending heartbeats to the other nodes.

Meanwhile, a client perpetually fades in and out of the edges of the graphic, periodically sending new data to the system. The client only speaks to the leader in this simulation, but Raft allows the client to speak to any node and be redirected to the leader.

When the client sends new data, the leader persists it and attempts to replicate it on the other nodes. Once a majority of nodes report back that they have replicated the data, the leader will commit it. At this point, the leader could report success back to the client, but that is not implemented.

Clicking a node will turn it down until it is clicked again. This allows you to interact with the simulation and explore some specific states. You can also pause to drop (click on) messages.

What I Learned

While reading the whitepaper can give a decent understanding of Raft, implementing the algorithm crystallizes its nuances. Faking components of the networking stack also made me think a lot more about building blocks that I normally take for granted, like request-response protocols.

Observing Values Purged From a Replica’s Log

One of the most complex situations in Raft is when a replica has a set of log entries that differ from the committed entries at those indices. This happens in the following scenario:

  1. The leader receives some new values but is not able to replicate them on a majority of replicas. So, the values are uncommitted.
  2. The leader goes down.
  3. A new leader is chosen, and it commits a handful of values. Now, at some indices i through k, the old leader has bogus values.

Replica 2 recorded a few values as leader but was not able to commit them. After it was turned down, Replica 0 committed a bunch of new values.

When the old leader comes back online, it will attempt to maintain its leadership reign. Instead, it will learn about the new leader, purge all of its outdated values, and replace them with the committed values. Watching this in action is surprisingly satisfying!

Replica 2 has been corrected. It purged the outdated values and replaced them with Replica 0’s values.

Raft Is For Fault-Tolerance, Not Performance

At first glance, replacing a single machine with a Raft consensus network may seem like a way to boost an application’s performance. In fact, it is the exact opposite: Raft adds latency, CPU, and storage in exchange for robustness. In the Raft algorithm, only the leader can serve both reads and writes. And since the leader must commit a value before returning success to the client, responses are slowed. The extra CPU and storage costs come from the additional machines at work.

Raft Only Works Because of Trust

As I was implementing the algorithm, I thought about how odd it was that the leader would step down from power as soon as it had heard about a leader with a more recent election term. The leader respects this information no matter who tells it. Raft actually guarantees that the leader with the most recent election term (i.e. highest term number) is the true leader.

Raft won’t work without assurance that all of the replicas can be trusted. Algorithms for networks that may include rogue nodes are discussed via the Byzantine Generals Problem. Interestingly, both Paxos (the precursor to Raft) and the Byzantine Generals Problem were introduced by Leslie Lamport. Consensus algorithms without trust are considerably more complex than Raft, and they power peer-to-peer systems like the famous Bitcoin blockchain. (For those interested, Bitcoin is powered more specifically by a Proof-of-Work system called Hashcash.)

Difficulties in Testing Correctness

After just a few hours of work, I had an implementation of Raft that appeared to be “mostly correct”. But I quickly realized that without extensive unit test coverage (which I was not interested in writing for a learning exercise), it was nearly impossible to determine whether I had achieved real correctness. As I watched the simulation, I would sometimes observe odd behavior and note it for later investigation. After checking many of the edge cases in the system using a combination of the pause button and the turn-down feature, I believe I have handled most or all of them correctly.

Avoiding Derivative State

Thankfully, the Raft paper describes the necessary state that each replica must hold to function. To avoid a messy soup of instance variables, I derived everything else from those required variables. For example, a flag like isLeader can be computed on demand from the bare-bones information.

High-Level: The Architecture

The simulation uses object-oriented programming to represent the distributed system. The Replica class handles all logic related to a single node, and hence the bulk of the algorithm. Replicas are responsible for rendering themselves, handling incoming messages, and sending out messages. They also push updates to the TableUpdater class, which reflects those updates in the table.

The Message class is responsible for rendering a message, determining its component velocities, and marking itself dropped when clicked. Messages are created using factories to avoid passing repetitive parameters into each message constructor. The parameters are dynamically determined by simulation-wide constants (FPS, simulation size, etc).

Once a message is in-flight, it is handled by a singleton MessageManager. The MessageManager is responsible for moving messages, delivering messages, and removing dropped messages.

A singleton ClientManager creates and destroys Clients. The ClientManager is given an average client lifetime, a rate of client creation, and a maximum number of clients at any given time. I actually determined that one client was enough for the simulation, so some of these features are unused. Once a client is created, it periodically sends new data to the Raft replicas.

The Animation Loop

The simulation uses a standard animation loop [1]. It is updated frame-by-frame at a specified FPS. The simulation continuously polls the browser for new animation frames and handles them at the correct rate. Code (see original):

var FPS = 60;
var interval = 1000 / FPS;
var time = new Date().getTime();  // timestamp of the last handled frame

function draw() {
    window.requestAnimationFrame(draw);

    var now = new Date().getTime();
    if (now - time > interval) {
        time = now;
        // DO STUFF HERE (ex: myObj.handleFrame())
    }
};

$(document).ready(function() {
    window.requestAnimationFrame(draw);
})

The Messaging Layer

In the real world, the nodes in a distributed system would communicate over an established messaging protocol like RPC. And, they would sit on top of networking technologies that would route requests and responses to the right addresses. In the simulation, this all needs to be faked. The entities (clients or nodes) in the animation need to have unique addresses, and they need to send messages over an agreed-upon protocol.

Establishing Addresses

Each node’s “address” is its coordinates in the graphic. When an entity sends a message to another, the message determines its direction by breaking a predetermined speed (scalar) into x and y component vectors. Code (original here):

getComponentVelocities(sender, receiver) {
    // Vector from sender to receiver, in SVG coordinates (their "addresses").
    var dx = receiver.x - sender.x;
    var dy = receiver.y - sender.y;
    var theta = Math.atan(Math.abs(dy / dx));

    // Randomize the speed slightly so messages don't all arrive on the same frame.
    var v = this.v + (Math.random() * this.jitter);
    var vx = v * Math.cos(theta);
    var vy = v * Math.sin(theta);

    // Math.abs above drops the direction, so restore the signs here.
    if (dx < 0) { vx *= -1; }
    if (dy < 0) { vy *= -1; }
    return [vx, vy];
}

In order to add realistic randomness, jitter is added to each message’s velocity. This reduces the chances that messages arrive in the exact same frame.

Delivering Messages

On every frame, the MessageManager deletes dropped messages, moves in-flight messages forward, and delivers arrivals. A message has arrived when it overlaps with the recipient.

One interesting quirk here is that delivering a message can actually result in a new in-flight message. A naive message manager would iterate over its messages and remove arrivals as it goes. That will not work, since the message list can be altered while it is being traversed. In a real, multi-threaded backend, the messages would be a global variable protected by a mutex, and this naive implementation would deadlock when a handler tried to re-acquire the lock to schedule its reply.

Fortunately, we can get around the deadlock by storing the current messages in a temporary variable before handling them. Code (see original):

var tmpMessages = this.messages.slice(0);
var inFlightMessages = [];
this.messages = [];
tmpMessages.forEach(function(msg) {
    var receiver = this.receivers[msg.receiver];
    if (receiver.containsMessage(msg)) {
        receiver.handleMessage(msg);
        msg.cleanup();
    } else {
        inFlightMessages.push(msg);
    }
}.bind(this));

// Concatenation is essential, since this.messages may have
// been appended to since the reset above.
this.messages = this.messages.concat(inFlightMessages);

Handling Arrivals

Instead of inventing a new protocol, we can fake one with JS features. Each message has its own class that extends the base Message class (ex: AppendEntriesRequest). When an entity receives a message, it can use the message’s constructor to determine how to handle it. Code:

handleMessage(msg) {
    switch (msg.constructor) {
        case RequestVoteRequest:
            this.handleRequestVoteRequest(msg);
            break;
        // More cases left out for brevity.
        default:
            break;
    }
}

Correlating Responses With Their Respective Requests

Another messaging challenge I faced was that responses were totally disconnected from their respective requests. In most messaging frameworks (RPC, AJAX, etc), the caller has request information handy when processing the response. My solution was hacky but sufficient: each node stores a map from request ID to request, and responses contain the request ID. Code:

this.pendingRequests = {};

sendRequest() {
  var msg = new Message();
  this.pendingRequests[msg.id] = msg;
  this.messageManager.schedule(msg);
}

handleResponse(msg) {
  var request = this.pendingRequests[msg.requestId];
  // Do some processing...
  delete this.pendingRequests[msg.requestId];
}

Deployment

The simplest way to share this project was via iframe. I exposed the simulation through GitHub Pages, and you can find the iframe’s link here.

The code uses ES6 modules for organization. I learned that many minifiers can’t operate on multiple JS files or a single file made from modules. Ultimately, I used an npm module called rollup to combine the JS files and terser to minify the result. Before pushing a new version, I update the minified JS with a small bash script:

#!/bin/bash
# Bundle the ES6 modules into a single IIFE file, then minify it in place.
rollup main.js --file bundle.min.js --format iife
terser bundle.min.js -m -c -o bundle.min.js

Further Work

A number of further features could be added to the simulation:

  • Implementing group membership changes (nodes leaving and entering)
  • Responding to the client when its data has been committed
  • Giving the client the ability to perform reads and not just writes
  • Implementing snapshotting

Conclusion

After spending over a year writing exclusively C++, it was fun to combine thinking about the backend with coding in the frontend. Implementing from a whitepaper was also new for me, and the work helped me understand the material on a deeper level. Hopefully this visualization will help some folks learn about consensus!

[1] I learned about the easiest way to make a JS animation loop here.