DIY Coinbase Portfolio Metrics with AWS

TL;DR: To track my Coinbase portfolio for free, I Dockerized an existing script that generates portfolio metrics and ran it continuously on AWS. Read on to learn about my motivations and the process. The code is on GitHub. Thanks to Matias B for providing the original tool.

I generally have a good experience using Coinbase to HODL cryptocurrency, but it is completely missing portfolio metrics. Portfolio metrics may not seem important when you buy your first batch of crypto. But after compulsively refreshing the Coinbase app to watch crypto prices yo-yo, you will ask yourself a simple question.

Am I up or down?

At first, you may be able to answer this question before that bead of sweat fully forms on your forehead. Maybe you simply remember the price at which you bought the crypto. Maybe you find Coinbase’s transaction email notification. Maybe you learn about pulling your transaction history via Coinbase Reports. One way or another, you determine whether you’re up or down, and life goes on.

That is, life goes on until your portfolio becomes more complicated. As you buy and sell various cryptocurrencies at various prices, the answer becomes more elusive. Eventually, it becomes impossible to work out without some tooling.

Understanding your portfolio’s performance is really important for two reasons:

  1. You need to understand your unrealized gains/losses to profit from trading crypto assets.
  2. You need to include your crypto earnings in your taxes.

When I failed to find Coinbase portfolio metrics, I was convinced that I was missing a hidden tab in the Coinbase interface, stumped by a clunky user experience. Then I learned that I was not alone. I found multiple discussions about this hole in the crypto trading journey. It was pandemonium. Reddit power users lamented that noobs did not have enough brain cells to set up an auto-updating performance spreadsheet, while official Coinbase documentation shoved users towards a paid service called CoinTracker.

I discovered that CoinTracker offered a shiny interface out of the box, but most features would cost at least $15/mo, and the tax report would be at least $48 for anyone with over 25 transactions. Meanwhile, a Coinbase competitor called Voyager seemed to offer better metrics, but I did not feel like moving all of my crypto holdings just yet. Instead, I wanted to find a free, DIY solution to get some high-level portfolio metrics.

Eventually, I found a very detailed blog post offering a Python script that could read information from my Coinbase account and dump the results in a Google Spreadsheet. I was delighted to find that the whole thing worked as expected.

But those metrics were frozen in time, and I wanted to continuously update the data without entering a daily terminal command. While there are lots of ways to do this, I decided to use AWS to continuously update my portfolio metrics. This approach would not require my own computer to be online, and it would dovetail nicely with my separate goal of gaining more AWS experience (see prior post).

Continuously Updating Portfolio Metrics on AWS

To continuously update my Coinbase portfolio metrics, I needed just two components:

  • A scheduler that periodically triggered a portfolio update
  • An AWS function or module that would run the original script

For the scheduler, I chose to use AWS EventBridge. For running the script, I decided to package the code into a Docker container and run it with AWS Fargate on AWS Elastic Container Service. The whole setup is simple: an EventBridge rule fires on a schedule and launches the containerized script as a Fargate task on ECS.

This setup is pretty straightforward, but I will describe two of the more interesting pieces of the setup in detail: improving the security of the script and debugging EventBridge failures.

Improving the Security of the Script

While the original script works, it has some security vulnerabilities that make it undesirable for the cloud:

  • It puts the Coinbase API Key and Secret directly into the Python code. You should generally never put sensitive information like passwords or keys directly in code. There are a lot of good reasons for this, and the link explains them better than I can.
  • It passes the Google Service Account credential to the script via a local file. Whoever possesses this credential can act as the service account. A credential file sitting on your local machine is reasonably private, but once it is baked into a Docker image, anyone who can pull that image can extract it. So, passing the credential via file works locally but not on AWS.
  • The Google Service Account is given Project Owner permissions in the original instructions. This is a very powerful permission that allows the Service Account to do anything possible in Google Cloud. In the wrong hands, the credential could be used to spin up virtual machines and mine crypto, leaving you with a staggering credit card bill. Some lucky bandits would eventually benefit from this post, but that is beside the point.

The first two problems can be solved with environment variables and AWS Parameter Store. Environment variables allow a developer to keep secret information out of the code, while the code still reads the values at runtime. In Python, the language used for this script, it looks like this:

import os

# Read the Coinbase API credentials and the Google service account JSON blob
# from environment variables instead of hard-coding them in the script.
key = os.environ.get('COINBASE_KEY')
scrt = os.environ.get('COINBASE_SECRET')
google_creds = os.environ.get('GOOGLE_COINBASE_CREDS')

When running the update script on your local machine, which is useful for development, you can simply make sure these variables are exported:

export COINBASE_KEY="XXXXXX"
export COINBASE_SECRET="YYYYYY"

On AWS, we can put the secret information into the Parameter Store. Then, when we set up our ECS Task Definition, we can use Parameter Store to pass the environment variables to the Docker container.
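For concreteness, here is a rough sketch of how such a task definition could be registered with boto3; the account IDs, ARNs, image, and parameter names are all placeholders, so adjust them to your own setup.

import boto3

ecs = boto3.client("ecs")

# Illustrative prefix for the SSM parameters that hold the secrets.
PARAM_PREFIX = "arn:aws:ssm:us-east-1:123456789012:parameter/coinbase-tracker"

ecs.register_task_definition(
    family="coinbase-portfolio-updater",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/coinbaseTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "updater",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/coinbase-updater:latest",
            "essential": True,
            # Each secret is injected as an environment variable at launch time;
            # the plaintext value never appears in the image or the task definition.
            "secrets": [
                {"name": "COINBASE_KEY", "valueFrom": f"{PARAM_PREFIX}/COINBASE_KEY"},
                {"name": "COINBASE_SECRET", "valueFrom": f"{PARAM_PREFIX}/COINBASE_SECRET"},
                {"name": "GOOGLE_COINBASE_CREDS", "valueFrom": f"{PARAM_PREFIX}/GOOGLE_COINBASE_CREDS"},
            ],
        }
    ],
)

The secrets block is the key part: ECS resolves each valueFrom reference when the task starts and hands the value to the container as a plain environment variable.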

To make passing these variables as secure as possible, we can restrict the Parameter Store permissions given to the Task Execution role (which launches the Docker container on ECS) to include only these specific parameters. This granularity is nice because it allows us to follow the principle of least privilege when setting up the ECS task role.
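Here is a sketch of the kind of scoped-down inline policy I mean, again with placeholder ARNs and a hypothetical role name:

import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ssm:GetParameters"],
            # Grant access only to the three parameters the task needs,
            # rather than every parameter in the account.
            "Resource": [
                "arn:aws:ssm:us-east-1:123456789012:parameter/coinbase-tracker/COINBASE_KEY",
                "arn:aws:ssm:us-east-1:123456789012:parameter/coinbase-tracker/COINBASE_SECRET",
                "arn:aws:ssm:us-east-1:123456789012:parameter/coinbase-tracker/GOOGLE_COINBASE_CREDS",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="coinbaseTaskExecutionRole",
    PolicyName="ReadCoinbaseTrackerParameters",
    PolicyDocument=json.dumps(policy),
)

If the parameters are stored as SecureStrings encrypted with a customer-managed KMS key, the role would also need kms:Decrypt on that key.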

Note that to securely pass the Service Account credential, I dumped the entire JSON blob into a System Parameter value. This string can be used instead of passing a file to the code. Because I found this change somewhat tricky to get right, I think it is worth pasting the code sample:

import json
import os
from google.oauth2.service_account import Credentials

# The entire service account JSON blob arrives via an environment variable,
# so it can be parsed directly instead of being read from a file on disk.
google_creds = os.environ.get('GOOGLE_COINBASE_CREDS')
scopes = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
service_account_info = json.loads(google_creds)
credentials = Credentials.from_service_account_info(service_account_info, scopes=scopes)

Remember, the final issue was the sweeping Project Owner role that was granted to the Service Account. It turns out that this role is totally unnecessary. It is a confusing story, but the permissions model used by Google Cloud (Google Cloud IAM) is completely separate from the permissions model used by Google Apps (Drive Sharing). The Service Account is actually given permissions to write to the Google Sheet via Drive Sharing (the little share button in the Docs UI). It will successfully write to the spreadsheet even with no IAM permissions at all! So, this issue is easily solved by giving the Service Account zero IAM permissions. This ensures that even if a hacker got ahold of the credential, the worst they could do is overwrite your Coinbase Portfolio spreadsheet. More details on these two permission models can be found here.
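To make the point concrete, here is a minimal sketch (using the gspread library and a placeholder spreadsheet name, neither of which is prescribed by the original script) of a service account with zero IAM roles writing to a sheet that was shared with its client_email:

import json
import os

import gspread

# The service account needs no Google Cloud IAM roles at all; it only needs
# the spreadsheet shared with its client_email via the normal "Share" button.
creds_info = json.loads(os.environ["GOOGLE_COINBASE_CREDS"])
gc = gspread.service_account_from_dict(creds_info)

# "Coinbase Portfolio" is a placeholder -- use whatever your sheet is named.
sheet = gc.open("Coinbase Portfolio").sheet1
sheet.update_acell("A1", "Updated via Drive sharing, not IAM")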

Debugging EventBridge Failures

After setting up my ECS Task Definition and EventBridge Rule, I sadly discovered that my spreadsheet was not being updated. I poked around in the EventBridge metrics, but all I could find was a graph of failed invocations over time.

This graph let me know that something was wrong, but it did not answer the obvious next question of why. As I dug around AWS and searched the web, I found that many users shared this consternation. Some claimed that scheduling a Fargate task from EventBridge was simply not possible. To make matters even more confusing, scheduling Fargate tasks from EventBridge was only added in 2018, so it actually was impossible in the past.

Fortunately, I learned that AWS CloudTrail retains an event history, and I was able to find instances of the RunTask event that is triggered by the EventBridge scheduler.

Clicking one of the RunTask entries revealed a lot more detail about the event, including a very suspicious error message.

"errorCode": "InvalidParameterException",
"errorMessage": "Network Configuration must be provided when networkMode 'awsvpc' is specified.",

I learned that Fargate always runs with awsvpc network mode, and this means a corresponding network configuration must be provided in the EventBridge Rule (more details in this helpful post from Jimena Mora). It turns out that the EventBridge UI subtly hints that you must specify network configuration when using awsvpc, but it does not spell out that this always applies to Fargate, and it certainly does not use this information for form validation.
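For reference, here is a rough sketch of creating the schedule and target with boto3, using placeholder ARNs, subnet, and security group IDs; the important piece is the NetworkConfiguration block inside EcsParameters:

import boto3

events = boto3.client("events")

# Fire the portfolio update on a fixed schedule.
events.put_rule(
    Name="coinbase-portfolio-update",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

events.put_targets(
    Rule="coinbase-portfolio-update",
    Targets=[
        {
            "Id": "coinbase-updater-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/coinbase-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/coinbase-portfolio-updater",
                "LaunchType": "FARGATE",
                # Fargate always uses awsvpc networking, so EventBridge needs
                # an explicit network configuration or RunTask will fail.
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0abc1234"],
                        "SecurityGroups": ["sg-0abc1234"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }
    ],
)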

And so finally, with a correct network configuration and securely passed variables, my Coinbase portfolio metrics were updating continuously! Mission accomplished.

Conclusion

This exercise taught me a lot about AWS debugging and Docker best practices. Some takeaways I had:

  • The AWS UI is pretty clunky, but it usually has what you want, somewhere…
  • Crypto taxes actually seem pretty complicated, so I would still consider paying for the CoinTracker tax report
  • The best way to avoid all of these shenanigans is to simply keep HODLing!

Finally, I should note that while my goal was to make a tool that was completely free, I have observed that the continuous updates seem to cost a few cents a day, so I did not entirely meet my goal. But, I think this is a good compromise for the learning I got and the decoupling of the job from my personal computer.

Postgres, Kafka, and the Mysterious 100 GB

Lately I’ve been trying to learn more about Amazon Web Services (AWS) and some other popular technologies. Since I work at a large tech company, I spend a lot of my professional time using proprietary software that is analogous to but very different from what most folks would use for the same task.

In this post, I’ll talk about a surprising challenge I encountered setting up an event-driven architecture proof-of-concept on AWS and how I debugged it. I think the experience shows that PoCs using “production infrastructure” can expose pitfalls that might appear in a real implementation. Please read on if you’re interested in hearing about how I accidentally paid AWS too much money and desperately tried to expose a nonexistent hacker.

In particular, I got interested in learning more about Apache Kafka, since it seemed like an effective way to communicate between microservices. It turns out that Apache Kafka is offered on AWS as a managed service called MSK. And, since I learned that Kafka can siphon events from changes to a database via Kafka Connect, I decided to try the following setup:

  • Deploy an AWS RDS Postgres instance
  • Deploy an AWS MSK instance
  • Deploy a Kafka Connector that would send database change events to Kafka
  • Deploy a Kafka consumer ready to read database change events
  • Finally, observe that inserts to a database table result in events for the consumer

This seems like a lot of machinery, but the architecture is relatively simple: database changes in Postgres flow through the Kafka Connector into Kafka, where the consumer picks them up.

Note that MSK actually comes with Managed Connectors, but I had a hard time getting them to work (and they are significantly more expensive than running your own connector). So, I decided to use an open-source connector called Debezium, which supports Postgres out of the box.
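For anyone curious what wiring up Debezium looks like, registering the connector boils down to posting a JSON config to the Kafka Connect REST API. The sketch below uses placeholder connection details, and note that older Debezium releases use database.server.name where newer ones use topic.prefix:

import requests

# A rough sketch of registering a Debezium Postgres connector with a Kafka
# Connect worker. The endpoint, connector name, and connection details are
# all placeholders for illustration.
connector_config = {
    "name": "portfolio-postgres-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "my-rds-instance.abc123.us-east-1.rds.amazonaws.com",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "REDACTED",
        "database.dbname": "appdb",
        "topic.prefix": "appdb",
        # Debezium creates (or reuses) this logical replication slot in Postgres.
        "slot.name": "debezium",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    json=connector_config,
    timeout=30,
)
resp.raise_for_status()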

After slapping together some tutorials, I had a local version of this architecture running (using Docker Compose to manage each of the containerized components). Then, I was able to create the same environment on AWS, as described in the diagram. Finally, in preparation for the Thanksgiving holiday, I turned down my ECS Services so that I could reduce charges. I left MSK and RDS up, since they took some time to configure properly.

The Problem Starts

I returned a few days later, located my saved CLI commands for turning up the ECS services, and started to pick up where I left off. Unfortunately, I noticed that the service tasks were failing because they could not connect to the database. I also observed that I could no longer connect to the database locally, so I went to the RDS dashboard. There, I discovered that my instance was in a non-responsive mode called “STORAGE_FULL”.

STORAGE_FULL meant that I had somehow filled up (and was currently paying for) 100 GB of disk space. I started to sweat, thinking that I had probably been hacked by hooligans who were turning my instance into some crypto mining machine. These hackers must have been pretty good, since my instance was in a Security Group that only allowed traffic from my IP address. But despite this precaution, I did find evidence of write traffic to the database during my break!

The next step was to inspect the instance itself and try to see if anything weird was there. It’s actually not even possible to check the instance until it has enough storage to get out of STORAGE_FULL mode. An AWS Support doc helped me give my database enough storage to log on and pointed me in the right direction of the culprit: the transaction logs. The corresponding graph clearly showed that my storage had been steadily eaten up by growing transaction logs.
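For anyone stuck in the same spot, the storage bump itself is a single API call; this sketch uses a placeholder instance identifier and size:

import boto3

rds = boto3.client("rds")

# Bump allocated storage (in GiB) so the instance can leave STORAGE_FULL mode.
rds.modify_db_instance(
    DBInstanceIdentifier="my-postgres-instance",
    AllocatedStorage=110,
    ApplyImmediately=True,
)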

The AWS docs go on to say that such an increase can be caused by replication slots, which “retain the WAL files until the files are externally consumed by a consumer” (more on the WAL soon). And then, their suggested query to reveal the source of the growth showed something very suspicious: a replication slot created by Debezium was retaining a huge amount of WAL.
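Something along the lines of the following reproduces that check, showing how much WAL each replication slot is holding on to (connection details are placeholders):

import psycopg2

# Roughly the kind of check the AWS docs suggest: how much WAL each
# replication slot is retaining, largest first.
conn = psycopg2.connect(
    host="my-postgres-instance.abc123.us-east-1.rds.amazonaws.com",
    dbname="appdb",
    user="postgres",
    password="REDACTED",
)

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT slot_name,
               active,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
               ) AS retained_wal
        FROM pg_replication_slots
        ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
        """
    )
    for slot_name, active, retained_wal in cur.fetchall():
        print(slot_name, active, retained_wal)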

At this point, I could reasonably conclude that rather than hackers, my misuse of Debezium led to this issue. And, since the problem was ongoing, it would probably even fill the new storage I had allocated to the database. So, I had to move from discovering the issue to understanding it.

Why Was This Happening?

To understand why something related to Debezium was using up all of my storage for a replication slot, we need to take a step back and learn a little more about how the connection between Postgres and Debezium works.

Debezium gets Postgres change events by reading the Write-Ahead Log (WAL), a log that contains all recent changes to the data. The WAL allows the database to recover from failure, since changes are only written to the data files after they have been successfully captured in the WAL. By default, the WAL does not grow forever. Instead, Postgres intelligently recycles WAL segments after they have been included in a checkpoint (subject to a number of other configurable factors described in WAL Configuration).

In order to guarantee that every change event eventually makes it into Kafka, WAL entries must persist until the Kafka Connector has consumed them. Otherwise, the Kafka Connector could fall so far behind that it would permanently lose certain events. The events waiting to be consumed are tracked by a replication slot specific to the connector. So, doing something like turning off your Kafka Connector forever could result in WAL sadly piling up behind the replication slot, hoping to one day be consumed…
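You can watch this mechanic in miniature with Postgres’s built-in test_decoding plugin. The sketch below assumes wal_level=logical, a role with replication privileges, and a hypothetical trades table; Debezium uses its own decoding plugin, but the slot behavior is the same:

import psycopg2

# Illustration of a logical replication slot holding on to changes until
# something consumes them. All names here are placeholders.
conn = psycopg2.connect("dbname=appdb user=postgres password=REDACTED host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');")

    # Any write after the slot exists is retained for it (hypothetical table).
    cur.execute("INSERT INTO trades (asset, amount) VALUES ('BTC', 0.01);")

    # Peek at the pending changes without consuming them; this is the backlog
    # that keeps pinning WAL while no consumer is running.
    cur.execute("SELECT data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);")
    for (change,) in cur.fetchall():
        print(change)

    # Dropping the slot releases the retained WAL.
    cur.execute("SELECT pg_drop_replication_slot('demo_slot');")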

And yet, why were new events getting added to the WAL at all, since the database was unused? And what was causing that mysterious write traffic? Fortunately, I found the answer to this final question after I carefully read Debezium’s section on WAL Disk Consumption. There, at the bottom of the section, a tip box points out that Amazon RDS generates small writes of its own to internal tables at roughly five-minute intervals, even when your database is otherwise idle.

Just for confirmation, looking at my database traffic with one-minute granularity does reveal constant, small traffic every 5 minutes.

At last, we have complete information:

  • AWS RDS writes to the database periodically, even when it is otherwise idle
  • By default, a replication slot continues to grow until its contents are consumed by a client
  • Debezium requires a replication slot for getting events into Kafka
  • When Debezium goes down, the replication slot endlessly grows in size

The Fix

So now that we know the problem, how can we fix it? Clearly, we do not want to live in a world where a failed Kafka Connector can result in a complete outage of our database. This could be a significant risk for a production application.

Stopping the Bleeding

In the short term, I needed to get that disk space back. This is as straightforward as simply dropping the huge replication slot:

psql=> SELECT pg_drop_replication_slot('debezium');

Note that turning Debezium back on and letting it catch up could also reduce the size of the slot.

Preventing A Recurrence

At a certain point, we have to be comfortable with the fact that if Debezium is down for long enough, then it will permanently miss some events. This is certainly better than bringing the whole database down instead. To enforce that trade-off, we can put a ceiling on our replication slot size using max_slot_wal_keep_size. This was the solution I came up with after searching through all RDS params with “wal” in them, but it is also endorsed by a few other sources.
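On RDS, that ceiling goes into the DB parameter group. Here is a sketch with a placeholder parameter group name (the setting requires PostgreSQL 13 or newer, and the value is in megabytes):

import boto3

rds = boto3.client("rds")

# Cap how much WAL a replication slot may retain.
rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres13-params",
    Parameters=[
        {
            "ParameterName": "max_slot_wal_keep_size",
            "ParameterValue": "10240",  # roughly a 10 GB ceiling
            "ApplyMethod": "immediate",
        }
    ],
)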

Finally, we could also bolster our defenses by setting an alert for whenever the transaction logs grow too large. This is described in more detail in this post.
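As a sketch, such an alert could be a CloudWatch alarm on the RDS TransactionLogsDiskUsage metric, with placeholder instance and SNS topic names:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the RDS transaction logs exceed roughly 20 GB.
cloudwatch.put_metric_alarm(
    AlarmName="rds-transaction-logs-too-big",
    Namespace="AWS/RDS",
    MetricName="TransactionLogsDiskUsage",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-postgres-instance"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=20 * 1024**3,  # metric is reported in bytes
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:rds-alerts"],
)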

Concluding Thoughts

I’m glad I went through this exercise because it helped me scratch the surface of maintaining production infrastructure for Postgres and Kafka. These systems are definitely not meant to be used intermittently during development. They expect to be online and actively consumed at all times, as they should be!

I was also reminded of the fact that any cloud provider always introduces an extra layer of complexity between the developer and the underlying technology. In this case, a special RDS property broke some assumptions I had about the database’s usage. This is often the case with managed services, where some management layer does not behave as you would expect.

Nonetheless, I would still use these technologies in a real application. They’re certainly robust, the debugging was not so hard, and neither was the proof-of-concept setup.