
Postgres, Kafka, and the Mysterious 100 GB

Lately I’ve been trying to learn more about Amazon Web Services (AWS) and some other popular technologies. Since I work at a large tech company, I spend a lot of my professional time using proprietary software that is analogous to but very different from what most folks would use for the same task.

In this post, I’ll talk about a surprising challenge I encountered setting up an event-driven architecture proof-of-concept on AWS and how I debugged it. I think the experience shows that PoCs using “production infrastructure” can expose pitfalls that might appear in a real implementation. Please read on if you’re interested in hearing about how I accidentally paid AWS too much money and desperately tried to expose a nonexistent hacker.

In particular, I got interested in learning more about Apache Kafka, since it seemed like an effective way to communicate between microservices. It turns out that Apache Kafka is offered on AWS as a managed service called MSK. And, since I learned that Kafka can capture change events from a database via Kafka Connect, I decided to try the following setup:

  • Deploy an AWS RDS Postgres instance
  • Deploy an AWS MSK instance
  • Deploy a Kafka Connector that would send database change events to Kafka
  • Deploy a Kafka consumer ready to read database change events
  • Finally, observe that inserts to a database table result in events for the consumer

This seems like a lot of machinery, but an architecture diagram shows that it is relatively simple:

Note that MSK actually comes with Managed Connectors, but I had a hard time getting them to work (and they are significantly more expensive than running your own connector). So, I decided to use an open-source connector called Debezium, which supports Postgres out of the box.

After slapping together some tutorials, I had a local version of this architecture running (using Docker Compose to manage each of the containerized components). Then, I was able to create the same environment on AWS, as described in the diagram. Finally, in preparation for the Thanksgiving holiday, I turned down my ECS Services so that I could reduce charges. I left MSK and RDS up, since they took some time to configure properly.

The Problem Starts

I returned a few days later, located my saved CLI commands for turning up the ECS services, and started to pick up where I left off. Unfortunately, I noticed that the service tasks were failing because they could not connect to the database. I also observed that I could no longer connect to the database locally, so I went to the RDS dashboard. There, I discovered that my instance was in a non-responsive mode called “STORAGE_FULL”.

STORAGE_FULL meant that I had somehow filled up (and was currently paying for) 100 GB of disk space. I started to sweat, thinking that I had probably been hacked by hooligans who were turning my instance into some crypto mining machine. These hackers must have been pretty good, since my instance was in a Security Group that only allowed traffic from my IP address. But despite this precaution, I did find evidence of write traffic to the database during my break!

The next step was to inspect the instance itself and try to see if anything weird was there. It’s actually not even possible to check the instance until it has enough storage to get out of STORAGE_FULL mode. An AWS Support doc helped me give my database enough storage to log on and pointed me toward the culprit: the transaction logs. The corresponding graph clearly showed that my storage had been steadily eaten up by growing transaction logs.

The AWS docs go on to say that such an increase can be caused by replication slots, which “retain the WAL files until the files are externally consumed by a consumer” (more on the WAL soon). And then, their suggested query for revealing the source of the log growth showed something very suspicious:
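
That query is along these lines (a sketch of the idea rather than AWS’s exact SQL): for each replication slot, compare the slot’s restart position to the current WAL position to see how much WAL the slot is forcing Postgres to keep around.

    -- How much WAL is each replication slot pinning on disk?
    SELECT slot_name,
           active,
           pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
           ) AS retained_wal
    FROM pg_replication_slots
    ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

The suspicious part: a slot named “debezium” that was inactive yet pinning a huge amount of WAL.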

At this point, I could reasonably conclude that rather than hackers, my misuse of Debezium had led to this issue. And, since the problem was ongoing, it would probably even fill the new storage I had allocated to the database. So, I had to move from discovering the issue to understanding it.

Why Was This Happening?

To understand why something related to Debezium was using up all of my storage for a replication slot, we need to take a step back and learn a little more about how the connection between Postgres and Debezium works.

Debezium gets Postgres change events by reading the Write-Ahead Log (WAL), a log of all recent changes to the data. The WAL allows the database to recover from failure, since any changes to the data are only written after they have been successfully captured in the WAL. By default, the WAL does not grow forever. Instead, Postgres intelligently recycles WAL segments after they have been included in a checkpoint (subject to a number of other configurable factors described in WAL Configuration).
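
For reference (a minimal sketch; these are standard Postgres settings), you can peek at the knobs that govern checkpointing and WAL recycling, as well as the wal_level that logical decoding clients like Debezium rely on:

    -- Checkpoint / WAL-recycling knobs described in WAL Configuration
    SHOW max_wal_size;
    SHOW checkpoint_timeout;

    -- Logical decoding (what Debezium uses) requires wal_level = 'logical';
    -- on RDS this is turned on via the rds.logical_replication parameter.
    SHOW wal_level;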

In order to guarantee that every event is processed, the Kafka Connector requires WAL events to persist until it has consumed them. Otherwise, the connector could fall so far behind that it would permanently lose certain events. Postgres enforces this through a replication slot specific to the connector: the slot tracks how far the connector has read, and Postgres retains every WAL segment past that point. So, doing something like turning off your Kafka Connector forever could result in your WAL events sadly piling up behind the replication slot, hoping to one day be consumed…
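
Under the hood, Debezium creates this logical replication slot the first time it connects. Conceptually, that is equivalent to something like the following (a sketch: I show the pgoutput plugin here, but Debezium supports a few decoding plugins, and the slot name comes from the connector config):

    -- Roughly what the connector does on first connect
    SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

    -- While the connector is down, the slot sticks around but shows as inactive
    SELECT slot_name, active FROM pg_replication_slots;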

And yet, why were new events getting added to the WAL at all, since the database was unused? And what was causing that mysterious write traffic? Fortunately, I found the answer to both questions after I carefully read Debezium’s section on WAL Disk Consumption. There, at the bottom of the section, was a very special quote inscribed in a “tip” box, explaining that AWS RDS performs small writes of its own to the database on a regular schedule, even when the database is otherwise idle.

Just for confirmation, looking at my database traffic with one-minute granularity did reveal constant, small write traffic every 5 minutes.
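
You can also see this from inside the database itself: even with no application writing, the WAL position keeps advancing (a quick check, assuming nothing else of yours is connected):

    -- Run this twice, a few minutes apart; the LSN keeps moving forward even
    -- though no application is writing, because RDS performs its own writes.
    SELECT now(), pg_current_wal_lsn();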

At last, we have complete information:

  • AWS RDS periodically writes to the database, even when it is otherwise idle
  • By default, a replication slot retains WAL indefinitely, until a client consumes it
  • Debezium requires a replication slot for getting events into Kafka
  • When Debezium goes down, the WAL retained by its replication slot grows without bound

The Fix

So now that we know the problem, how can we fix it? Clearly, we do not want to live in a world where a failed Kafka Connector can result in a complete outage of our database. This could be a significant risk for a production application.

Stopping the Bleeding

In the short term, I needed to get that disk space back. This is as straightforward as simply dropping the huge replication slot:

psql=>select pg_drop_replication_slot('debezium');
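
If you want to double-check that it worked, the slot should simply no longer show up:

    -- Confirm the slot is gone; disk usage should start recovering shortly after
    SELECT slot_name FROM pg_replication_slots;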

Note that turning Debezium back on and letting it catch up could also reduce the size of the slot.

Preventing A Recurrence

At a certain point, we have to be comfortable with the fact that if Debezium is down for long enough, then it will permanently miss some events. This is certainly better than bringing the whole database down instead. To make that trade-off explicit, we can put a ceiling on the amount of WAL a replication slot may retain using max_slot_wal_keep_size. This was the solution I came up with after searching through all RDS params with “wal” in them, but it is also endorsed by a few other sources.
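
For reference, max_slot_wal_keep_size exists in Postgres 13 and later, and on RDS it has to be set through a custom DB parameter group rather than ALTER SYSTEM. From psql, though, you can confirm the value and see how each slot is doing against the limit (a sketch):

    -- The default of -1 means "retain WAL for slots without limit"
    SHOW max_slot_wal_keep_size;

    -- With a ceiling in place (set via the RDS parameter group), Postgres
    -- reports each slot's remaining headroom and marks overrun slots as
    -- 'lost' instead of filling the disk.
    SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;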

Finally, we could also bolster our defenses by setting an alert for whenever the transaction logs grow too large. This is described in more detail in this post.

Concluding Thoughts

I’m glad I went through this exercise because it helped me scratch the surface of maintaining production infrastructure for Postgres and Kafka. These systems are definitely not meant to be used intermittently during development. Instead, they expect to be hyperactive and online, as they should be!

I was also reminded of the fact that a cloud provider always introduces an extra layer of complexity between the developer and the underlying technology. In this case, a special RDS behavior broke some assumptions I had about the database’s usage. This is often the case with managed services, where some management layer does not behave as you would expect.

Nonetheless, I would still use these technologies in a real application. They’re certainly robust, the debugging was not so hard, and neither was the proof-of-concept setup.

Comments

  1. Why was the Kafka connector failing at all? Was it failing to send the events to MSK? If the write to MSK was successful, was it somehow waiting for those records to be processed at the other end of the Kafka stream?

    • The Kafka Connector was down because I had turned it off during the holiday weekend to avoid charges. So it wasn’t actually failing; it just wasn’t up at all. But once you configure Postgres for the connector, Postgres will continue piling up the logs, waiting for the connector to come back.