LSM Tree, Tombstones and YugabyteDB 🚀

9 min readFeb 12, 2023

This is a re-post from dev.to which is the main place where I blog:
https://dev.to/franckpachot — original article from Nov 8, 2022

During Halloween, I’ve heard a creepy story about tombstones, and how database performance may decrease when data dies. It was a nice and funny video, by the way, but I let you find it yourself on the web. I just wanted to be sure that, with the right design of tables, dead data doesn’t decrease performance in YugabyteDB. I take an extreme example with a pattern that has always been difficult for databases: heavy inserts and deletes like in queuing tables.

Distributed SQL databases do not update rows in-place, as it the case in traditional databases with Heap Tables and B-Tree indexes, because this would require sharing blocks across multiple nodes, which does not scale when network latency increases. Another structure is used, easier to distribute, and also efficient for writes on SSD: LSM-Tree (Log Structure Merge Tree).

In LSM-Tree, all changes are appended as a sequence of changes (Log Structure). For fast writes, the first level of the Tree is a logically sorted structure in memory (MemStore) which is then flushed to SST Files (Sorted Sequence Table).

When you read a value, it can be in the MemStore or one of the SST Files. Bloom filters are maintained when writing SST Files to avoid reading all of them. To reduce the number of files, a background compaction occurs, reading multiple files to Merge them into one.

Those structures hold all intermediate changes, which is needed for MVCC (Multi-Version Concurrency Control), the base of non-blocking consistent reads. It is also used to provide cluster-wise consistent snapshots for backups. But after some configurable retention (by default 15 minutes) we don’t need those intermediate versions anymore. The compaction will also take care of this garbage collection. This is very different from PostgreSQL vacuum because this is run at a lower level, physical, where there’s nothing blocked on the current SQL running.

What are those intermediate changes? An INSERT will be a document with all column values. An UPDATE will be the new version of the column values. Those are sequenced with a timestamp (Hybrid Logical Clock) to find the right version for the MVCC read point. DELETE will store only the key, and the timestamp, with the marker that the row was deleted. This is called a tombstone as it marks the end of the row life. It will stay in the LSM-Tree for the MVCC retention, after which a compaction can remove some intermediate versions, and a full compaction will remove all the entries for this key.

Now, this blog post is about the consequence of this. Imagine an extreme case where you constantly insert and drop a few rows. Logically, the table is small. But to be able to serve any read point for the MVCC retention, it will be stored physically as many new rows and tombstones.

To show it, I’ve run the following queries every 10 seconds:

insert 42 rows
delete all rows
measure the table size (with pg_table_size())
run a point query and measure the response time

The goal is to see the table physically increasing with the MVCC versions, and the consequence on a full table scan. We are looking at the worst case here.

In order to display it, I’ve run all this from a Grafana dashboard where I have 4 panes:

Inserts

To make it easy, I create the table there if it doesn’t already exist, and insert 42 rows, rather large ones (100 KiB):

create table if not exists demo (id bigint, value text);

explain (analyse, costs off)
 insert into demo
 select generate_series(1,42),rpad('x',100*1024,'x');

Deletes

As for the insert above, I display the execution plan

explain (analyse, costs off)
 delete from demo;

Table Size

I collect the table size, with the current timestamp, into a table that I create when it doesn’t exist already. I delete the old values that are beyond the time range selected for the dashboard. And insert the row with the size collected from pg_table_size()

create table if not exists metric_table_size(time_column timestamp, size bigint, name text);

delete from metric_table_size where not($__timeFilter(time_column));

insert into metric_table_size
select now() at time zone 'UTC',
 pg_table_size(c.oid),format('%I.%I',nspname,relname) 
from pg_class c
 natural join (select oid relnamespace, nspname from pg_namespace) as ns 
where relkind='r' and relowner!=10;

SELECT $__time(time_column), name, size FROM metric_table_size 
WHERE name = 'public.demo' and size>0 and $__timeFilter(time_column)
order by 1,2,3
;

Of course, there are better ways to run queries and simulate a workload, but I want to impress you with my Grafana skills 😂

Query time

With the same idea of storing the history and then querying it, I run select * from demo where id=1. It will return either no rows, or one, depending on when it runs (after the insert or the delete).

create table if not exists metric_query(time_column timestamp, ms float, rows int, name text);
delete from metric_query where not($__timeFilter(time_column));
do $$ 
 declare
  plan json; 
  query text := $sql$ select * from demo where id=1 $sql$;
 begin
  perform pg_sleep(1);
  execute format('explain (format json, analyze) %s',query) into plan;
  insert into metric_query values(
   now() at time zone 'UTC',
   (plan->0->'Plan'->>'Actual Total Time')::float,
   (plan->0->'Plan'->>'Actual Rows')::int,
   query
  ); end; $$;
SELECT $__time(time_column), format('%s (%s rows)',name,rows) as name, ms FROM metric_query 
WHERE  $__timeFilter(time_column)
order by 1,2,3
;

One hour without Primary Key

If you look at the following graph of response time, in the bottom, you will see it increasing for my point query. This looks like the Halloween story I’ve heard. But, if you looked carefully at the CREATE TABLE statement, you may have seen that it doesn’t follow the most important recommendation: have a primary key.

Here is the result after about one hour:

The size increases. It includes the WAL, that protects the MemStore, and SST files. The first INSERTs and DELETE’s tombstones go to the MemStore which is protected by the WAL. This is what we see increasing in the first minutes.

When the MemStore reaches 128 MiB (--memstore_size_mb=128), it is flushed to a new SST File and then we see the total size doubling quickly. My table never had more than 42 rows, but the WAL contains all changes, for no-data-loss High Availability, and the SST Files also contain all changes, for non-blocking transaction isolation with MVCC.

Note that, in addition to this RegularDB MemStore, all changes go first to the IntentsDB, which holds the transactions provisioning records, to be able to commit or rollback atomically. As I am in autocommit, this will stay small, but still need to generate WAL.

The WAL is kept 15 minutes by default (--log_min_seconds_to_retain=900 --log_min_segments_to_retain=2) because, in addition to protecting the MemTable, it can be used to synchronize a node that was temporarily down, so that it can get catch-up the gap without copying whole files.

The intermediate versions are kept 15 minutes by default (--timestamp_history_retention_interval_sec=900) to allow long queries (in read committed, log transaction with higher isolation level) to build the snapshot for their read point. Then, the background compaction will start to reduce the size.

This explains why the size still increases after the MemTable is flushed. The table size starts to reach its working size after 15 minutes. Here, the size is about 580MiB. Remember that my rows are large (100 KiB), with 42 inserted and deleted every 10 seconds. After 15 minutes, this is 42 * 100 * (900 / 10) / 1024 = 340 MiB of raw data, plus all transaction metadata and tombstones. Note also that the size is from the Raft leaders only. The size in the whole cluster is multiplied by the replication factor.

Note that the RAM size in the MemStore is not accounted here, but only the WAL that protects it, which goes to disk. You can see it from the Tablet Server’s Web Console, both IntentsDB and RegularDB have reached their maxiumum of 128 MiB:

Id                                                                                     Current     Peak    Limit
-------------------------------------------------------------------------------------  -------  -------  ------- 
tablet-374f3787422544c181beb2f24fce7655->Tablets->server->root                         231.27M  256.87M none
 IntentsDB->tablet-374f3787422544c181beb2f24fce7655->Tablets->server->root             115.96M  128.79M none
  MemTable->IntentsDB->tablet-374f3787422544c181beb2f24fce7655->Tablets->server->root  115.96M  128.79M none
 OperationsFromDisk->tablet-374f3787422544c181beb2f24fce7655->Tablets->server->root         0B       0B none
 RegularDB->tablet-374f3787422544c181beb2f24fce7655->Tablets->server->root             115.31M  128.08M none
  MemTable->RegularDB->tablet-374f3787422544c181beb2f24fce7655->Tablets->server->root  115.31M  128.08M none

So, the table size is mostly WAL, Write Ahead Logging, which has to be read only when there’s a node failure. Then I would not expect the response time to increase as we can see above. I read only one row (select * from demo where id=1) out of at most 42 ones. But you see the response time increasing up to 150 milliseconds, and this increase starts to stop only after 45 minutes. Are those tombstones hurting performance?

With Primary Key

The above is not optimal. I’ve no primary key in my table, and then the query has to full scan the table. This has to sequential scan the LSM Tree, and rows can be everywhere in those MemStore and SST Files, which hold all intermediate values.

I have to fix this and change the first query to define the id as the primary key:

create table if not exists demo (id bigint primary key, value text);

explain (analyse, costs off)
 insert into demo
 select generate_series(1,42),rpad('x',100*1024,'x');

Note that you should always have a primary key. The support of tables without primary keys is there only to be compatible with PostgreSQL which stores tables in Heap Tables and then allows to create the primary key later, like a secondary index. I did the above only to show the problem. There’s no reason to create a table without primary key.

I have run the same again with this table definition. Note that in order to restart my test easily, I’ve put the following to drop tables in a Grafana dashboard variable that is executed on Dashboard load:

drop table demo;
drop table metric_table_size; drop table metric_query;

This has been running for one hour. The table size (mostly WAL) is the same as above, because the writes are the same. Adding a Primary Key to a table has no overhead because the table is stored in its primary key. The size on disk grows, but I stay with nice predictable reads, lower than one millisecond:

I have detailed the reads that happen between the insert and delete, where one row is found out of 42 ones, from the ones that return no rows. The point query is even faster with no rows because there is not this 100 KiB row to send to the SQL layer.

It is well-known that LSM-Tree are good for writes, as all changes are an appended to an in-memory structure. YugabyteDB also optimizes reads, because that’s what most SQL workloads do with their data. The first level of the LSM-Tree, the MemStore, may generate large WAL, to protect it, but it also accelerates the reads by keeping the latest 128 MiB of changes in RAM. When you read from files, there are also other optimizations, a YugabyteDB cache, and then the filesystem cache. This blog post was there to show that nothing is bad with tombstones in YugabyteDB, even on a table with heavy deletes. They are there actually to accelerate point queries so finding the absence of row is as fast as finding an existing row.

Finally, I realized I mentioned the queue table pattern when events are inserted and picked and deleted. Those usually read the first rows, like order by id fetch first 10 rows only so I've run the same which stays lower than 10ms when reading 10 rows. Note that this is still a SeqScan because the primary key is HASH sharded. I've also tested later when adding 'for update skip locked' this goes to 25 milliseconds, but both are still constant without suffering from tombstones: