Reducing TiKV Write IO by 45% by Introducing Compaction Guard

Jinpeng Zhang
Jul 3, 2024


RocksDB Leveled Compaction and Write Amplification

TiKV adopts RocksDB (an LSM-tree KV engine) as its persistent storage engine and uses RocksDB’s leveled compaction as the default compaction strategy to satisfy TiKV’s single-digit-millisecond read/write latency requirement.

RocksDB leveled compaction picks one file from the source level and compacts it into the overlapping files in the next level, which is essentially a merge sort. Typically the write amplification of leveled compaction can reach as high as 30, which means writing 1 MB of user data may cause 30 MB of writes in RocksDB in total.
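For intuition, here is a crude back-of-envelope sketch of where a factor of ~30 can come from; the level count and per-level rewrite factor are illustrative assumptions, not RocksDB’s or TiKV’s actual tuning.

```rust
// A crude model of leveled-compaction write amplification; the numbers
// below are illustrative assumptions, not real RocksDB configuration.
fn estimated_write_amp(levels: u32, rewrites_per_level: u32) -> u32 {
    // Every byte flushed from the memtable is merged again at each deeper
    // level, each time dragging along roughly `rewrites_per_level` times
    // as much data that already lives there.
    levels * rewrites_per_level
}

fn main() {
    // With 6 levels and ~5 effective rewrites per level, writing 1 MB of
    // user data can turn into ~30 MB of total compaction writes.
    println!("~{}x write amplification", estimated_write_amp(6, 5));
}
```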

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

This paper presents PebblesDB, a high-performance key-value store that achieves low write amplification, high write throughput, and high read throughput simultaneously. The key idea of the FLSM (Fragmented Log-Structured Merge Tree) is to split SST files at “guards” during compaction, which reduces rewriting content within the same level.

Fragmented LSM-Tree

As shown in the diagram above, Guard b and Guard e build boundaries across L1 to L6: no SST file’s key range can cross these boundaries, so every SST file lies entirely on one side of each boundary. Guard c builds a boundary in L6 only, so no L6 SST file’s key range crosses it.

By doing this, data in L1 to L6 is aligned and isolated by the guards, and a compaction job between two levels only involves related data. In the paper, PebblesDB reduced write amplification by nearly 60% compared to RocksDB.
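To make the guard idea concrete, here is a minimal Rust sketch (not PebblesDB’s code) of how sorted guard keys partition the key space into fragments, so that compacting one fragment never rewrites SST files belonging to another; the guard values mirror the diagram.

```rust
/// Returns the index of the fragment that `key` falls into, given sorted
/// guard keys. Fragment 0 is everything before guards[0], fragment i is
/// [guards[i-1], guards[i]), and the last fragment is everything after the
/// final guard.
fn fragment_of(key: &[u8], guards: &[&[u8]]) -> usize {
    guards.partition_point(|g| *g <= key)
}

fn main() {
    // Guards "b" and "e", as in the diagram; illustrative values only.
    let guards: Vec<&[u8]> = vec![b"b".as_slice(), b"e".as_slice()];
    // "a", "c" and "f" land in different fragments, so an SST file holding
    // "a" never has to be rewritten when the [b, e) fragment is compacted.
    for key in [b"a".as_slice(), b"c".as_slice(), b"f".as_slice()] {
        println!(
            "{} -> fragment {}",
            String::from_utf8_lossy(key),
            fragment_of(key, &guards)
        );
    }
}
```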

Write Amplification Reduction of FLSM

Split SST Files by Data Range in TiKV

Data in TiKV is partitioned into many data ranges, and each data range is replicated to multiple TiKV nodes by Raft. However, all the data ranges on one TiKV node share a single RocksDB instance. Different data ranges may contain data with different prefixes or from different tables, so a compaction job may contain SST files that overlap with multiple data ranges, mixing unrelated data together and introducing extra write amplification. If we split SST files by data range, this extra write amplification can be eliminated.
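A minimal sketch of the split decision, assuming the data-range start keys are available to the compaction; the names below are illustrative, not TiKV’s actual code. The compaction output is cut whenever two adjacent keys fall into different data ranges.

```rust
/// Index of the data range that `key` belongs to, given sorted range start
/// keys. `range_starts` stands in for the Region boundaries TiKV would
/// fetch from its metadata; the name is hypothetical.
fn range_index(key: &[u8], range_starts: &[Vec<u8>]) -> usize {
    range_starts.partition_point(|start| start.as_slice() <= key)
}

/// Returns true if a new output SST file should be started at `current_key`,
/// i.e. the previous and current keys belong to different data ranges.
fn should_split_output(prev_key: &[u8], current_key: &[u8], range_starts: &[Vec<u8>]) -> bool {
    range_index(prev_key, range_starts) != range_index(current_key, range_starts)
}

fn main() {
    let range_starts = vec![b"b".to_vec(), b"e".to_vec()];
    // "a" and "c" fall into different data ranges, so the output SST is cut
    // between them and the resulting files never mix the two ranges.
    println!("{}", should_split_output(b"a", b"c", &range_starts)); // true
    println!("{}", should_split_output(b"c", b"d", &range_starts)); // false
}
```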

Split SST Files by Data Ranges in TiKV

So we ran an experiment by hacking the RocksDB code to split SST files by data ranges, and the test result was promising: write amplification dropped by 45% for the bulk-insert workload and by nearly 20% for the update workload.

Write IO reduction of Compaction Guard

After confirming the benefits of the compaction guard, our colleague Yi Wu sent a PR that adds an interface for customizing RocksDB’s compaction output file boundaries, and then officially introduced this compaction guard optimization into TiKV here.

Other Benefits of Splitting SST Files by Data Ranges

Besides the IO reduction, splitting SST files by data ranges has another benefit: cleaning up a data range becomes more efficient. Using the compaction guard diagram above, let’s take cleaning up the replica of data range [a, b) as an example. The main steps are:

  1. TiKV first uses RocksDB’s DeleteFilesInRange interface to remove the SST files fully covered by range [a, b);
  2. TiKV then scans and deletes the remaining keys in this range.

After we split SST files by data ranges, most of a data range’s data can be removed by step 1, which greatly improves cleaning speed and reduces resource consumption.
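A minimal sketch of this two-step cleanup against a hypothetical engine trait; the method names below stand in for whatever the underlying RocksDB binding exposes (DeleteFilesInRange plus ordinary deletes) and are not TiKV’s actual API.

```rust
/// Hypothetical view of the storage engine, for illustration only.
trait RangeCleanup {
    /// Drop whole SST files that are entirely contained in [start, end).
    fn delete_files_in_range(&self, start: &[u8], end: &[u8]);
    /// Scan [start, end) and return the keys that are still present.
    fn scan_keys(&self, start: &[u8], end: &[u8]) -> Vec<Vec<u8>>;
    /// Delete a single key.
    fn delete(&self, key: &[u8]);
}

/// Remove the local replica of data range [start, end).
fn clean_range<E: RangeCleanup>(engine: &E, start: &[u8], end: &[u8]) {
    // Step 1: with compaction-guard-aligned SSTs, most of the range's data
    // lives in files fully covered by [start, end), so this cheap metadata
    // operation removes the bulk of it.
    engine.delete_files_in_range(start, end);

    // Step 2: only the leftover keys (from files straddling the range
    // boundaries) need a scan-and-delete pass.
    for key in engine.scan_keys(start, end) {
        engine.delete(&key);
    }
}
```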

Written by Jinpeng Zhang

Director of Engineering @ TiDB, focused on building large-scale distributed systems and a high-performance engineering team.
