Titan Engine: 6X Write Performance by Separating Large Values From the LSM-Tree
Background
The TiKV project started in 2015 as the storage layer of TiDB. In an early version, we used two RocksDB instances to store the raft log and the data respectively, mainly to speed up development (the very first version of TiKV actually used a single RocksDB instance for both). We then observed that TiKV's write amplification was relatively high. The high write amplification mainly came from two sources: 1) raft-log data has a FIFO pattern, so storing it in an LSM-Tree engine is inefficient; 2) rewriting large values again and again during compaction causes a lot of write amplification for the data part.
For the raft-log part, I started the raft-engine project in 2017 to address this issue. For the second part, I came across the WiscKey paper around 2017 and gave an internal talk about it. The paper presents a method for reducing write amplification by separating values from the LSM-Tree: instead of storing whole key-values in the LSM-Tree, it uses dedicated blob files to store the large values.
Separating Large Values from LSM-Tree
My colleague Huachao Huang started building the Titan engine to separate large values from the LSM-Tree in 2018, and other colleagues such as Bokang, Xinye, and Andy later contributed to the project as well. At that time there was no such open source project, so we had to build it from scratch. The main design followed the WiscKey paper, with some changes. The most significant difference between Titan and WiscKey is that WiscKey does not write large values to the Write Ahead Log (WAL); it writes them directly into blob files, which results in less write amplification. In Titan, we write large values to the WAL file and only generate blob files when flushing memtables to persistent SST files. We chose this strategy for two reasons: 1) atomic writes and data integrity are guaranteed by the WAL alone, and each write only needs to fsync the WAL file; 2) we designed Titan as a plugin for RocksDB, so we did not change the critical write path and instead moved the large-value separation logic to the memtable flushing stage. Compression is applied at the individual blob level.
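To make the flush-time separation concrete, here is a minimal sketch (illustrative types and names, not Titan's actual code) of what happens to each entry when a memtable is flushed: small values stay inline in the SST just like plain RocksDB, while large values are appended to a blob file and only a small pointer is kept in the LSM-Tree.

```cpp
// Illustrative sketch, not Titan's actual API. The normal RocksDB write path
// (WAL + memtable) is unchanged; large values are only diverted into a blob
// file when the memtable is flushed into an SST.
#include <cstdint>
#include <map>
#include <string>

struct BlobPointer {
  uint64_t file_number;  // which blob file holds the value
  uint64_t offset;       // byte offset of the value inside that file
  uint64_t size;         // value length
};

// Minimal stand-ins for the blob-file and SST builders used during flush.
struct BlobFileBuilder {
  uint64_t file_number;
  std::string data;  // in a real engine this is an on-disk, compressed file
  BlobPointer Append(const std::string& value) {
    BlobPointer ptr{file_number, data.size(), value.size()};
    data += value;
    return ptr;
  }
};

struct SstBuilder {
  // key -> either the inline value or an encoded blob pointer
  std::map<std::string, std::string> entries;
};

constexpr size_t kMinBlobSize = 32 * 1024;  // TiKV's default large-value threshold

// Called per entry while flushing a memtable to an SST.
void FlushEntry(const std::string& key, const std::string& value,
                BlobFileBuilder& blob, SstBuilder& sst) {
  if (value.size() < kMinBlobSize) {
    sst.entries[key] = value;  // small values stay inline in the LSM-Tree
    return;
  }
  BlobPointer ptr = blob.Append(value);  // large value goes to the blob file
  // The SST keeps only a small pointer (file number, offset, size).
  sst.entries[key] = std::to_string(ptr.file_number) + ":" +
                     std::to_string(ptr.offset) + ":" + std::to_string(ptr.size);
}

int main() {
  BlobFileBuilder blob{1};
  SstBuilder sst;
  FlushEntry("small_key", std::string(100, 'a'), blob, sst);
  FlushEntry("large_key", std::string(64 * 1024, 'b'), blob, sst);
}
```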
GC Mechanism
When a key-value is deleted from RocksDB, a deletion tombstone is written to RocksDB. When the tombstone meets its corresponding key-value during compaction, the key-value is dropped.
In Titan, once the key and its pointer in the LSM-Tree have been dropped, the corresponding blob becomes invalid. As more and more blobs become invalid, there must be a way to GC them and reclaim disk space.
Rewrite Blob Files
The first part of the GC mechanism is tracking which blob files are worth collecting. We use RocksDB SSTs' user property blocks to record the size of the valid blobs in the corresponding blob files. As shown in the diagram below, SST file 1 has two corresponding blob files, blob file 1 and blob file 2. The valid blob sizes contributed by this SST file to blob file 1 and blob file 2 are 564K and 1028K respectively.
By accumulating all SST files' user property records, we know the total valid blob size of each blob file. We then compare the valid blob size against the file's actual size; if the valid ratio is less than a threshold (0.5 by default), the blob file contains too many invalid blobs and needs to be GCed.
When TiKV starts, we collect all SST files' user properties and record each blob file's valid size. The records are updated after each compaction completes.
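A minimal sketch of this selection logic, with illustrative names: assuming the valid sizes have already been accumulated from the SSTs' user property blocks, a blob file becomes a GC candidate when its valid ratio drops below the threshold.

```cpp
// Sketch of the GC selection described above (illustrative names, not Titan's).
// valid_size is the sum, over all live SST files, of the valid blob bytes they
// reference in this blob file -- collected from the SSTs' user property blocks
// at startup and updated after each compaction.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BlobFileInfo {
  uint64_t file_size = 0;   // total size of the blob file on disk
  uint64_t valid_size = 0;  // bytes still referenced by the LSM-Tree
};

// Pick blob files whose valid ratio dropped below the threshold (0.5 by default).
std::vector<uint64_t> SelectFilesForGC(
    const std::unordered_map<uint64_t, BlobFileInfo>& blob_files,
    double valid_ratio_threshold = 0.5) {
  std::vector<uint64_t> candidates;
  for (const auto& [file_number, info] : blob_files) {
    double valid_ratio =
        info.file_size == 0
            ? 1.0
            : static_cast<double>(info.valid_size) / info.file_size;
    if (valid_ratio < valid_ratio_threshold) {
      candidates.push_back(file_number);  // too many invalid blobs: worth a GC
    }
  }
  return candidates;
}
```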
After deciding which blob file should be GCed, Titan rewrites it: for each blob, it checks whether a corresponding pointer still exists in the LSM-Tree, and generates a new blob file that contains only the valid blobs. Meanwhile, it writes the keys and their new positions back to the LSM-Tree, as sketched below. The old blob file can be removed once no other read request needs it.
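The rewrite step might look roughly like the following sketch (types are illustrative, and an in-memory map stands in for the LSM-Tree's key-to-pointer mapping): only blobs whose keys still point at their old location are copied into the new file, and their pointers are updated to the new positions.

```cpp
// Sketch of the blob-file rewrite described above; not Titan's actual code.
// `lsm_pointers` stands in for the LSM-Tree: key -> current blob pointer.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct BlobPointer {
  uint64_t file_number;
  uint64_t offset;
  uint64_t size;
};

struct Blob {
  std::string key;
  std::string value;
  uint64_t offset;  // offset of this blob inside the old blob file
};

// Copy only the still-referenced blobs into a new file and update the pointers.
std::string RewriteBlobFile(
    uint64_t old_file, const std::vector<Blob>& old_blobs, uint64_t new_file,
    std::unordered_map<std::string, BlobPointer>& lsm_pointers) {
  std::string new_file_data;
  for (const Blob& blob : old_blobs) {
    auto it = lsm_pointers.find(blob.key);
    // A blob is valid only if the LSM-Tree still points at this exact location.
    if (it == lsm_pointers.end() || it->second.file_number != old_file ||
        it->second.offset != blob.offset) {
      continue;  // invalid blob: not copied, its space is reclaimed
    }
    BlobPointer new_ptr{new_file, new_file_data.size(), blob.value.size()};
    new_file_data += blob.value;  // append the live blob to the new file
    it->second = new_ptr;         // write the new position back to the LSM-Tree
  }
  // The old blob file can be deleted once no outstanding read still needs it.
  return new_file_data;
}
```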
Punch Hole
The default Titan GC threshold is 0.5 (discardable-ratio = 0.5), which means we rewrite blob files whose invalid blobs occupy more than 50% of the file's space. It also means that up to 50% of the disk space may be occupied by invalid blobs. Some users complained that after they turned on the Titan engine, disk space usage nearly doubled. Users can of course set a smaller threshold such as 0.2 to make GC more aggressive and reduce space amplification, but that increases write amplification.
Is there a way to reduce space amplification during GC without increasing write amplification? After some discussion, Bokang and Andy came up with the punch-hole method. The basic idea of punch-hole GC is that each blob is aligned to 4K in the blob file. When the invalid blob size in a blob file exceeds a threshold such as 0.2, we call fallocate with the FALLOC_FL_PUNCH_HOLE flag to deallocate the space of these invalid blobs instead of rewriting the whole blob file. This way, GC incurs no extra writes, and the space of invalid blobs is reclaimed promptly.
You may notice that, after we align each blob to 4K, there may be padding after a blob's tail. The default large-value threshold in TiKV is 32K, so in the worst case a blob slightly larger than 32K is padded up to 36K, wasting almost 4K, or roughly 11% of the blob's space; the padding overhead therefore ranges from 0% to about 11%, which is acceptable.
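The punch-hole primitive itself is just a Linux fallocate call. The sketch below shows only that call (Titan's surrounding bookkeeping is omitted); since the filesystem can only deallocate whole blocks, the 4K alignment of blobs is what makes the punched range actually give space back.

```cpp
// Minimal example of the punch-hole primitive (Linux-specific).
// fallocate() and the FALLOC_FL_* flags need _GNU_SOURCE, which g++ defines
// by default; define it explicitly otherwise.
#include <fcntl.h>   // fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <cstdio>

// Deallocate the 4K-aligned range occupied by an invalid blob while keeping
// the file's logical size unchanged. offset and length should cover whole
// filesystem blocks for the space to actually be returned to the filesystem.
bool PunchHole(int fd, off_t offset, off_t length) {
  int ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, length);
  if (ret != 0) {
    std::perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
    return false;
  }
  return true;
}
```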
Performance Testing
We integrated Titan with TiKV/TiDB and tested various workloads such as point-get, range-scan 100, range-scan 10000, and update, over row sizes ranging from 1KB and 2KB up to 32KB. We found that Titan delivers a significant performance improvement for write-heavy workloads like update. For read-only workloads like point-get and range scan, Titan also achieves competitive performance, partly because after separating large values from the LSM-Tree, more keys can be cached in the block cache.
Update
For the update workload, Titan achieved twice the QPS of RocksDB when the row size is 1KB, and 6X the QPS of RocksDB when the row size is 32K.
Range Scan
In the LSM-Tree, scanning a range essentially pulls key-values from the iterators of different levels and merge-sorts them. After separating large values into blob files, each key requires an extra random read to retrieve its large value from the blob file, which might reduce scan performance. We ran comprehensive scan tests and got the following result: when the row size is 16K or larger, Titan has