How TiKV Mitigates Background Compaction Jobs’ Interference
Compaction jobs in LSM-Tree engines
TiKV uses RocksDB as its disk persistence engine. RocksDB is a typical LSM-Tree KV engine, and it provides many useful features such as column families, multi-threaded compaction, and sub-compaction for level 0. In LSM-Tree engines, data is organized into multiple levels of sst files; compaction jobs merge old sst files and output new ones. By running compactions, LSM-Tree engines can drop deleted data and speed up reads by keeping data well organized. For more background on the LSM-Tree structure, please refer to the LSM-Tree wiki.
TiKV chooses RocksDB’s leveled compaction because TiKV cares most about foreground request latency, and then about overall read/write throughput. Leveled compaction compacts sst files from two adjacent levels and outputs the newly generated sst files to the level closer to the bottom. The basic process is: a) pick an sst file in level L; b) find all overlapping sst files in level L+1; c) drop deleted values if there are any, compact the files together, and output the newly generated sst files into level L+1. The following diagram shows a compaction job involving level-1 and level-2 sst files.
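As a rough illustration, here is a minimal sketch of opening RocksDB with leveled compaction through its C++ API; the option values are placeholders for illustration, not TiKV’s actual settings:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Leveled compaction is RocksDB's default style; set it explicitly here.
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  options.num_levels = 7;           // L0..L6
  options.max_background_jobs = 4;  // threads shared by flushes and compactions
  options.max_subcompactions = 2;   // split large (e.g. L0) compactions

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/lsm_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```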
Background compaction jobs’ interference with foreground requests
Compaction jobs reclaim disk space by dropping deleted KV pairs, and they also improve read performance by keeping the LSM-Tree in good shape. But compaction jobs also interfere with foreground requests, because they compete with them for resources. A compaction job reads its input sst files, merge-sorts them, and writes the result into new sst files, so it consumes both read/write IO and CPU. Compaction jobs can therefore cause IO and CPU spikes, which introduce latency jitter for foreground requests such as put/get. As a latency-sensitive storage system, TiKV has to handle this kind of problem.
Throttle the background compaction jobs
In the first version of TiKV, we used RocksDB’s `rate_bytes_per_sec` to cap the read/write IO bandwidth that compaction jobs can use. Based on our experience, we suggested setting it to 75%~80% of the disk’s max IO bandwidth, reserving at least 20% for foreground requests.
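As a concrete sketch, this is how such a limiter can be configured through RocksDB’s C++ API; the 300 MB/s figure is a made-up placeholder, not a TiKV default:

```cpp
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

rocksdb::Options MakeThrottledOptions() {
  rocksdb::Options options;
  // All background writes (flushes and compactions) share this budget.
  // 300 MB/s is a placeholder; the text above suggests ~75-80% of the
  // disk's max bandwidth.
  options.rate_limiter.reset(
      rocksdb::NewGenericRateLimiter(300LL << 20 /* 300 MB/s */));
  return options;
}
```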
This works in many cases: the max IO used by compaction jobs is capped. But we still experienced IO spikes, as the following diagram shows. The compaction jobs’ IO would ramp up to `rate_bytes_per_sec` within seconds and then drop away just as quickly. These IO spikes have a bad impact on foreground requests.
Smooth the background compaction jobs’ IO with auto-tune
RocksDB 5.9 introduced the auto-tune feature for the RateLimiter. The basic idea of auto-tune is to smooth out the IO of background compaction jobs. The starting rate limit is relatively small compared to `rate_bytes_per_sec` (for example, 1/10 of it). The auto-tuner then raises the limit periodically (for example, every 10 seconds) if the quota was consistently exhausted in the past windows, until the limit reaches `rate_bytes_per_sec`. For example, the rate limit for the first 10 seconds might be 30MB/s, then 60MB/s for the next 10 seconds, then 120MB/s for the third, and so on. Likewise, auto-tune reduces the rate limit when demand drops.
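A minimal sketch of enabling the auto-tuned limiter (RocksDB >= 5.9); the upper-bound rate is again a placeholder:

```cpp
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

rocksdb::Options MakeAutoTunedOptions() {
  rocksdb::Options options;
  // With auto_tuned = true, the rate passed in acts as the upper bound;
  // the limiter starts lower and adjusts itself based on recent demand.
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
      300LL << 20,  // rate_bytes_per_sec: the upper bound (placeholder)
      100 * 1000,   // refill_period_us (the default)
      10,           // fairness (the default)
      rocksdb::RateLimiter::Mode::kWritesOnly,
      true /* auto_tuned */));
  return options;
}
```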
From the diagram above, you can tell that auto-tune smooths the IO usage of compaction jobs and may also lower the peak IO bandwidth. There is an interesting question: it seems auto-tune makes no difference once the limit reaches `rate_bytes_per_sec`, for example during the time range T2 to T3 in the following diagram. That’s right: auto-tune does not reduce the total work of background compaction jobs, it only smooths out the IO spikes. If a large amount of data needs to be compacted and compaction jobs pile up, TiKV’s flow control mechanism kicks in to throttle write throughput, keeping the LSM-Tree engine in a healthy shape so it can keep serving reads with low latency.
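As a purely illustrative sketch of that flow-control idea (not TiKV’s actual code), write throttling could be driven by the pending compaction debt; the thresholds below are made-up placeholders:

```cpp
#include <cstdint>

// Return a write-throttle factor in [0.0, 1.0]: 1.0 means full speed,
// smaller values slow foreground writes down, and 0.0 means stall.
double WriteThrottleFactor(uint64_t pending_compaction_bytes) {
  const uint64_t kSoftLimit = 192ULL << 30;   // placeholder threshold
  const uint64_t kHardLimit = 1024ULL << 30;  // placeholder threshold
  if (pending_compaction_bytes <= kSoftLimit) return 1.0;
  if (pending_compaction_bytes >= kHardLimit) return 0.0;
  // Back off linearly between the soft and hard limits.
  return 1.0 -
         static_cast<double>(pending_compaction_bytes - kSoftLimit) /
             static_cast<double>(kHardLimit - kSoftLimit);
}
```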
Control other background jobs
Besides compaction, there are other background jobs such as backup jobs and import jobs (TiDB Lightning). All of these background jobs impact foreground requests in the same way compaction jobs do. TiKV introduced an IORateLimiter to manage the different kinds of IO.
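The sketch below is purely illustrative (it is not TiKV’s actual IORateLimiter API): a token-bucket limiter that tags each IO request with its source, so compaction, backup, and import traffic share one budget while foreground IO bypasses the limiter entirely:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>

// The IO sources such a limiter might distinguish.
enum class IoType { kForeground, kCompaction, kBackup, kImport };

class SimpleIoRateLimiter {
 public:
  explicit SimpleIoRateLimiter(int64_t bytes_per_sec)
      : bytes_per_sec_(bytes_per_sec),
        available_(bytes_per_sec),
        last_refill_(std::chrono::steady_clock::now()) {}

  // Block until `bytes` of budget is available for this IO type.
  void Request(IoType type, int64_t bytes) {
    if (type == IoType::kForeground) return;  // never throttle foreground IO
    std::unique_lock<std::mutex> lock(mu_);
    Refill();
    while (available_ < bytes) {
      lock.unlock();
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      lock.lock();
      Refill();
    }
    available_ -= bytes;
  }

 private:
  // Token-bucket refill based on elapsed wall-clock time.
  void Refill() {
    auto now = std::chrono::steady_clock::now();
    int64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                             now - last_refill_).count();
    if (elapsed_ms > 0) {
      available_ = std::min(bytes_per_sec_,
                            available_ + bytes_per_sec_ * elapsed_ms / 1000);
      last_refill_ = now;
    }
  }

  const int64_t bytes_per_sec_;
  int64_t available_;
  std::chrono::steady_clock::time_point last_refill_;
  std::mutex mu_;
};
```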
Eliminate huge compaction jobs
RocksDB has a `max_compaction_bytes` option that tries its best to limit the size of a compaction job when picking input sst files; TiKV sets it to 2GB by default. But it cannot prevent all large compaction jobs. If one sst file (say, in L1) overlaps with many files in the next level (L2), then when this sst file is compacted, all the overlapping sst files in the next level must be compacted together. A huge compaction job has many drawbacks: it needs a large amount of extra disk space, which risks running TiKV out of disk; and it may last for several hours and block other compaction jobs on the same level.
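Setting the cap itself is a one-liner in RocksDB’s C++ API:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeBoundedCompactionOptions() {
  rocksdb::Options options;
  // A best-effort cap on the total size of one compaction job.
  options.max_compaction_bytes = 2ULL << 30;  // 2GB, TiKV's default
  return options;
}
```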
TiKV once experienced a huge compaction job that consumed 600GB of extra disk space on a TiKV node holding only 2TB of data. The root cause was an sst file that overlapped with a massive key range in the next level. After this “incident”, TiKV introduced a mechanism to split sst files in advance when it detects that a huge compaction job may occur in the future. The basic idea is: when a compaction outputs new sst files, TiKV splits a newly generated file if it overlaps with too many sst files in the next level. For example, suppose a compaction job compacts sst files from L1 (fileA) and L2 (fileB, fileC, fileD) and generates a new sst file, fileE, for L2. Before installing fileE into L2, TiKV checks whether fileE overlaps with too many files in L3; if so, TiKV splits fileE into several smaller files (fileE1, fileE2, …) to make sure there will be no huge compaction job in the future.
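A hypothetical sketch of that split-on-output check (not TiKV’s actual code; the file struct, function names, and threshold are all made up for illustration):

```cpp
#include <string>
#include <vector>

struct SstFile {
  std::string smallest_key;
  std::string largest_key;
};

// Count the files in the next level whose key range overlaps `file`.
int CountOverlaps(const std::vector<SstFile>& next_level,
                  const SstFile& file) {
  int n = 0;
  for (const auto& f : next_level) {
    bool disjoint = f.largest_key < file.smallest_key ||
                    file.largest_key < f.smallest_key;
    if (!disjoint) ++n;
  }
  return n;
}

// Decide whether a compaction output file should be cut into smaller
// pieces before being installed, so that no single file can seed a
// huge compaction job later.
bool ShouldSplitOutput(const std::vector<SstFile>& next_level,
                       const SstFile& output,
                       int max_overlap_threshold /* e.g. 10 */) {
  return CountOverlaps(next_level, output) > max_overlap_threshold;
}
```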