Comparing TiDB with DynamoDB

Jinpeng Zhang
Oct 8, 2022

Preface

The DynamoDB team published a paper revisiting the 10-year journey DynamoDB has taken. As I read this paper, I felt a sense of intimacy and familiarity, as if it were telling the story of everything TiDB has been through in the last 7 years: data read/write hotspot issues, high availability issues, network isolation issues, data persistence issues, and more. As a distributed SQL database, TiDB offers much richer semantic expressiveness than a NoSQL API. Consequently, TiDB faces more challenges than DynamoDB, such as the SQL optimizer and distributed pushdown queries.

Predictable Performance at Any Scale

TiDB and DynamoDB are both designed as distributed databases that scale horizontally in terms of both data volume and performance.

In TiDB, data is partitioned by range; a partition of data in TiDB is called a Region. A partition automatically splits when its size reaches a threshold. The default threshold is 96 MB (as of v6.1; partitions can grow up to 10 GB since TiDB launched the dynamic-region feature). When the read/write load on a partition is high, TiDB can also trigger an automatic split and distribute the new partitions to different storage nodes; this feature is called load-based split. In this way, TiDB handles traffic skew and makes good use of distributed resources to provide predictable performance as an application's data volume and QPS grow.
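
As a rough sketch of these two split triggers, here is what the decision might look like. The names and the load threshold below are illustrative, not TiDB's actual implementation; the real logic lives in TiKV and PD and is considerably more involved:

```go
package main

import "fmt"

// Hypothetical thresholds loosely modeled on TiDB's defaults.
const (
	maxRegionSize = 96 << 20 // 96 MB default size-based split threshold
	maxRegionQPS  = 3000     // illustrative load-based split threshold
)

type region struct {
	startKey, endKey string
	sizeBytes        int64
	recentQPS        int64
}

// shouldSplit sketches the two triggers discussed above: split when a
// region grows past the size threshold, or when its read/write load is
// hot enough that spreading it across nodes helps.
func shouldSplit(r region) (bool, string) {
	switch {
	case r.sizeBytes >= maxRegionSize:
		return true, "size"
	case r.recentQPS >= maxRegionQPS:
		return true, "load"
	}
	return false, ""
}

func main() {
	hot := region{startKey: "a", endKey: "m", sizeBytes: 40 << 20, recentQPS: 12000}
	if ok, why := shouldSplit(hot); ok {
		fmt.Printf("split region [%s, %s) by %s\n", hot.startKey, hot.endKey, why)
	}
}
```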

In DynamoDB, data is also partitioned by range, and the partitions are distributed to different storage nodes. A partition in DynamoDB can also auto-split into child partitions by size or by throughput. The main difference from TiDB here is DynamoDB's admission-control machinery: Global Admission Control (GAC), per-partition quotas, and storage-node quotas.
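
The DynamoDB paper describes GAC as a token-bucket scheme coordinated between request routers and a global service. A minimal single-node token bucket conveys the core idea; all names and numbers here are illustrative, not DynamoDB's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a minimal sketch of token-bucket admission control:
// tokens accumulate at the provisioned rate, and each request spends
// tokens proportional to its cost.
type tokenBucket struct {
	capacity   float64 // burst ceiling
	tokens     float64 // currently available capacity units
	refillRate float64 // provisioned units per second
	lastRefill time.Time
}

func (b *tokenBucket) allow(cost float64) bool {
	now := time.Now()
	b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastRefill = now
	if b.tokens < cost {
		return false // over quota: reject or ask the caller to retry later
	}
	b.tokens -= cost
	return true
}

func main() {
	tenant := &tokenBucket{capacity: 100, tokens: 100, refillRate: 100, lastRefill: time.Now()}
	for i := 0; i < 3; i++ {
		fmt.Println("request admitted:", tenant.allow(50)) // each request costs 50 units
	}
}
```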

High Availability

Each partition of data has 3 replicas in both DynamoDB and TiDB. In DynamoDB, Paxos is used to guarantee consistency between the 3 replicas; in TiDB, Raft is adopted. Because there are many partitions in total, DynamoDB calls this Multi-Paxos, while in TiDB it is called Multi-Raft.

In TiDB Cloud, TiDB is deployed across 3 AZs by default and can tolerate the failure of a whole AZ. DynamoDB is also deployed across 3 AZs and can likewise tolerate a whole-AZ failure.

When temporary network isolation occurs between one replica and the other two (also called a gray network failure), the isolated replica will trigger a new election, fail to receive the other two replicas' votes, and keep triggering new elections while increasing its term. After the network recovers, the inflated term disturbs the availability of the Paxos/Raft group. In TiDB, we use the pre-vote feature to eliminate this interference: pre-vote introduces a pre-election in which a possible leader gauges its ability to win before potentially disrupting the cluster. (Pre-vote is described in more detail in the Raft thesis.)
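
A minimal sketch of the pre-vote idea, with a hypothetical requestPreVote call standing in for the real Raft PreVote RPC: a candidate only bumps its term and starts a real election if a majority of the group would actually vote for it.

```go
package main

import "fmt"

// peer abstracts a remote replica; requestPreVote is a hypothetical RPC
// standing in for the real Raft PreVote message.
type peer interface {
	requestPreVote(proposedTerm, lastLogIndex uint64) bool
}

// isolatedPeer simulates a replica that still hears from a healthy leader,
// so it refuses to grant pre-votes.
type isolatedPeer struct{}

func (isolatedPeer) requestPreVote(_, _ uint64) bool { return false }

// shouldStartElection sketches pre-vote: only increment the term and start
// a real election if a majority of the group would actually vote for us.
func shouldStartElection(peers []peer, currentTerm, lastLogIndex uint64) bool {
	votes := 1 // we always vote for ourselves
	for _, p := range peers {
		if p.requestPreVote(currentTerm+1, lastLogIndex) {
			votes++
		}
	}
	return votes > (len(peers)+1)/2
}

func main() {
	// An isolated replica polls its two peers; neither grants a pre-vote,
	// so the replica never inflates its term while partitioned.
	fmt.Println(shouldStartElection([]peer{isolatedPeer{}, isolatedPeer{}}, 5, 100))
}
```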

In DynamoDB, to solve the availability problems caused by gray failures, a follower that wants to trigger a failover first sends a message to the other replicas in the replication group asking whether they can communicate with the leader. If any replica responds that the leader is healthy, the follower drops its attempt to trigger a leader election. This change in DynamoDB's failure-detection algorithm minimized false positives, and hence the number of spurious leader elections.
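
Sketched in the same style, with a hypothetical canReachLeader probe standing in for the message described in the paper:

```go
package main

import "fmt"

// replica abstracts a member of the replication group.
type replica interface {
	canReachLeader() bool
}

type healthyReplica struct{}

func (healthyReplica) canReachLeader() bool { return true }

// confirmLeaderDown sketches DynamoDB's failure-detection tweak: before a
// follower triggers a failover, it asks its peers whether they can still
// talk to the leader, and backs off if any of them can.
func confirmLeaderDown(peers []replica) bool {
	for _, p := range peers {
		if p.canReachLeader() {
			return false // someone still sees a healthy leader: drop the election
		}
	}
	return true // nobody can reach the leader: a failover is warranted
}

func main() {
	fmt.Println(confirmLeaderDown([]replica{healthyReplica{}, healthyReplica{}}))
}
```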

Durability

In TiDB Cloud, cloud disks (AWS EBS and GCP Persistent Disk) are used to store data. In contrast, DynamoDB still uses local disks, for historical reasons. Cloud disks have higher durability and availability than local disks: EBS offers 99.8~99.999% durability, while a local disk usually offers only 98% or 99%. Three replicas, each stored on a cloud disk with 99.9% durability, can together achieve 99.9999999% durability, because (assuming independent failures) the chance of losing all three is (1 − 0.999)³ = 10⁻⁹.

DynamoDB continuously archives write-ahead log files and data files to S3 (object storage), which offers 99.999999999% durability.

In contrast, TiDB Cloud backs up a snapshot of your database to S3 every day and incrementally backs up newly written data at 5-minute intervals.

In DynamoDB, when one partition replica goes down, a log replica, which persists only the recent write-ahead log entries of the partition, is added as the first step, because that is much faster than copying a whole replica. This quickly restores high durability for the most recent writes. The data part of the replica is added afterwards.
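
A sketch of this two-step healing, with illustrative names (logReplica and fullReplica are not DynamoDB's identifiers):

```go
package main

import "fmt"

// replicaKind distinguishes a full replica (data + log) from the
// log-only replica DynamoDB adds first during healing.
type replicaKind int

const (
	fullReplica replicaKind = iota
	logReplica  // stores only recent write-ahead log entries
)

type group struct{ members []replicaKind }

// heal sketches the two-step repair the paper describes: restore write
// durability quickly with a cheap log replica, then promote it to a full
// replica once the bulk data copy finishes.
func heal(g *group) {
	g.members = append(g.members, logReplica) // fast: copy only recent WAL entries
	fmt.Println("log replica attached; recent writes are highly durable again")
	copyPartitionData()                       // slow: stream the partition's full data
	g.members[len(g.members)-1] = fullReplica // promote once the data catches up
	fmt.Println("log replica promoted to full replica")
}

func copyPartitionData() { /* placeholder for the bulk data copy */ }

func main() { heal(&group{members: []replicaKind{fullReplica, fullReplica}}) }
```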

In TiDB, when one partition replica goes down (a replica that sends no heartbeat for 30 minutes is treated as down), the system adds a whole new copy of the replica, including both the data part and the latest write-ahead log (Raft log).

Both TiDB's and DynamoDB's log files and data files use checksums to guarantee the integrity and correctness of the data.
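
A minimal sketch of block-level checksumming, the way storage engines such as RocksDB (which TiKV builds on) typically do it, here using CRC32:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// block pairs a payload with the checksum computed when it was written;
// every read verifies the checksum before trusting the bytes.
type block struct {
	payload  []byte
	checksum uint32
}

func writeBlock(data []byte) block {
	return block{payload: data, checksum: crc32.ChecksumIEEE(data)}
}

func readBlock(b block) ([]byte, error) {
	if crc32.ChecksumIEEE(b.payload) != b.checksum {
		return nil, fmt.Errorf("checksum mismatch: block corrupted")
	}
	return b.payload, nil
}

func main() {
	b := writeBlock([]byte("log entry 42"))
	b.payload[0] ^= 0xFF // simulate a flipped bit on disk
	if _, err := readBlock(b); err != nil {
		fmt.Println(err)
	}
}
```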

Metadata Management

DynamoDB built an in-memory distributed datastore called MemDS to store all metadata. Every request consults MemDS for routing information (which storage node hosts a given data partition).

In TiDB, PD maintains the full routing information by periodically receiving heartbeats from the storage nodes (TiKV). When a data replica moves or a partition splits, the storage node sends a heartbeat to PD as soon as possible so that PD has the latest routes. TiDB also caches routing information with an LRU mechanism: when a request arrives, TiDB first checks the local cache to decide which storage node to send it to, and pulls the route from PD if the key range is not cached.

PD may contain stale routes, but that doesn't matter. When a request reaches the wrong storage node, that node replies that it no longer hosts the key range and, if it knows the correct route, where the request should go. The corrected route is also written back to TiDB's cache, so subsequent requests for that partition use the latest routing information.
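
A simplified sketch of this cache-then-correct routing flow; all names are illustrative, and the real logic lives in TiDB's client layer:

```go
package main

import "fmt"

// routeCache stands in for TiDB's LRU route cache: key range -> node.
type routeCache struct{ routes map[string]string }

// cluster stands in for the authoritative view held by PD and the
// storage nodes themselves.
type cluster struct{ truth map[string]string }

// lookup tries the local cache first, and falls back to PD on a miss.
func (c *routeCache) lookup(key string, pd *cluster) string {
	if node, ok := c.routes[key]; ok {
		return node
	}
	node := pd.truth[key] // fetch a fresh route from PD
	c.routes[key] = node
	return node
}

// send dispatches a request; when a stale route is detected, it applies
// the wrong node's correction to the cache and retries.
func send(key string, cache *routeCache, pd *cluster) {
	node := cache.lookup(key, pd)
	if actual := pd.truth[key]; actual != node {
		// The wrong node rejects us and (if it knows) points at the right one.
		fmt.Printf("stale route %s -> retrying on %s\n", node, actual)
		cache.routes[key] = actual // write the correction back to the cache
		node = actual
	}
	fmt.Printf("request for %q served by %s\n", key, node)
}

func main() {
	cache := &routeCache{routes: map[string]string{"user/1": "tikv-1"}}
	pd := &cluster{truth: map[string]string{"user/1": "tikv-2"}} // region moved
	send("user/1", cache, pd)
}
```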

Strong Consistency of Reads and Writes

In TiDB, all read and write requests are routed to the leader replica by default, so clients always read the latest, consistent data. Clients can also explicitly read from a replica with the follower read feature, which still guarantees strong consistency and freshness via Raft's ReadIndex mechanism: when a follower replica processes a read request, it first uses ReadIndex to ask the partition's leader for the latest commit index of the Raft group. Only after that commit index has been applied locally does the follower start serving the read.
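
A minimal sketch of the ReadIndex flow, with illustrative names rather than TiKV's actual API:

```go
package main

import "fmt"

// follower holds the replica's local state: how far it has applied the
// Raft log, and the key-value data materialized from it.
type follower struct {
	appliedIndex uint64
	kv           map[string]string
}

// leaderCommitIndex stands in for the ReadIndex round trip to the leader.
func leaderCommitIndex() uint64 { return 1042 }

func (f *follower) applyUpTo(index uint64) {
	// Placeholder: replay committed Raft log entries until caught up.
	f.appliedIndex = index
}

func (f *follower) followerRead(key string) string {
	readIndex := leaderCommitIndex() // 1. fetch the leader's commit index
	if f.appliedIndex < readIndex {
		f.applyUpTo(readIndex) // 2. wait for local apply to catch up
	}
	return f.kv[key] // 3. the local read is now as fresh as the leader's
}

func main() {
	f := &follower{appliedIndex: 1000, kv: map[string]string{"k": "v"}}
	fmt.Println(f.followerRead("k"))
}
```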

In contrast, DynamoDB lets developers request either strong or eventual consistency when reading items from a table.

Online Upgrade

Deploying software on a single node is quite different from deploying it to multiple nodes. Deployments are not atomic in a distributed system: at any given time, some nodes are running the old code while other parts of the fleet run the new code. The additional challenge is that the new software might introduce a new message type or change the protocol in a way the old software doesn't understand. DynamoDB handles such changes with read-write deployments, a multistep process. The first step deploys software that can read the new message format or protocol. Once all nodes can handle the new messages, a second deployment updates the software to start sending them.

In TiDB, we deal with the same situation with a version-control mechanism. Until all components have switched to the new version, the old message format and protocol are used; once every component has upgraded, the new messages and protocol take effect.
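
A sketch of the read-write deployment pattern under assumed names (canReadV2 and canWriteV2 are illustrative flags, not real TiDB or DynamoDB configuration):

```go
package main

import "fmt"

// node models the two-step rollout described above: every node learns to
// *read* the new format first; *writing* it is only switched on once the
// whole fleet can read it.
type node struct {
	canReadV2  bool // deployed in step 1
	canWriteV2 bool // enabled in step 2, after the whole fleet reads v2
}

func encodeMessage(n node, fleetAllReadsV2 bool) string {
	if n.canWriteV2 && fleetAllReadsV2 {
		return "v2-message"
	}
	return "v1-message" // stay on the old format until everyone understands v2
}

func main() {
	fleet := []node{{canReadV2: true}, {canReadV2: true, canWriteV2: true}}
	allReadV2 := true
	for _, n := range fleet {
		allReadV2 = allReadV2 && n.canReadV2
	}
	for i, n := range fleet {
		fmt.Printf("node %d sends %s\n", i, encodeMessage(n, allReadV2))
	}
}
```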

Pushdown Aggregation

This is a big difference between TiDB and DynamoDB. Because DynamoDB is a NoSQL database, its read and write requests are very straightforward: get/put/delete/update by key. TiDB, in contrast, is a SQL database with many aggregation queries such as SELECT COUNT/SUM/AVG... FROM t WHERE ...

Take COUNT(*) as an example. If the compute layer (TiDB) pulled all data from the storage nodes (TiKV) and did the counting itself, there would be two shortcomings: 1) heavy data transfer between storage and compute nodes; 2) no use of the cluster's distributed computing power. To address this, TiDB introduced pushdown computing. The pushdown module in the storage layer (TiKV) is called the coprocessor. It runs COUNT/SUM/AVG/sort and other operators on the local portion of the data each storage node holds and returns partial results to TiDB, which performs the final aggregation. This reduces data transfer between nodes and improves parallelism.
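
A minimal sketch of pushdown COUNT(*): each storage node scans only its local rows and returns a partial count, and the compute layer just sums the partials instead of pulling every row over the network.

```go
package main

import "fmt"

// storageNode holds its local slice of a hypothetical table t (ages only).
type storageNode struct{ rows []int }

// coprocessorCount runs on the storage node, next to the data, and
// returns a partial result for the predicate.
func (s storageNode) coprocessorCount(pred func(int) bool) int {
	n := 0
	for _, age := range s.rows {
		if pred(age) {
			n++
		}
	}
	return n
}

func main() {
	// SELECT COUNT(*) FROM t WHERE age > 30, spread over three nodes.
	nodes := []storageNode{{rows: []int{21, 35}}, {rows: []int{40, 18, 62}}, {rows: []int{33}}}
	total := 0
	for _, n := range nodes {
		total += n.coprocessorCount(func(age int) bool { return age > 30 }) // partial result
	}
	fmt.Println("COUNT(*) =", total) // final aggregation in the compute layer
}
```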

Multi-Tenancy

DynamoDB is designed as a multi-tenant NoSQL database. RCUs (read capacity units) and WCUs (write capacity units) are introduced to describe and throttle each tenant's read/write throughput. This works well because DynamoDB's API is very straightforward (get/put/update/delete by key), so metering throughput is easy.

When it comes to TiDB, it is a more complicated story. You can't know how many rows a SQL statement will read, or how much CPU and I/O it will consume, before it actually finishes, e.g. SELECT * FROM t WHERE age = 'xxx'. But if TiDB let every query run first and only calculated its cost afterwards, some storage nodes could be overwhelmed, especially by pushdown aggregation operators. There needs to be a mechanism that accounts for the resources consumed by pushdown operators in stages, as they run. We are still evolving the multi-tenancy features of TiDB Cloud.
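
One possible shape of such staged accounting, as a sketch with assumed names and an assumed cost model (ruPerRow is illustrative, not TiDB Cloud's actual pricing):

```go
package main

import (
	"errors"
	"fmt"
)

// quota tracks a tenant's remaining request units (RU).
type quota struct{ remainingRU float64 }

const ruPerRow = 0.01 // assumed cost model: RU charged per row scanned

func (q *quota) charge(rows int) error {
	cost := float64(rows) * ruPerRow
	if q.remainingRU < cost {
		return errors.New("tenant quota exhausted: throttle or abort the scan")
	}
	q.remainingRU -= cost
	return nil
}

// scanWithMetering processes the table batch by batch, checking the quota
// between batches so a big query can't silently overwhelm a storage node.
func scanWithMetering(totalRows, batch int, q *quota) error {
	for scanned := 0; scanned < totalRows; scanned += batch {
		if err := q.charge(batch); err != nil {
			return fmt.Errorf("after %d rows: %w", scanned, err)
		}
	}
	return nil
}

func main() {
	q := &quota{remainingRU: 5} // enough for 500 rows at 0.01 RU/row
	fmt.Println(scanWithMetering(100000, 100, q))
}
```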


Jinpeng Zhang

Engineering Director at PingCAP, focused on building large-scale distributed storage and databases, and on building high-performance engineering teams.