How We Scaled a Distributed SQL Database to PB Level

Jinpeng Zhang
Dec 2, 2024


Note: This is a text version of my talk at KubeCon 2024 NA.

The Challenge of Scaling Databases for Business Growth

Business growth is an exciting milestone, bringing more users, transactions, and opportunities to innovate. However, with growth comes significant technical challenges:

  • Data Volume Explosion: Rapid user activity generates terabytes to petabytes of data that must be stored and managed efficiently.
  • Increased Complexity: Applications need to scale to support more users, while new features are rolled out to improve experiences and meet evolving demands.
  • Operational Burdens: Traditional database solutions struggle to handle massive data growth, leading to downtime, delays, and costly operational overhead.

Bolt: a fast-growing mobility company

For example, Bolt is a fast-growing mobility company in Europe with more than 100 million users and service in more than 500 cities, and it has grown 400% since the pandemic. Bolt used MySQL as its backend database for many years, and these are the technical challenges it faced during that growth (see Bolt's talk at the HTAP Summit 2023 conference for more details):

Five Core Products of Bolt
  • Hundreds of TB to 1PB data volume
  • More than 100 service deployments every day
  • More than 100 database changes every week
  • 5~10TB of new data generated every month
  • Hit the size limit of a single MySQL instance. The largest MySQL instance held 8TB of data, and backup and restore for such a large instance took many hours, which is not acceptable for applications.
  • Schema changes took up to 2 weeks with gh-ost/pt-online-schema-change for MySQL, and dropping a table froze the database for seconds.
  • High availability issues: MySQL master switchover causes downtime, upgrades require downtime, and MySQL handles connection spikes poorly, all of which affect application availability.
MySQL does not handle massive bursts of new connections well

Traditional solution for data growth: sharding

Database Sharding

Facing the data growth challenge, the most traditional solution is to shard the backend database. To handle more connected users, more application servers are deployed, and the backend database is split into multiple database instances. A large table is sharded into multiple sub-tables and distributed across these database instances.

But there are a lot of drawbacks to the database sharding solution:

  • Need to rewrite the application’s code. Previously, with a monolithic backend database, applications could query data by primary key (for example, by user id) or by secondary indices (for example, querying a specific type of transaction or a time range). After sharding the backend database, the application probably needs to embed database routing code to route queries to the correct shard, and a secondary-index query such as a time-range lookup has to be sent to all backend database instances with the results aggregated in the application layer. Database proxies like Vitess try to hide the sharding complexity from the application layer, but they still face operational complexities such as resharding, keeping sub-tables consistent during schema changes, lack of cross-shard JOIN support, the consistency of cross-shard queries, etc.
  • Compromise on some SQL functionality like JOIN. Since data is partitioned into multiple shards across multiple database instances, running queries like JOIN is no longer easy (see the sketch after this list).
  • As the data volume grows, splitting shards and resharding data is a headache.
  • Keeping schemas consistent across different shards is difficult.
  • Disaster recovery is a complex job. Backing up a consistent database snapshot requires a lot of extra effort.
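
To make this concrete, here is a hedged sketch with hypothetical orders and users tables sharded by user_id: a query that is a single statement on a monolithic database turns into per-shard queries plus application-side merging once the data is sharded.

-- On a monolithic database, one statement is enough:
SELECT o.user_id, SUM(o.amount) AS total
FROM orders o JOIN users u ON u.user_id = o.user_id
WHERE o.created_at >= '2024-01-01'
GROUP BY o.user_id;

-- After sharding orders by user_id, the application (or a proxy) has to
-- fan the time-range query out to every shard and merge the partial results:
--   shard 0: SELECT user_id, SUM(amount) FROM orders_0 WHERE created_at >= '2024-01-01' GROUP BY user_id;
--   shard 1: SELECT user_id, SUM(amount) FROM orders_1 WHERE created_at >= '2024-01-01' GROUP BY user_id;
--   ...and then join with users and re-aggregate in application code.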

What we need is a truly scalable distributed SQL database

Facing this business growth and data growth challenge, what we really need is a truly scalable distributed SQL database with the following strengths:

  • Horizontal scalability: we just add more nodes to the database to keep up with data growth and QPS growth.
  • Low operational complexity: 1) no need to change the application layer’s code; 2) no compromise on SQL features like JOIN; 3) out-of-the-box schema change, backup, CDC, data loading, etc.
  • Low read/write tail latency at large data volumes: single-digit millisecond p99 latency, even with hundreds of TB of data.
  • High availability: the database offers >99.99% availability and can tolerate AZ-level failures.

For the past decade, we have been focused on building exactly such a distributed SQL database: TiDB.

TiDB: an open-source scalable distributed SQL database

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database built for elastic scale. TiDB’s GitHub repository is https://github.com/pingcap/tidb. You can think of TiDB as a MySQL without size limits, with additional strengths like distributed data aggregation and data analytics capabilities. Applications connect to TiDB’s compute layer using the MySQL protocol, and workload and data are automatically balanced across nodes.
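
Because TiDB speaks the MySQL protocol (the TiDB server listens on port 4000 by default), existing MySQL clients and drivers connect without code changes. A minimal sketch with a hypothetical table; the endpoint and user are placeholders:

-- Connect with any MySQL client or driver, for example:
--   mysql -h <tidb-endpoint> -P 4000 -u app_user -p
-- Ordinary MySQL-style SQL then works as-is, while TiDB spreads the data
-- and the workload across its storage nodes behind the scenes.
CREATE TABLE users (
    id   BIGINT PRIMARY KEY,
    name VARCHAR(64)
);
INSERT INTO users VALUES (1, 'Alice');
SELECT * FROM users WHERE id = 1;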

Unlike some other databases that start from the MySQL/PostgreSQL code base, we built TiDB from scratch and designed it as a cloud-native, distributed system from day 1. This choice gives TiDB a higher ceiling in terms of scalability. As shown in the following diagram, TiDB has a disaggregated compute and storage architecture that allows the compute layer and the storage layer to scale independently, which results in better efficiency.

Disaggregated Compute and Storage Architecture of TiDB

TiDB with 1 million QPS

Flipkart is the largest E-Commerce company in India. With over 400 applications, thousands of microservices, and countless varieties of end-to-end e-commerce operations running across over 700 MySQL clusters, Flipkart runs one of India’s largest MySQL fleets.

Flipkart has to satisfy the huge request throughput of its annual Big Billion Days event (like Black Friday in the US), while maintaining hundreds of MySQL instances carries a lot of operational complexity. Flipkart needed a scalable SQL database that could handle such high throughput with low latency and could consolidate multiple MySQL instances into one database to reduce operational cost. Around 2020, Flipkart came to TiDB, and the first production hot-store TiDB was deployed in early 2021. TiDB at Flipkart has now been through several years of Big Billion Days events.

Flipkart ran a 1 million QPS test on TiDB to verify its scalability. They achieved 1 million QPS with single-digit millisecond tail latency on around 32 TiDB nodes. Meanwhile, because of TiKV’s high data compression, the data footprint was reduced by 82% compared to MySQL. See more details in Flipkart’s blog Scaling-TiDB-to-1-Million-QPS.

TiDB with 1 Million QPS

TiDB with 1.8 trillion records

The following diagram shows the data volume of a production TiDB cluster at Zhihu, one of the largest online social network platforms. There are around 1.8 trillion records in this single TiDB cluster. See more details in Trillion Rows of Data, Milliseconds of Response Time. Note that this blog is based on a relatively old version of TiDB (back in 2019); TiDB has made a lot of improvements since then in terms of scalability, stability, performance, resilience, etc.

TiDB with 500TB data

You can find more TiDB adopters like Pinterest, Databricks, Plaid, LinkedIn, etc. on our official website.

Challenges of scaling a distributed SQL database to PB-level

Let me reveal some of the technical challenges we have faced while building our scalable SQL database over the past decade, and explain how we overcame them.

Metadata Management Challenges

Metadata management in TiDB

In a large-scale distributed data or storage system, the metadata center can become the bottleneck of the whole system when scaling to hundreds or thousands of machines. In TiDB, PD (Placement Driver) is the metadata center. PD is responsible for multiple tasks:

  • Monitoring the health of hundreds of storage nodes. All storage nodes periodically send heartbeats to PD to indicate their health. If a storage node fails, PD recognizes the failure after a while and triggers adding data replicas to other nodes. By default, if a storage node stays down for 30 minutes, it is recognized as a failed store; see the PD parameter max-store-down-time.
  • Caching the location and leadership of millions of data ranges. Data is partitioned into millions of data ranges and replicated across multiple storage nodes. To route read/write requests for a specific data range to the right storage nodes, PD needs to cache the location and leadership of these data ranges.
  • Workload auto-balancing, in addition to data auto-balancing. TiKV collects per-data-range read/write QPS and byte flow in real time and reports these statistics to PD. PD then triggers workload re-balancing according to these statistics.
  • Allocating timestamps. As a SQL database that supports ACID transactions, TiDB relies on PD to allocate monotonic timestamps for millions of ongoing transactions.

From the responsibilities listed above, we can see that PD needs to allocate millions of timestamps, track the location and leadership of millions of data ranges, process real-time workload statistics for millions of data ranges and schedule them accordingly, etc. PD carries so many responsibilities that it can become the bottleneck when scaling to hundreds of TBs or several PBs. We made some engineering optimizations so PD can handle higher throughput:

  • A batch TSO API to increase monotonic timestamp allocation throughput.
  • Parallel heartbeat handling to accelerate processing of data range heartbeats.

Besides these engineering optimizations, from an architecture perspective we introduced PD microservices in version 8.0 to scale metadata management further. In the PD microservice architecture, the TSO and scheduling functionalities are disaggregated from PD into independent services that can be deployed on dedicated machines. Different services are physically isolated, so there is no interference between them. For example, the TSO service can respond to timestamp requests with stable low latency.

PD microservice architecture

Hotspot workload challenges

The hotspot workload issue is one of the biggest headaches for a large-scale distributed system. If the majority of requests are routed to a minority of nodes, the result is a “workload hotspot”: those few nodes are easily overwhelmed and become the bottleneck to scaling the system to a large volume.

workload hotspot issue

There are multiple reasons that may cause the workload hotspot issue:

  • Multiple hot tables’ data ranges happen to be located on the same node. For example, suppose there are 100 tables and 5 of them have high read throughput. If these 5 hot tables land on one node, that node might be overwhelmed. In our system, TiKV collects the read and write flow of different data ranges and reports it to the PD scheduling model, which recognizes these potentially hot data ranges and rebalances them across multiple storage nodes. See PD hotspot scheduling for more details.
  • There is extremely high read QPS on a small range of data. For example, a small table has only 100 rows but receives 100K read queries per second. Because this table is very small, it likely has just one data partition, whose 3 replicas are located on 3 different TiKV nodes. By default, all read requests are routed to the one TiKV node that holds the leadership of this data partition, which turns that node into a hotspot. For this case we introduced a load-based split and scheduling strategy: TiKV collects the read QPS distribution inside the hot small data range, splits it into multiple smaller data ranges according to that distribution, and then balances these smaller data ranges onto other nodes so the workload is evenly distributed.
  • Append-only writes to the tail of a data range. This type of hotspot is usually caused by bad table schema design. For example, the following statement creates a table with PRIMARY KEY (last_access, user_id). For this primary key, every write (insert/update) lands at the tail of the primary key’s data range, which causes a write hotspot. In a distributed system we strongly suggest avoiding such a schema design; a sketch of an alternative schema follows the statement.
CREATE TABLE user_access_logs (
    last_access TIMESTAMP,
    user_id INT,
    PRIMARY KEY (last_access, user_id)
);
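
One way to avoid this write hotspot (a hedged sketch, not a schema taken from any real deployment) is to give the table a primary key whose values scatter across data ranges, for example TiDB’s AUTO_RANDOM, and keep the time-based lookup as a secondary index. Note that the secondary index itself still receives appended writes, which is exactly the case the load-based split described above helps with.

CREATE TABLE user_access_logs (
    id          BIGINT AUTO_RANDOM,           -- randomized values spread inserts across data ranges
    last_access TIMESTAMP,
    user_id     INT,
    PRIMARY KEY (id),
    KEY idx_access (last_access, user_id)     -- time-range queries go through this secondary index
);
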
workload auto-balancing

Observability challenges

For a large distributed system with hundreds of nodes: how do you monitor the health and resource usage of each component, identify potential hotspot issues, and locate the root cause of slow queries? All of these questions are answered by good system observability. As a distributed SQL database, TiDB provides observability at different layers.

Cluster Level Metrics

From the cluster perspective, TiDB provides information like cluster topology; data volume distribution; CPU, memory, and IO usage of different nodes; aggregated QPS of the whole cluster; P999/P99/P95/P80 query latency distribution; the proportions of different statements; etc.

With these cluster-level metrics, users get a full picture of how the cluster is doing. If there is a hotspot issue, the CPU usage of one or a few nodes will be higher than that of the other nodes.

Cluster Metrics
QPS and Latency

Key Visualizer

If we observe a hotspot issue in the cluster-level metrics above, how can we quickly locate which table or index has the higher read/write throughput so users can resolve the issue? TiDB Key Visualizer answers this question.

As described in the hotspot workload section, TiKV collects read/write QPS and byte flow for each data range and reports the statistics to the metadata center PD. Key Visualizer is a tool that makes these statistics visible to users. With Key Visualizer, users can easily locate which table and which index have high throughput. The higher the QPS of a data range, the brighter it is shown in Key Visualizer. The bright yellow line in the following diagram shows a read hotspot on the primary key of the bmsql_order_time table.

Key Visualizer Indicates Hotspot Read

Top SQL and SQL Statements

If some slow queries eat too many resources, we need to identify and optimize them so they do not slow down other queries. TiDB Top SQL provides a way to quickly locate these slow queries. As shown in the following chart, the slow queries are ranked by resource usage. From the Top SQL page below, we can see that resource usage increased suddenly about 5 minutes ago: either some new queries are eating a lot of resources, or the execution plan of some existing queries has changed to a suboptimal one.

Top SQL

While Top SQL is used to locate slow queries that eat a lot of resources, SQL Statements is where you can find out which query pattern dominates your workload and whether that is expected. Query patterns are ranked by frequency on the SQL Statements page.

SQL Statements

SQL Visual Plan

On the Top SQL page and the SQL Statements page, you can click a statement to see its details. You can check the statement’s execution plan visually: SQL Visual Plan shows how many steps the execution plan has, how many rows each step scanned, how much time each step took, etc. This helps users view and optimize their SQL execution plans.

SQL Visual Plan

Operational Challenges

As a large-scale system, TiDB faces multiple operational challenges. If users have a large amount of data in object storage like S3, how fast can that data be imported into TiDB so it can start serving? If there is a large table with 100TB of data, how long does it take to add an index to it to satisfy applications’ changing requirements? Although data in TiDB is replicated and distributed across multiple TiKV nodes, backup is still a strong requirement for most applications for disaster recovery, so how long does it take to back up a large cluster? All of these affect the agility of application innovation as well as the operational experience and cost of TiDB. I will show how TiDB handles these challenges in the next 3 sections.

Regular Operations of A Large Cluster

Operational Challenges: import large amounts of data

Importing a large amount of external data into TiDB usually happens when users initialize the data for a new TiDB cluster, for example when migrating an existing HBase deployment to TiDB as Pinterest has done. TiDB supports the IMPORT INTO statement for this purpose, and it supports 3 file formats: Parquet, SQL, and CSV.
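
As a hedged example (the table comes from the test below, the bucket path is a placeholder, and exact options vary by TiDB version), an import from S3 looks roughly like this:

-- Import Parquet files from S3 into an existing, empty table. Credentials
-- are typically supplied as parameters on the S3 URI or picked up from the
-- environment / instance profile.
IMPORT INTO txn_history
FROM 's3://my-bucket/txn-history/*.parquet'
FORMAT 'parquet'
WITH thread = 16;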

Under the hood, this statement is executed on the Distributed eXecution Framework, which can use the resources of multiple TiDB nodes for computation, in this case transforming these files into TiKV’s SST data format and then ingesting them into the storage layer.

Distributed Import Into

The following table shows that it took 31h 30m to import 100.36TB of files from S3 into a TiDB cluster with 12 TiDB nodes. The `txn_history` table has 55 columns and 5 indices. If we add more TiDB and TiKV nodes in this case, the total time can be reduced linearly.

Scalable Fast Import

Operational Challenges: add index for large tables

As shown in the Bolt case above, changing table schemas to fit the evolution of applications is very common; Bolt needs to make more than 100 schema changes each week. Adding an index is one of the most frequently used schema changes. As a large-scale SQL database, TiDB has supported online schema change from day 1, which means a schema change does not block foreground requests. Meanwhile, adding an index in TiDB is very fast.

The following diagram shows how TiDB achieves fast index creation by leveraging the power of the Distributed eXecution Framework. Each TiDB node is responsible for a part of the job: it scans the corresponding data range, generates SST files, and ingests them into the TiKV layer. Meanwhile, new writes (insert/update/delete) to the table are tracked and recorded.
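
A hedged sketch of what this looks like from the SQL side; the system variable names reflect recent TiDB versions and the index definition is hypothetical:

-- Run the DDL on the Distributed eXecution Framework and build the index by
-- generating and ingesting SST files instead of writing row by row.
SET GLOBAL tidb_enable_dist_task = ON;
SET GLOBAL tidb_ddl_enable_fast_reorg = ON;

-- The DDL is online: reads and writes to txn_history keep flowing while the
-- index is being built.
ALTER TABLE txn_history ADD INDEX idx_user_time (user_id, created_at);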

Scalable Adding Index

We ran multiple add-index tests on a 40TB table. The following table shows that when we ran the add index job in single-node mode, it took around 14h to finish. When we ran it in distributed mode, it took around 1h, which is linearly scalable given that there are 14 TiDB nodes in this cluster. When we ran a single statement to add 4 indices to this 40TB table, it took 1h 27m rather than 4h, because the 4 indices reused the same scan results during the procedure.

Scalable fast adding index

Operational Challenges: backup & restore a large database

Although TiDB by default replicates data across multiple nodes to achieve high data durability and high availability, as the data backend of applications, backup is a must-have requirement to make sure data can be recovered after a severe disaster.

For a large-scale system, making backup itself scalable is another technical challenge for TiDB. TiDB supports 2 kinds of backup: snapshot full backup (TiDB Snapshot Backup and Restore) and incremental change log backup (TiDB PiTR). A full snapshot backup dumps a consistent snapshot of the whole cluster, or of a specified database or table, and uploads it to remote storage like S3. Each storage node (TiKV) uploads its data in the form of SST files in parallel.
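
As a hedged example, a full snapshot backup can be driven either with the BR tool or directly with the BACKUP SQL statement; the bucket path below is a placeholder:

-- Every TiKV node uploads its own SST files to the S3 location in parallel.
BACKUP DATABASE * TO 's3://my-bucket/tidb-backup-2024-12-01/';

-- Restore works the same way: each TiKV node downloads and ingests its
-- portion of the SST files in parallel.
RESTORE DATABASE * FROM 's3://my-bucket/tidb-backup-2024-12-01/';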

Parallel Backup of TiDB

We ran two sets of tests on 2 different TiDB clusters with 100TB and 309TB of data respectively. The 100TB backup took 45 minutes, and the 309TB backup took 1h 4m. Even though the 2nd cluster has 3 times the data size of the first, the total backup time is only slightly longer. This is because the amount of data on each storage node is the key factor affecting total backup time, and in the 2nd cluster each storage node holds 3.1TB of data, only slightly more than in the first cluster.

Scalable Backup

Multi-tenancy: Noisy neighbor challenge

Some of our customers need to manage hundreds of MySQL instances because each application prefers to use its own database instance for isolation purposes. Managing such a large number of database instances is painful: imagine handling daily backup, high availability, cost optimization, etc. for all of them. After migrating to TiDB, users tend to merge multiple MySQL instances into one TiDB cluster to reduce operational cost. Since these databases might be used by different applications, this creates a new challenge: how to isolate the impact of a noisy neighbor application whose bad queries may hurt other applications’ response time.

TiDB introduced the Resource Control feature to resolve the noisy neighbor challenge. By defining resource groups and limiting the maximum resource usage of the groups used by different applications, latency-sensitive applications can maintain stable low latency.

Besides consolidating multiple applications into one TiDB cluster, this feature also benefits SaaS applications. SaaS applications typically have many tenants, and allocating a dedicated database instance for each tenant is not feasible given the high resource and operational cost. The Resource Control feature fits SaaS scenarios perfectly.
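
A hedged sketch of how this is configured; the group names, users, and RU quotas are hypothetical:

-- Give the latency-sensitive order application a large, high-priority quota
-- and cap the reporting workload at a much smaller share of request units.
CREATE RESOURCE GROUP IF NOT EXISTS rg_order  RU_PER_SEC = 20000 PRIORITY = HIGH;
CREATE RESOURCE GROUP IF NOT EXISTS rg_report RU_PER_SEC = 2000;

-- Bind each application's database user to its resource group.
ALTER USER 'order_app'  RESOURCE GROUP rg_order;
ALTER USER 'report_app' RESOURCE GROUP rg_report;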

We simulated 2 workloads in the same TiDB cluster: a latency-sensitive order application and a resource-hungry reporting workload. The left part of the following 2 charts shows that without the resource control feature, the order application is heavily impacted by the reporting workload. The right part shows the order application gaining stable QPS and stable low tail latency once the resource control feature is enabled.

Gain Stable QPS by Resource Control
Gain Stable Latency by Resource Control

Other Challenges

Besides the challenges above, we faced other challenges while building the scalable SQL database TiDB, such as how to guarantee the freshness of statistics for a large table, how to integrate TiDB’s data changes into data lake or data warehouse platforms in real time, and how to migrate data from other databases to TiDB. I will reveal more technical details about these challenges in the future.


Written by Jinpeng Zhang

Built the distributed SQL database TiDB from scratch with colleagues. Focused on building large-scale distributed systems and high-performance engineering teams.
