The Pliops Rate Limiter

Omer Kepten, Software Engineer at Pliops’ Storage Engine Group

Introduction

RocksDB has a predefined mechanism to stall and stop write operations in diverse scenarios. Though effective in certain instances, this mechanism primarily targets system bottlenecks and heuristics that are of no concern in XDPRocks. It can also be challenging to fine-tune, as most of it is hardcoded or relies on the user to provide meaningful thresholds and write rates. XDPRocks introduces a dynamic Rate Limiter that adeptly identifies the system’s saturation points and eliminates ineffective constraints that slow down write operations.

Existing RocksDB Write Stall Flows and Their Shortcomings

RocksDB has an extensive system for rate limiting its writes when shared flows (flushes, compactions) and disks (I/Os) cannot keep up. At its forefront is the Write Stalls and Stops mechanism which the Pliops Rate Limiter aspires to improve.

Write Stalls and Stops can be triggered by the following criteria:

  • Memtable Capacity: if all memtables are full and awaiting flush, writes are stopped. Additionally, when there are 4 or more memtables and only one memtable is mutable, writes are stalled.
  • L0 file count: When the number of L0 files reaches level0_slowdown_writes_trigger, writes are stalled. When the number of L0 files reaches level0_stop_writes_trigger, writes are fully stopped.
  • Large compactions: When the estimated bytes pending compaction reach soft_pending_compaction_bytes_limit, writes are stalled. When they reach hard_pending_compaction_bytes_limit, writes are fully stopped.
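
For reference, each of these triggers maps to a stock RocksDB option. The sketch below sets them to their upstream default values, purely for illustration; XDPRocks manages these internally:

```cpp
#include <rocksdb/options.h>

// The stock RocksDB options behind each stall/stop trigger, shown with
// their upstream default values. Illustrative only.
rocksdb::Options DefaultStallTriggers() {
  rocksdb::Options options;

  // Memtable capacity: size of a single memtable and how many may exist.
  options.write_buffer_size = 64 << 20;  // 64MB per memtable
  options.max_write_buffer_number = 2;   // memtables per column family

  // L0 file count triggers.
  options.level0_slowdown_writes_trigger = 20;  // stall writes
  options.level0_stop_writes_trigger = 36;      // stop writes

  // Pending-compaction-bytes triggers.
  options.soft_pending_compaction_bytes_limit = 64ull << 30;   // 64GB: stall
  options.hard_pending_compaction_bytes_limit = 256ull << 30;  // 256GB: stop
  return options;
}
```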

Large compactions are not a problem in XDPRocks – due to KV separation, our SST files are lightweight. Therefore, stalls and stops due to pending compaction bytes do not apply. Since XDPRocks’ files are so small and its compactions fast, there is also no use in limiting the number of L0 files. XDPRocks rarely reaches a large L0 file count, and when it does, the impact on either capacity or performance is limited. With that in mind, we have decided to remove all L0 file count thresholds. Our performance studies showed no degradation due to this change, and removing these thresholds ensures our users do not suffer needless stalls and stops.
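
Conceptually, removing these constraints is equivalent to neutralizing the corresponding triggers on stock RocksDB, as in the hedged sketch below (the function name is ours, and XDPRocks applies the change internally, so there is no need to do this manually):

```cpp
#include <limits>
#include <rocksdb/options.h>

// Neutralize the L0-count and pending-compaction-bytes triggers.
// Illustrative sketch only; XDPRocks handles this internally.
rocksdb::Options DisableL0AndCompactionTriggers(rocksdb::Options options) {
  // Push the L0 triggers out of reach so file count never stalls writes.
  options.level0_slowdown_writes_trigger = std::numeric_limits<int>::max();
  options.level0_stop_writes_trigger = std::numeric_limits<int>::max();

  // In RocksDB, a limit of 0 disables the pending-compaction-bytes checks.
  options.soft_pending_compaction_bytes_limit = 0;
  options.hard_pending_compaction_bytes_limit = 0;
  return options;
}
```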

The memtable capacity criteria are inherently present in any data structure that relies on in-memory buffers for writes. When these buffers are full, there is no fallback flow – the writes must be halted until a new memtable is available. RocksDB’s solution strives for faster, lighter flushes (via multiple memtables) and allows the system to recognize when there is limited space for writes and begin stalling, while also avoiding unnecessary stalling when enough memtables are available. The stall duration is determined by a user-provided initial rate (the delayed_write_rate option) together with hardcoded multiplication factors chosen according to the change in bytes pending compaction.
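
The core rate-to-delay relationship can be sketched as follows (a simplification of RocksDB’s write controller; the function name and structure are ours, and the real code also credits elapsed time and applies its multiplication factors):

```cpp
#include <chrono>
#include <cstdint>

// Simplified sketch: with a delayed write rate R (bytes/sec), a write of
// `bytes` is held back roughly long enough to keep ingestion at R.
std::chrono::microseconds StallDelayForWrite(uint64_t bytes,
                                             uint64_t delayed_write_rate) {
  if (delayed_write_rate == 0) return std::chrono::microseconds(0);
  // time = bytes / rate, expressed in microseconds
  return std::chrono::microseconds(bytes * 1'000'000 / delayed_write_rate);
}
```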

Figure 2 RocksDB Flow

The Pliops Rate Limiter Algorithm

The goal of the Pliops Rate Limiter (PRL) is to stall writes in a way that is in tune with the system’s current needs. The key insight behind PRL is that there is no need to focus on what causes the back pressure (e.g., compaction). Instead, it recognizes that the memtables are under pressure, deduces a suitable write rate, stalls the writes, and avoids stops.

The key relationship in determining memtable capacity is the write bandwidth (BW) to the memtables vs. the flush BW. If the write BW is higher than the flush BW, the memtables are sure to reach full capacity and writes will be stopped. This would be simple enough, if not for the volatility of flush duration (see Figure 1 below). Flushes can be greatly impacted by concurrent compactions and by multiple DBs flushing to the same XDP. To handle this, PRL collects a sample vector of the most recent flush BW measurements and deduces the write rate limit by taking the minimum BW sample. By working with a recent collection of flush BWs, PRL sets a rate that fits the current system needs and avoids unnecessarily long delays.
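
A minimal sketch of this idea is shown below; the class name and window size are ours and do not reflect the actual XDPRocks implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

// Keep the last N flush bandwidth samples and use the minimum as the
// delayed write rate, so writes never outpace the slowest recent flush.
class FlushBandwidthWindow {
 public:
  explicit FlushBandwidthWindow(size_t window_size = 8)
      : window_size_(window_size) {}

  // Record a finished flush: flushed bytes divided by flush duration.
  void AddFlushSample(uint64_t flushed_bytes, double flush_seconds) {
    if (flush_seconds <= 0.0) return;
    samples_.push_back(static_cast<uint64_t>(flushed_bytes / flush_seconds));
    if (samples_.size() > window_size_) samples_.pop_front();
  }

  // The suggested write rate limit: the minimum flush BW in the window.
  uint64_t SuggestedWriteRate() const {
    if (samples_.empty()) return 0;  // 0 = no limit yet
    return *std::min_element(samples_.begin(), samples_.end());
  }

 private:
  size_t window_size_;
  std::deque<uint64_t> samples_;
};
```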

Figure 1 Flush duration (blue) and write stops (red) during fill of a single DB out of 16

Once a rate is set, it is employed with the same flow as the original RocksDB rate. That is, when there are multiple memtables and only one is still mutable, the Write Controller begins to stall writes according to the rate set by PRL.
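
On stock RocksDB, a rate computed this way could be handed to the Write Controller through the mutable delayed_write_rate option, roughly as sketched below (XDPRocks wires PRL in internally; this only illustrates the hand-off):

```cpp
#include <string>
#include <unordered_map>
#include <rocksdb/db.h>

// Feed a computed rate (bytes/sec) to RocksDB's write controller by
// updating the mutable delayed_write_rate option at runtime.
rocksdb::Status ApplyWriteRate(rocksdb::DB* db, uint64_t rate_bytes_per_sec) {
  return db->SetDBOptions(
      {{"delayed_write_rate", std::to_string(rate_bytes_per_sec)}});
}
```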

For precise PRL BW measurements, every DB has exactly one flush thread. When the user specifies max_background_jobs, one thread is assigned to flushes and the rest to compactions. We recommend starting by tuning max_background_jobs and seeing how the system performs. Another important improvement is the size and number of the memtables: PRL uses 4 memtables, and the minimum required size for a single memtable is 32MB. It is advisable not to change the write_buffer_size and max_write_buffer_number options and to let XDPRocks configure them.
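
Expressed in stock RocksDB options, the shape of this configuration looks roughly like the sketch below (illustrative only; XDPRocks sets these up itself, so there is no need to apply them manually):

```cpp
#include <rocksdb/options.h>

// Rough equivalent of the PRL configuration in stock RocksDB options.
rocksdb::Options PrlStyleOptions(int max_background_jobs) {
  rocksdb::Options options;

  // One dedicated flush thread per DB, the rest for compactions
  // (using the explicit legacy knobs rather than the automatic split).
  options.max_background_flushes = 1;
  options.max_background_compactions = max_background_jobs - 1;

  // Four memtables of 32MB each (the minimum memtable size PRL expects).
  options.write_buffer_size = 32 << 20;
  options.max_write_buffer_number = 4;
  return options;
}
```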

 

Figure 3 XDPRocks PRL flow

Results

16 DBs, 16B keys, 2KB compressible values (0.5 compression factor), dataset of 115,620,000 keys

Baseline configuration:

  • 2x64MB memtables per DB
  • Default L0 file count thresholds
  • 16 background jobs per DB
  • PRL not active

PRL configuration:

  • 4x32MB memtables per DB
  • No L0 file thresholds
  • 1 flush thread, 3 compaction threads per DB
  • PRL active

 

| Configuration | Workload | Average Write Latency (micros) | P50 (micros) | P99.9 (micros) | P99.99 (micros) | Max (micros) | QPS |
|---|---|---|---|---|---|---|---|
| Baseline | Fill | 18.5013 | 1.77 | 4267.54 | 6450 | 265463 | 845659 |
| Baseline | Overwrite, 4 threads | 6.6002 | 3.51 | 464.99 | 2726.36 | 68822 | 575220 |
| Baseline | Overwrite, 8 threads | 15.1891 | 3.89 | 2613.76 | 4301.28 | 106771 | 513390 |
| Baseline | Overwrite, 16 threads | 32.7801 | 5.51 | 4683.1 | 10807.16 | 341036 | 480827 |
| PRL | Fill | 18.9377 | 1.65 | 4295.66 | 6406.9 | 220172 | 825993 |
| PRL | Overwrite, 4 threads | 5.5994 | 2.67 | 507.76 | 2438.01 | 16842 | 677901 |
| PRL | Overwrite, 8 threads | 11.8756 | 3.9 | 1301.34 | 4156.62 | 94270 | 650118 |
| PRL | Overwrite, 16 threads | 25.4907 | 4.76 | 2830.7 | 10738.72 | 26739 | 615850 |

PRL results relative to Baseline (ratio of PRL to Baseline; below 1.0 is better for latency, above 1.0 is better for QPS):

| Workload | Average Write Latency | P50 | P99.9 | P99.99 | Max | QPS |
|---|---|---|---|---|---|---|
| Fill | 1.023588 | 0.932203 | 1.006589 | 0.993318 | 0.829389 | 0.976744764 |
| Overwrite, 4 threads | 0.848368 | 0.760684 | 1.09198 | 0.894236 | 0.244718 | 1.178507354 |
| Overwrite, 8 threads | 0.78185 | 1.002571 | 0.49788 | 0.966368 | 0.882918 | 1.266323847 |
| Overwrite, 16 threads | 0.777627 | 0.863884 | 0.60445 | 0.993667 | 0.078405 | 1.280814097 |

There is a clear improvement in max latency and a substantial growth in QPS. The Baseline runs also show plenty of QPS jitter due to the many stalls and stops these workloads experience (see the table below).

Once the PRL configuration is applied, stalls and stops are eliminated and the QPS is stable:

 

| Workload | Baseline: Stalled Writes | Baseline: Stopped Writes | PRL: Stalled Writes | PRL: Stopped Writes |
|---|---|---|---|---|
| Overwrite, 4 threads | 31676 | 45 | 0 | 0 |
| Overwrite, 8 threads | 36123 | 390 | 0 | 0 |
| Overwrite, 16 threads | 20936 | 392 | 0 | 0 |

 

 
