Solid-state drives (SSDs) have become the de facto choice for modern data centers thanks to their low latency, high throughput, and low power consumption. They hold most of today's hot and warm data for applications such as databases, analytics, and artificial intelligence across Fortune 1000 companies, hyperscalers, and HPC environments.

Increases in flash density to support data growth have come at the cost of NAND chip-level reliability. Software inefficiencies worsen the problem by increasing write, read, and space amplification, causing SSDs to underperform and wear out faster. Customers tell us that SSD faults in the servers hosting data-hungry applications are the single largest cause of significant downtime. As a result, they find maintaining high-performance, high-reliability SSD-based storage systems challenging.

Even with RAID and replication schemes, SSD failures impose recovery and repair overhead, affecting the cost and performance of server and storage systems. Site reliability engineers (SREs) must take the failing host out of its cluster, triggering a rebalance that increases application latency. If an SSD needs to be replaced, a data center technician must physically swap the drive. The bottom line is that every failure costs time, money, and customer experience.

Does this sound like a familiar scenario? What options are there to protect the system from SSD-related downtime without major tradeoffs?

Let’s compare the most common RAID configurations:

  • RAID 0 – This configuration offers maximum performance but no data protection. A single drive failure results in server downtime and total data loss. Data protection can instead be implemented at the cluster level, but that leads to longer rebalancing times.
  • RAID 10 (1+0) – Multiple sets of mirrors (RAID 1) striped together (RAID 0). This configuration offers good performance with good data protection, but at high cost: only half the raw capacity is usable.
  • RAID 5 – RAID 5 protects against a single drive failure by striping data and distributing parity across all drives. RAID 5 carries a heavy write penalty: every small write costs four I/Os (read old data, read old parity, write new data, write new parity), and the resulting write amplification accelerates SSD wear-out. Rebuild times are painfully slow, CPU overhead is significant for software RAID 5, and a spare drive is required to hold a failed drive's rebuilt capacity. (The parity mechanics are sketched in the example after this list.)
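
To make that concrete, here is a minimal Python sketch of standard RAID 5 XOR parity (a generic illustration, not Pliops code): any single lost block in a stripe can be rebuilt by XOR-ing the survivors, and a small write really does touch four blocks.

```python
# Minimal illustration of standard RAID 5 XOR parity (generic sketch,
# not Pliops code).

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# A stripe across four drives: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Drive 1 fails: rebuild its block from the survivors plus parity.
assert xor_blocks(d0, d2, parity) == d1

# The small-write penalty: updating d1 alone costs four I/Os --
# read old data, read old parity, write new data, write new parity.
new_d1 = b"bbbb"
parity = xor_blocks(parity, d1, new_d1)  # reads: old parity, old d1
d1 = new_d1                              # writes: new d1, new parity
assert xor_blocks(d0, d1, d2) == parity
```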

The reality is that all traditional RAID options come with big tradeoffs in protection, performance, or cost. What if you could have your cake and eat it too? Pliops takes a new approach that eliminates these tradeoffs. The Pliops Extreme Data Processor (XDP) delivers full NVMe SSD performance while protecting against multiple drive failures. We call this Pliops Drive Failure Protection (DFP).

Traditional RAID vs. Pliops XDP

In addition, XDP reduces write amplification by up to 90%, making it practical to deploy the lowest-cost, highest-capacity TLC and QLC SSDs in the data center. XDP can be a game changer because it delivers full NVMe performance and eliminates SSD-related server downtime.
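
To see why a 90% write-amplification reduction matters, here is a back-of-the-envelope endurance calculation in Python. The drive capacity, P/E rating, and write-amplification factors (WAF) below are illustrative assumptions, not published Pliops or vendor figures.

```python
# Back-of-the-envelope SSD endurance math (illustrative numbers only).
# Host writes a drive can absorb before wear-out:
#   TBW ~= capacity_TB * rated_PE_cycles / write_amplification_factor

def endurance_tbw(capacity_tb: float, pe_cycles: int, waf: float) -> float:
    """Approximate total host writes (TB) before the flash wears out."""
    return capacity_tb * pe_cycles / waf

CAPACITY_TB = 15.36   # assumed QLC drive capacity
PE_CYCLES = 1500      # assumed QLC program/erase rating

baseline = endurance_tbw(CAPACITY_TB, PE_CYCLES, waf=4.0)  # assumed baseline WAF
reduced = endurance_tbw(CAPACITY_TB, PE_CYCLES, waf=0.4)   # 90% lower WAF

print(f"baseline endurance:  {baseline:,.0f} TB of host writes")
print(f"with 90% lower WAF: {reduced:,.0f} TB of host writes")  # ~10x more
```

Holding capacity and P/E rating fixed, cutting the WAF by 90% stretches the same flash to roughly ten times the host writes, which is what makes QLC viable for write-heavy workloads.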

Drive Failure Protection highlights include:

  • Flash-Optimized Architecture: Breakthrough data structures and algorithms ensure full protection without sacrificing the performance needed to meet demanding service-level agreements (SLAs)
  • Virtual Hot Capacity (VHC): Unique dynamic capacity allocation eliminates the need to dedicate any drives as spares (see the capacity sketch after this list)
  • Drive Failure Protection: Protection against multiple drive failures prevents data loss and increases storage resiliency
  • Power Failure Protection: Non-volatile memory (NVM) preserves metadata and user data against loss on power failure
  • Automatic Rebuild: Recovery begins immediately using available VHC capacity, without reducing usable capacity
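
As a rough illustration of what eliminating dedicated spares means for usable capacity, the sketch below compares a traditional RAID 5 array with a hot spare against a spare-less pool. The drive count, drive size, and the assumption of comparable parity overhead are all hypothetical; exact XDP overhead figures are not given here.

```python
# Hypothetical capacity comparison (drive count/size and parity
# overhead are assumptions, not published XDP figures).

DRIVES = 10
DRIVE_TB = 8.0

# Traditional RAID 5 with one dedicated hot spare: one drive's worth
# of parity plus one idle spare are subtracted from the pool up front.
raid5_with_spare_tb = (DRIVES - 1 - 1) * DRIVE_TB

# A spare-less scheme in the spirit of VHC: no drive sits idle, and
# rebuild space is carved from the shared pool only after a failure.
spareless_tb = (DRIVES - 1) * DRIVE_TB  # assuming comparable parity cost

print(f"RAID 5 + hot spare usable capacity: {raid5_with_spare_tb:.0f} TB")
print(f"no dedicated spare usable capacity: {spareless_tb:.0f} TB")
```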

Learn more about Pliops Drive Failure Protection and how it can help increase your server reliability.

Drive Failure Protection Solution Brief
Cloud Service Provider Case Study