Data Protection: Erasure Coding
There’s a new old buzzword surfacing in the storage space: “erasure coding.” Sounds pretty fancy! Believe it or not, most of you are already using some form of erasure coding (EC) in your data centers today. The term broadly describes the math behind data availability and redundancy (think: RAID). What we’re seeing now is traditional RAID exposing significant limitations with both large, slow drives and super-fast SSDs. RAID also says nothing about data locality for availability purposes. Before we get into advanced erasure coding, let’s talk about what RAID-5 and RAID-6 really are.
So RAID-5 – simple, right? We need one extra drive for “parity,” which we commonly understand as the extra bits needed to reconstruct a single failed drive. The math is certainly worthy of a “Common Core” exercise, but in short: we run an XOR across the data blocks to produce the parity, and run XOR again over what survives to reconstruct any missing block (don’t worry about what XOR is – that’s another article!). With that bit of binary math, we can tolerate the failure of a single drive. This method is actually a form of erasure coding: we’ve used XOR to encode redundant bits across different hard drives.
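To make that concrete, here is a minimal sketch of RAID-5-style XOR parity. The three “drives” and their two-byte stripes are purely illustrative:

```python
def xor_parity(blocks):
    """XOR a list of equal-length blocks together byte-by-byte."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return parity

# One stripe from each of three hypothetical data drives.
d0, d1, d2 = b"\x0a\x0b", b"\x1c\x1d", b"\x2e\x2f"
p = xor_parity([d0, d1, d2])

# If d1 is "erased", XOR-ing the survivors with the parity rebuilds it.
recovered = xor_parity([d0, d2, p])
assert recovered == d1
```

The same function both encodes and decodes: XOR-ing everything that survived (data plus parity) yields whatever single block is missing, which is exactly why RAID-5 can tolerate one drive failure.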
OK, got it. How about RAID-6? Not so simple. RAID-6 uses formulas more complicated than I can cover in this article, but you can look up the math and see how it works. The point is that RAID-6 builds on RAID-5’s simple XOR and adds another function (or functions) to double and distribute the parity, allowing for two drive or component failures. Many RAID-6 implementations use erasure coding (EC) based on Reed-Solomon codes. What’s really interesting about Reed-Solomon is that it’s used in everything from CDs and DVDs to the error correction in your home cable modem, as well as RAID-6. Even more interesting – it was developed in 1960! Problems can arise with performance: you really need specialized hardware to do these calculations, or a modified implementation (i.e., one specialized for RAID-6) to get acceptable performance.
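As a hedged sketch of what that “other function” looks like, here is dual parity in the style of a common RAID-6 scheme: P is the plain XOR from RAID-5, and Q weights each drive’s data by a power of a generator in the finite field GF(2^8) – the same field Reed-Solomon codes work in. The drive values and the byte-at-a-time framing are illustrative, not any vendor’s actual implementation:

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8), reducing by a standard polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:       # overflowed 8 bits: reduce modulo the polynomial
            a ^= poly
    return result

def pq_parity(data):
    """Compute P (plain XOR) and Q (generator-weighted XOR) parity bytes."""
    p, q = 0, 0
    g = 1                   # g = 2**i in GF(2^8), i = drive index
    for d in data:
        p ^= d
        q ^= gf_mul(g, d)
        g = gf_mul(g, 2)
    return p, q

drives = [0x0A, 0x1C, 0x2E]   # one byte from each of three data drives
p, q = pq_parity(drives)

# Lose one data drive and P alone recovers it, exactly as in RAID-5.
assert drives[1] == p ^ drives[0] ^ drives[2]
```

Because Q uses independent weights, losing any two components (two data drives, or a data drive plus P or Q) leaves a solvable pair of equations – that recovery algebra is the part real controllers offload to hardware or heavily optimized code.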
With drive capacities ever increasing, the risk of running RAID-5 on large, slow drives is unacceptable. These drives have three problems: they are slow (long rebuild times), they are large (lots of data at risk during a rebuild), and they tend to have higher failure rates. They are cheaper, though! To protect against failure, we usually use RAID-6. It’s a great solution, but as drive sizes increase, the computational effort and rebuild time/risk will eventually bring us to the same place RAID-5 brought us. In a different light, with SSDs we could be overworking the drives and prematurely wearing out the media with all of these parity operations.
What’s next? RAID-7? Nope – RAID-7 is proprietary and not broadly viable. Instead, you’ll start hearing about erasure coding and what the brightest storage and operating-system companies are doing with it. It’s of particular concern when we elevate this to cloud scale and want to protect vast amounts of data on disparate systems, ensuring recovery is fast and complete, using as many “bees in the hive” as possible to encode and decode data when needed. As it stands, many implementations are proprietary today. Microsoft Azure, for example, uses a form of erasure coding to protect data in Azure Storage, and has published a paper describing the mechanism.
Some storage arrays already do this but don’t call it erasure coding. One that comes to mind is the HP 3PAR. 3PAR’s proprietary Fast RAID implementation uses a custom ASIC and divides and conquers at the drive level, letting all drives participate in encoding and decoding (rebuilding). 3PAR has also incorporated data locality into its availability concepts, protecting against shelf failures while keeping a configurable amount of redundant data throughout the array (i.e., N+X copies). We’re seeing web-scale platforms take it on too – Nutanix, for example, has a really interesting EC solution (EC-X, patent pending) that leverages both their data-deduplication and compression engines.
Certainly, RAID-5 and RAID-6 will be here for a long time, but advances in technology dictate advances in the way we protect data. The benefits are both increased resiliency and better storage efficiency. As the amount of information grows, the capacity you “waste” on RAID-5 or RAID-6 parity becomes really important. Consider a simple 120TB raw pool at RAID-6. Let’s be optimistic and pool 20 x 6TB drives in a single group: you’ve lost 12TB of storage just to parity. Realistically, that 120TB is probably broken into a few smaller arrays – split it into two 10-drive RAID-6 groups and that 12TB of parity becomes 24TB. This is where advanced erasure coding implementations come in, letting you “use” more of your storage while getting better data redundancy and faster rebuild times.
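The arithmetic above is easy to check. Here is a quick back-of-the-envelope calculator (sizes in TB; the 16+4 erasure-code layout at the end is a hypothetical wide stripe, not any specific product):

```python
def usable(drives, size_tb, parity_per_group, groups=1):
    """Usable capacity when drives are split evenly into parity groups."""
    per_group = drives // groups
    return groups * (per_group - parity_per_group) * size_tb

raw = 20 * 6                                   # 120 TB raw, 20 x 6TB drives
one_array = usable(20, 6, parity_per_group=2)  # single 20-drive RAID-6
two_arrays = usable(20, 6, 2, groups=2)        # two 10-drive RAID-6 groups
wide_ec = usable(20, 6, 4)                     # hypothetical 16+4 erasure code

print(raw - one_array, raw - two_arrays)       # parity cost: 12 vs 24 TB
```

Note what the last line hints at: a 16+4 erasure code gives the same 96TB usable as the split RAID-6 layout, but it can survive any four simultaneous drive failures instead of two per group – better resiliency for the same overhead.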
Bottom line: when you consider your next storage (or hyper-converged) platform, instead of simply asking “does it support RAID-6?”, have a good conversation about data efficiency, performance, and resiliency as they pertain to your business requirements. Need someone to have that conversation with? We'd be happy to lend an ear about your initiative.