Flash storage is fast and expensive, which poses a dilemma for system designers: load individual servers with SSDs whose performance far exceeds each server's needs, or put the flash in costly shared arrays? What's needed is a performant way to share individual SSDs across servers - and here's a solution.
At this month's Non-Volatile Memory Workshop, Stanford researcher Ana Klimovic presented the results of her work with Heiner Litz in an extended abstract titled ReFlex: Remote Flash ≈ Local Flash, along with a slide deck, which offer a novel solution to the SSD conundrum.
As early as 2010, it was obvious that Solid State Drives (SSDs), using flash memory, would easily equal the IOPS performance of $100,000+ storage arrays at a fraction of the price. Furthermore, SSDs would not require a storage area network (SAN), as each server could have its own internal PCIe SSD with latency that no SAN array could match.
In those early days, enterprises were thrilled to get SSD performance, even though a 400GB SSD cost several thousand dollars. But as enterprises and cloud vendors adopted low-cost, shared-nothing, scale-out infrastructures - typified by the Google File System and Hadoop - the stranded performance and cost of underused server SSDs have become a major issue.
WHAT IS REFLEX?
ReFlex is a software-only system
for remote Flash access that provides nearly identical performance to accessing local Flash. ReFlex uses a dataplane kernel to closely integrate networking and storage processing to achieve low latency and high throughput at low resource requirements. Specifically, ReFlex can serve up to 850K IOPS per core over TCP/IP networking, while adding 21μs over direct access to local Flash.
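To make the idea concrete: the abstract doesn't spell out ReFlex's wire protocol, but conceptually a remote Flash access is just a small command sent to a server that owns the SSD, followed by the data. The sketch below is purely hypothetical - the header layout, opcodes, and helper function are invented for illustration, not taken from ReFlex - and shows what a minimal remote block-read request over an ordinary TCP socket might look like.

```c
/* Hypothetical sketch only: ReFlex's real wire protocol and dataplane are not
 * reproduced here. This just illustrates issuing a logical-block read to a
 * remote Flash server over a plain TCP socket. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

enum { CMD_READ = 1, CMD_WRITE = 2 };      /* made-up opcodes */

struct flash_req_hdr {                     /* made-up fixed-size request header */
    uint16_t opcode;                       /* CMD_READ or CMD_WRITE             */
    uint16_t flags;
    uint32_t sector_count;                 /* number of 512-byte sectors        */
    uint64_t lba;                          /* starting logical block address    */
    uint64_t req_id;                       /* echoed back in the response       */
};

/* Send one read request; the caller later reads the response header plus
 * sector_count * 512 bytes of data back from the same socket. */
static int send_read_req(int sock, uint64_t lba, uint32_t sectors, uint64_t id)
{
    struct flash_req_hdr hdr;
    memset(&hdr, 0, sizeof(hdr));
    hdr.opcode = CMD_READ;
    hdr.sector_count = sectors;
    hdr.lba = lba;
    hdr.req_id = id;
    return send(sock, &hdr, sizeof(hdr), 0) == (ssize_t)sizeof(hdr) ? 0 : -1;
}
```

The interesting part is everything the sketch leaves out: ReFlex's contribution is processing such requests in a dataplane kernel with so little overhead that the remote hop adds only about 21μs.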
The performance of ReFlex is due to several key factors:
ReFlex leverages the hardware virtualization capabilities of NICs and SSDs to operate directly on hardware I/O queues, without copying data.
The dataplane kernel dramatically reduces I/O overhead compared to library-based I/O calls.
A novel quality-of-service (QoS) scheduler enforces equitable sharing of remote devices by multiple tenants while minimizing long-tail latency.
I/Os are batched when possible.
Polling-based execution removes the uncertainty and overhead of interrupt-based I/O. (A toy sketch of this polling-and-batching pattern follows this list.)
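The last two points work together: a core busy-polls for incoming requests and hands them to the SSD in batches, so per-I/O interrupts and system calls disappear and fixed costs are amortized across many requests. The following is a toy, self-contained illustration of that pattern - not ReFlex source code - with the "device" simulated by a printf.

```c
/* Toy illustration (not ReFlex code) of polling-plus-batching I/O submission.
 * A real dataplane core would poll a NIC receive queue and write batches to
 * an NVMe submission queue, ringing the doorbell once per batch.            */
#include <stdint.h>
#include <stdio.h>

#define MAX_BATCH   8
#define QUEUE_SIZE  64

struct io_request { uint64_t lba; uint32_t sectors; };

/* Simulated incoming request queue (stands in for a NIC RX queue). */
static struct io_request queue[QUEUE_SIZE];
static unsigned q_head, q_tail;

static int poll_queue(struct io_request *out)      /* non-blocking poll      */
{
    if (q_head == q_tail)
        return 0;                                  /* nothing pending        */
    *out = queue[q_head++ % QUEUE_SIZE];
    return 1;
}

static void submit_batch(const struct io_request *batch, int n)
{
    /* Stand-in for writing n entries to a device queue in one shot. */
    printf("submitting batch of %d I/Os, first LBA %llu\n",
           n, (unsigned long long)batch[0].lba);
}

int main(void)
{
    /* Fake some pending requests so the loop has work to do. */
    for (int i = 0; i < 20; i++)
        queue[q_tail++ % QUEUE_SIZE] = (struct io_request){ .lba = 4096 + i, .sectors = 8 };

    /* One pass of the polling loop per batch; a real dataplane core spins forever. */
    struct io_request batch[MAX_BATCH];
    int n;
    do {
        n = 0;
        while (n < MAX_BATCH && poll_queue(&batch[n]))
            n++;
        if (n > 0)
            submit_batch(batch, n);
    } while (n > 0);

    return 0;
}
```

The payoff of this structure is that the cost of a doorbell write, a cache miss, or a protocol-stack traversal is paid once per batch rather than once per I/O, which is how ReFlex reaches hundreds of thousands of IOPS on a single core.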
PUDDING PROOF
As implied above, a two-core server using ReFlex can fully saturate a 1 million IOPS SSD. That compares favorably to the current Linux architecture that
uses libaio and libevent, [and] achieves only 75K IOPS/core at higher latency due to higher compute intensity, requiring over 10× more CPU cores to achieve the throughput of ReFlex.
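For contrast, here is roughly what the conventional Linux path referenced in the quote looks like: a single asynchronous read through libaio (link with -laio). The device path and sizes are illustrative only, and reading a raw block device requires root; the point is that every io_submit()/io_getevents() pair is a system call, part of the per-I/O overhead a dataplane kernel avoids.

```c
/* Conventional Linux async I/O via libaio: one 4 KB read from a block device.
 * Illustrative only; paths and sizes are examples, and O_DIRECT requires an
 * aligned buffer. Build with: cc -D_GNU_SOURCE example.c -laio             */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);    /* example device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))                   /* O_DIRECT alignment */
        return 1;

    io_context_t ctx = 0;
    if (io_setup(128, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);                   /* read 4 KB at offset 0 */

    if (io_submit(ctx, 1, cbs) != 1) {                      /* system call #1 */
        fprintf(stderr, "io_submit failed\n"); return 1;
    }

    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) {          /* system call #2, blocks */
        fprintf(stderr, "io_getevents failed\n"); return 1;
    }
    printf("read completed, %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```

Multiply those two system calls (plus event-loop bookkeeping in libevent) by hundreds of thousands of I/Os per second, and the 10× gap in CPU cores per unit of throughput is easy to believe.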
In addition, testing found that ReFlex can support thousands of remote tenants, a vital consideration when a single cloud data center may contain a hundred thousand or more servers.
THE STORAGE BITS TAKE
Fast flash has been with us for over a decade, yet system architects are still figuring out how to optimize its use in the real world. Of course, a decade ago, the massive scale-out architectures that ReFlex is designed for were the province of only a few internet-scale service providers.
However, once ReFlex - or something like it - is built into system kernels, many of us, even with much more modest hardware footprints, will be able to take advantage of shared SSDs. Imagine the performance gains, for example, of an eight-node video render farm equipped with two high-performance PCIe/NVMe SSDs and a 10Gb Ethernet fabric.
ReFlex-type capabilities become even more important as newer, higher performance - and costlier - non-volatile memory technologies, such as Intel's 3D XPoint, come into wider use. The economic benefits of shared remote SSDs will be even greater than they are today.
Courteous comments welcome, of course. ReFlex won the NVMW'18 Memorable Paper Award.