Having Fun at Queue Depth = 1: What Next Generation Non Volatile Memory (NG-NVM) means for PCIe SSDs and SSD Drivers

November 12, 2015

Author: Stephen Bates (@stepbates)

Update – 26 Nov 2015 – Well, things can move very fast in the Linux world when they want to! Since I wrote this article, an improved, but still pre-production, version of the polling code for the block layer and NVMe driver has made it into the Linux kernel and will go mainline in 4.4. There is a really nice overview of how it works here, and Jens’ patch-set comments and some of his testing results can be found here. It is worth stressing that the results we present should only improve as the polling mode evolves. Stay tuned for updated performance results in due course!

Introduction

I love SSDs! They have transformed the data center by providing high-performance, low-latency access to storage, and that low latency is reshaping the entire data center stack. I will be digging into latency in my next few blog posts, starting here with driver latency.

Characterizing the latency of a QD=1 random NVM Express read is harder than one might expect. However, you can break it down into four components:

  1. Media latency
  2. Controller latency
  3. Fabric latency
  4. Driver/OS/CPU latency

In the next few blog posts I will expand on these four. In this post I’m going to start with point #4: Driver/OS/CPU latency. This gets really fun because there are so many variables. Which OS? Assuming Linux, which kernel version? What type of CPU (x86, ARM, PowerPC, Sparc, etc.)? How many CPUs and hardware threads? What does the memory subsystem look like? How is interrupt handling done, and are you polling or interrupt driven?

The NVM Express Driver – It’s fast, but is it fast enough?

A bunch of companies, including PMC, have worked hard to design NVM Express to be fast. It is much faster than legacy storage protocols (have a look at how one PCIe SSD outperforms configurations of eight and four SATA drives in an OLTP database application), but is it fast enough for Next-Generation NVM (NG-NVM)? To test that, we placed a PCIe logic analyzer between an Intel x86 CPU and a PMC Flashtec NVRAM card and did some measuring, and we obtained some interesting results.

The latency of a random read is composed of multiple parts, and those parts fall into two bins: the bin the SSD controls and the bin the host controls (a simple user-space measurement sketch follows the list below).

  1. For the SSD the latency is very well controlled. You can see in the plot below how we control that latency to be both quick (under 9us on average) and tightly bounded (always better than 11us).
  2. For the host the latency is not well controlled. You can see that the average non-SSD latency is only 5us, but the maximum over the measured period was more than 30us.
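
If you want to reproduce the host's view of this without a logic analyzer, the sketch below times QD=1 4 KiB O_DIRECT random reads from user space. It measures the total round trip (SSD plus host), not the SSD/host split we measured with the analyzer, and the device path /dev/nvme0n1 is just an example; adjust it for your system.

    /* qd1_lat.c: QD=1 4 KiB O_DIRECT random-read latency sketch.
     * Assumes Linux and an NVMe namespace at /dev/nvme0n1 (example path).
     * Build: gcc -O2 -o qd1_lat qd1_lat.c
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dev = "/dev/nvme0n1";   /* example device path */
        const size_t blk = 4096;            /* one 4 KiB read per I/O */
        const int iters = 100000;
        void *buf;

        int fd = open(dev, O_RDONLY | O_DIRECT); /* O_DIRECT bypasses the page cache */
        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, blk, blk)) return 1; /* O_DIRECT needs aligned buffers */

        off_t dev_size = lseek(fd, 0, SEEK_END);
        off_t nblocks = dev_size / blk;
        double worst = 0, sum = 0;

        for (int i = 0; i < iters; i++) {
            /* rand() is unseeded; good enough for a sketch. */
            off_t off = (off_t)(rand() % nblocks) * blk; /* random 4 KiB-aligned offset */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (pread(fd, buf, blk, off) != (ssize_t)blk) { perror("pread"); return 1; }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e3;
            sum += us;
            if (us > worst) worst = us;
        }
        printf("avg %.2f us, max %.2f us over %d reads\n", sum / iters, worst, iters);
        close(fd);
        return 0;
    }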

Sources of Latency

 

Where does this non-SSD latency variability come from? After some digging, we discovered it comes from the handling of the MSI-X interrupt and the passing of that interrupt back to the OS. Many things can affect how long this takes, with implications for both latency and QoS.
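
One quick way to see how much interrupt traffic an NVMe device is generating, and on which cores, is /proc/interrupts. Here is a minimal sketch that filters out the NVMe vector rows; the "nvme" filter string assumes the Linux NVMe driver's vector naming (e.g. nvme0q1) and may need adjusting on your system.

    /* nvme_irqs.c: print the NVMe MSI-X vector rows from /proc/interrupts.
     * Assumes the Linux NVMe driver labels its vectors with "nvme";
     * adjust the filter string for your system if needed.
     */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/interrupts", "r");
        char line[4096];

        if (!f) { perror("fopen"); return 1; }
        /* The first row is the CPU header; print it so the counts line up. */
        if (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "nvme"))       /* keep only NVMe vectors */
                fputs(line, stdout);
        fclose(f);
        return 0;
    }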

Fixing the Driver!

So how do we fix the interrupt issue? Interestingly, there are a couple of separate efforts going on right now to address this:

Intel just launched the Storage Performance Development Kit (SPDK), which attempts to improve the performance of NVMe SSDs. You can learn more about SPDK here. SPDK tries to address the issue I raised above in two ways (a simplified polling sketch follows this list):

  1. It polls the completion queues rather than using MSI-X interrupts.
  2. It runs in user space, which avoids the context switching associated with moving between kernel space and user space.
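
To make the polling idea concrete, here is a heavily simplified sketch of what consuming an NVMe completion queue by polling looks like. This is not SPDK's actual API, just the shape of the technique: spin on the phase tag of the next completion entry instead of sleeping until an MSI-X interrupt arrives (real code also needs the appropriate memory barriers).

    #include <stdint.h>

    /* Illustrative NVMe completion queue entry layout (16 bytes per the spec);
     * only the fields needed for polling are named here. */
    struct nvme_cqe {
        uint32_t result;        /* command-specific result */
        uint32_t reserved;
        uint16_t sq_head;       /* SQ head pointer the controller has consumed */
        uint16_t sq_id;         /* submission queue this completion belongs to */
        uint16_t cid;           /* command identifier */
        uint16_t status;        /* bit 0 is the phase tag, bits 1..15 the status */
    };

    struct cq_state {
        volatile struct nvme_cqe *entries; /* host memory the controller writes into */
        uint16_t head;                     /* next entry we expect to consume */
        uint16_t size;                     /* number of entries in the queue */
        uint8_t  phase;                    /* expected phase tag for new entries */
        volatile uint32_t *head_doorbell;  /* controller register to ack consumption */
    };

    /* Poll the completion queue once: consume any new entries, no interrupts involved. */
    int cq_poll(struct cq_state *cq)
    {
        int completed = 0;

        for (;;) {
            volatile struct nvme_cqe *cqe = &cq->entries[cq->head];

            /* A new entry is valid when its phase tag matches the expected phase. */
            if ((cqe->status & 1) != cq->phase)
                break;

            /* ... look up cqe->cid and complete the matching I/O here ... */
            completed++;

            if (++cq->head == cq->size) {   /* wrap, and flip the expected phase */
                cq->head = 0;
                cq->phase ^= 1;
            }
        }

        if (completed)
            *cq->head_doorbell = cq->head;  /* tell the controller how far we got */
        return completed;
    }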

Separately there is work going on within the Linux kernel block layer to add polling to block devices (including NVMe devices). You can view the current codebase for this here.
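
If you want to experiment with that in-kernel polling path, the sketch below flips the io_poll queue attribute for one device. It assumes a kernel carrying the pre-production block-layer polling patches described above; the exact knob name and behavior may differ between kernel versions, so treat both as assumptions.

    /* enable_io_poll.c: flip the block-layer polling knob for one queue.
     * Assumes a kernel with the (pre-production, at time of writing) block
     * polling support and its io_poll sysfs attribute; the path and the
     * semantics may differ on other kernel versions.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Example path; substitute your own namespace. */
        const char *knob = "/sys/block/nvme0n1/queue/io_poll";
        int fd = open(knob, O_WRONLY);

        if (fd < 0) { perror("open io_poll"); return 1; }
        if (write(fd, "1", 1) != 1) { perror("write"); close(fd); return 1; }
        close(fd);

        /* With io_poll enabled, synchronous O_DIRECT I/O to the device (as in
         * the earlier latency sketch) can be completed by polling rather than
         * by waiting for an MSI-X interrupt. */
        return 0;
    }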

We compared these two methods (SPDK and the polling driver) with the traditional interrupt-driven driver. The results for latency, CPU load and throughput (at QD=1) are summarized below.

As you can see, SPDK and the polling driver achieve better IOPS and latency QoS at the cost of increased CPU load. The polling driver performs slightly worse than SPDK, but it has the advantage of being tied into the block layer of the Linux kernel and can therefore provide services SPDK cannot. In addition, the kernel community is working to improve the polling driver prior to its integration into the upstream kernel.

Conclusions

As SSDs get faster and faster and start to take advantage of NG-NVM (e.g. PMC Flashtec NVRAM or Intel’s Optane SSDs), the overhead of servicing I/O in the driver and OS becomes more marked. The things we used to do because storage was slow (like interrupt-driven completion) make less sense now that storage can be very fast. This has implications all across the compute stack, from caching to tiering to fast primary storage. It is a fundamental shift that offers unprecedented opportunity, and this work is feeding its way into operating systems, applications and even computer hardware. I will touch on this more in the next blog post.






6 thoughts on “Having Fun at Queue Depth = 1: What Next Generation Non Volatile Memory (NG-NVM) means for PCIe SSDs and SSD Drivers”




  1. Hi Steve

    Thanks for the article. Is the last column, CPU Load, in percent or a fraction of one? Is the load for polling 98% or 0.98%?

    Also would interrupt coalescing play a role in creating more efficiency for higher loads?

    Thanks again.

    • Hi Siamack

      CPU load is as reported by the Linux tool ps, which means it is measured such that 100% = one hardware thread. On x86_64 there are two HW threads per core.

      Since we were operating at QD=1 there was no need to look at interrupt coalescing. However, yes, it could definitely help at higher QD and with multiple threads.

      Thanks for asking, good questions, and thanks for taking an interest in the blog!

      Cheers

      Stephen

  2. Great article Stephen! When are parts 1, 2, and 3 getting published?

    Is this like Star Wars where you’re starting at the end of the saga and working backward?

    • Hi Kurt

      I hope all is well in your neck of the woods. Thanks for letting me know you enjoyed the post.

      I, like NVMe, am allowed to do “out of order” completions ;-)! I do like the idea of a Star Wars theme though. Maybe “The Return of the IO” or “The Die Contention Issue Strikes Back”.

      Anyway stay tuned for more soon.

      Cheers

      Stephen

  3. Interesting to see that the polling mechanism gives better latency results. Is that a function of the OS being used (and what OS was used in this experiment)?

    The 100% read and QD=1 case is an interesting benchmark, but I usually get more questions around mixed workloads (at varying percentages) and the effects of reads stalled behind writes/erases.

    Amnon

    • Hi Amnon

      Thanks for the question and good to hear from you! The polling results give better latency because the CPU is constantly reading the tail of the completion queue, looking for the I/O to complete. This is different from the default driver, which uses an MSI-X interrupt to wake the OS thread when the completion occurs. As noted in the article, you trade CPU load for latency when you switch from interrupts to polling. Results will vary from one OS to another and even from one kernel version to another.

      I agree that there are other workloads that are very interesting to look at. For this article we chose the QD=1 random read workload for four reasons:

      1. It is the lowest-latency workload for random reads.
      2. The single-threaded workload made analyzing the results easier.
      3. We wanted to avoid write-related issues like write amplification (WA) and garbage collection (GC) in this analysis.
      4. We could not be bothered to pre-condition the drives ;-).

      That said, doing a similar analysis for a 70/30 read/write mix on a pre-conditioned drive would be interesting and hopefully we will do that soon!

      Cheers

      Stephen
