One of the questions we asked ourselves when we started developing XtremIO X2 was "How do you keep improving performance if the storage controller's CPUs can't cope with the load?"
For those who aren't sure what I'm talking about, let's recap the recent history of Intel CPU development. I suggest you have a quick read here,
Now, this comment is of course a little bit extreme, but like many of the examples I'm using, it gets the message across. How is this related to storage? SSD drives are getting faster every year; they are no longer limited by the HDD's 15K RPM speed. And just like any other performance load balancing exercise, once you resolve a bottleneck in one part of your environment, it moves to the next component, which in the case of an AFA is the CPUs inside the storage controllers, nodes, engines, or whatever you want to call them.
XtremIO already uses a scale-out architecture, which means ALL the available storage controllers in the cluster are active (up to 8 X-Bricks, 16 Active/Active controllers), but we wanted to take it to the next level. Why? Because you see so many vendors out there preaching NVMe media, but what they DON'T tell you is that they don't really take advantage of the performance that media can offer, because they are bound by their dual-controller Active/Passive architecture. When you release a new storage product, praising how fast you now are, but then write comments like "Up to 50% lower latency than Xyz for similar workloads in real-world scenarios" without breaking the results down by block size, you know there is an issue there and the marketing team is trying really hard to cover it.
We took an alternative route. Our performance design criteria had to meet these goals, using SSDs to provide:
- Faster performance than traditional NVMe architectures
- More cost-effective storage than NVMe (the ability to accommodate both tier-0 and tier-1 workloads)
Ok, so where DO you start? You start by analyzing your customers' workloads. Below you can see a slide we are sharing for the first time, which shows that the majority of our customer install base runs block sizes smaller than 16KB. This is a histogram chart, and a histogram is really the only way to break down a performance workload (or workloads, in our case).
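To make the histogram idea concrete, here is a minimal sketch of how you might bucket an IO trace by block size. The trace format and bucket edges are my own invention for illustration; real data would come from array telemetry or a tracing tool such as blktrace.

```python
from collections import Counter

# Hypothetical IO trace: (operation, block size in bytes) tuples.
io_trace = [("write", 4096), ("write", 8192), ("read", 4096),
            ("write", 512), ("read", 65536), ("write", 16384),
            ("write", 4096), ("read", 8192)]

# Bucket edges in bytes: 512B, 1K, 4K, 8K, 16K, 64K, 1M
buckets = [512, 1024, 4096, 8192, 16384, 65536, 1048576]

def bucket_for(size):
    """Return the smallest bucket that holds this IO size."""
    for b in buckets:
        if size <= b:
            return b
    return buckets[-1]

histogram = Counter(bucket_for(size) for _, size in io_trace)

# Share of IOs at or below 16K -- the population the slide highlights
small = sum(count for b, count in histogram.items() if b <= 16384)
print(f"IOs at or below 16K: {small / len(io_trace):.0%}")
```

Plotting `histogram` (bucket on the x-axis, count on the y-axis) gives exactly the kind of chart shown in the slide.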
These numbers weren't really a surprise to me. I began my career as a VMware "guy," and I knew that many generic workloads out there run small block sizes, but it was encouraging to see this backed up with real-life data. Add to this the fact that XtremIO is heavily used for:
- DBs (OLTP uses 8KB, DWH uses 64KB)
- VDI (4KB, 1KB, 512B and 1MB)
- Generic VMs (4KB, 512B etc., similar to VDI if you will)
Ok, so we knew we wanted to improve our already great performance and make it even better as we scale into the coming years, where performance is something you simply want "more" of. We also knew which block sizes we wanted to put the emphasis on.
So, we know which block sizes dominate. What next?
Implementing a Flux Capacitor, or as we call it, "Write Boost." For those who haven't seen "Back To The Future" (really??), the Flux Capacitor is the magic ingredient that makes a commodity DeLorean go back and forth in time.
To better understand the thinking behind the application performance boost, let's first look at our current X1 generation. Note that the full architecture is much more complex, but I did want to provide a very high-level view, because saying we added "magic" doesn't really cut it.
The IO arrives at the Routing layer, where, after the SCSI stack, the hash of the content is calculated.
In the next step, the Control layer handles the mapping table and identifies where (at what logical address) the content (hash) should be located.
Finally, the Data layer handles the translation to the Physical layer. Before being written to the drives, the data lands in the UDC, at which point we return the commit to the host.
The entire operation, from the SCSI stack until the data lands in the UDC, is synchronous, while the de-stage from the UDC to the drives (Physical layer) is asynchronous.
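The X1 write path described above can be sketched very schematically as follows. All the names here are invented for illustration; the real implementation is, as noted, far more involved.

```python
import hashlib

udc = []            # simplified stand-in for the UDC write buffer
control_table = {}  # content hash -> logical address (Control layer)
physical = {}       # logical address -> physical location (Data layer)

def handle_write(logical_address, data):
    """Synchronous part of the X1 write path, as sketched in the post."""
    # Routing layer: after the SCSI stack, compute the content hash
    fingerprint = hashlib.sha1(data).hexdigest()

    # Control layer: record where this content logically lives
    control_table[fingerprint] = logical_address

    # Data layer: land the data in the UDC...
    udc.append((logical_address, data))

    # ...and only now acknowledge the host. Everything up to this
    # point is synchronous on X1.
    return "commit"

def destage():
    """Asynchronous part: drain the UDC to the drives (Physical layer)."""
    while udc:
        addr, data = udc.pop(0)
        physical[addr] = data
```

The key observation is how many synchronous hops sit between the SCSI stack and the commit; that is exactly what X2 attacks next.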
Now to the changes in the new X2 architecture.
The process looks the same, with one major change in the Control layer: we added the Write Boost. The commit is returned to the host right after the data lands in the Boost, which eliminates several hops from the synchronous operation. The result is amazing: the improvement in latency is around 400% in several cases, and it allows XtremIO to address applications with 0.5 ms latency requirements!
The latter step, from the Boost to the Data layer, is now performed asynchronously. At this stage we also have another new, bandwidth-oriented optimization: we can now aggregate all small writes destined for the same 16K page.
This differentiates it from typical cache implementations in the industry and ensures that we will never run out of Boost space!
Below you can see a VDI workload of 3,500 VDI VMs running on a single X2 array at 0.2 ms latency!
On X1, the same load ran at 1.5 ms, and even above that in some cases.
But do VDI workloads resemble generic VSI workloads? Absolutely!
Generic VMs tend to use small block sizes, as you can see above.
The same rule applies to OLTP workloads. See the difference between X1 and X2 for the SAME OLTP workload: even though OLTP is 8KB-based, the small blocks that made up a very SMALL proportion of the IOs consumed a very LARGE chunk of the CPU ("Little's Law").
Note that in the example below I really took X1 outside of its comfort zone; it is more than capable of providing sub-ms latency, but I wanted to compare a very intense workload.
Little’s Law ??
Yes, Little's Law is a good way to calculate the average wait time of items in a queued system. Everything works as expected when there is no queuing involved, but there is ALWAYS queuing involved. Here's a real-life example using the ultimate source of truth, YouTube:
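For reference, Little's Law itself is just L = λ·W: the average number of items in the system equals the arrival rate times the average time each item spends in the system. A quick sanity check with storage-flavored numbers (illustrative, not measured from any array):

```python
# Little's Law: L = lambda * W
arrival_rate = 200_000  # IOPS arriving at the controllers (lambda)
latency_s = 0.0005      # 0.5 ms average time each IO spends in the system (W)

# Average number of IOs queued or in service at any instant (L)
in_flight = arrival_rate * latency_s
print(in_flight)
```

Since L grows linearly with W, cutting per-IO latency directly shrinks the queue the CPUs have to juggle, which is why shaving synchronous hops off the write path pays off so disproportionately.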
As you can see in the video above, the "10 items or less" checkout lane works really well UNTIL a queue forms in the lane. What if the cashier simply told the person who couldn't find his credit card to step aside with a ticket to pay later, so the rest of the people in line could proceed to checkout? This is what we are doing in X2: every small block is acknowledged on at least two storage controllers and eventually finds its way to the drives. The performance gains of doing so are amazing, as is evident from the real-world example above!
Below you can see a video I recorded with Zvi Schneider, our chief architect, discussing the performance improvements.