Virident vCache vs. FlashCache: Part 2

This is the second part in a two-part series comparing Virident’s vCache to FlashCache. The first part was focused on usability and feature comparison; in this post, we’ll look at some sysbench test results.

Disclosure: The research and testing conducted for this post were sponsored by Virident.

First, some background information. All tests were conducted on Percona’s Cisco UCS C250 test machine, and both the vCache and FlashCache tests used the same 2.2TB Virident FlashMAX II as the cache storage device. EXT4 is the filesystem, and CentOS 6.4 the operating system, although the pre-release modules I received from Virident required the use of the CentOS 6.2 kernel, 2.6.32-220, so that was the kernel in use for all of the benchmarks on both systems. The benchmark tool used was sysbench 0.5 and the version of MySQL used was Percona Server 5.5.30-rel30.1-465. Each test was allowed to run for 7200 seconds, and the first 3600 seconds were discarded as warmup time; the remaining 3600 seconds were averaged into 10-second intervals. All tests were conducted with approximately 78GiB of data (32 tables, 10M rows each) and a 4GiB buffer pool. The cache devices were flushed to disk immediately prior to and immediately following each test run.
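
The warmup handling described above can be sketched in a few lines of Python. This is only an illustration, assuming the per-second throughput figures reported by sysbench have already been parsed into a list; the function name `summarize_run` and its parameters are hypothetical and not part of sysbench.

```python
from statistics import mean

def summarize_run(per_second_tps, warmup_s=3600, bucket_s=10):
    """Drop the warmup period, then average the rest into fixed-size buckets.

    per_second_tps: one throughput sample per second of the run
    (hypothetical input; parsing the sysbench output is not shown).
    """
    steady = per_second_tps[warmup_s:]
    # Keep only complete buckets so every average covers bucket_s samples.
    usable = len(steady) - len(steady) % bucket_s
    return [mean(steady[i:i + bucket_s]) for i in range(0, usable, bucket_s)]

# Synthetic 7200-second run: 100 tps during warmup, 500 tps afterwards.
samples = [100.0] * 3600 + [500.0] * 3600
buckets = summarize_run(samples)
print(len(buckets), buckets[0])
```

With a 7200-second run, this leaves 360 data points per test, none of which include warmup samples.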

With that out of the way, let’s look at some numbers.

vCache vs. vCache – MySQL parameter testing

The first test was designed to look solely at vCache performance under some different sets of MySQL configuration parameters. For example, given that the front-end device is a very fast PCIe SSD, would it make more sense to configure MySQL as if it were using SSD storage or to just use an optimized HDD storage configuration? After creating a vCache device with the default configuration, I started with a baseline HDD configuration for MySQL (configuration A, listed at the bottom of this post) and then tried three additional sets of experiments. First, the baseline configuration plus:

innodb_read_io_threads = 16
innodb_write_io_threads = 16

We call this configuration B. The next one contained four SSD-specific optimizations based partially on some earlier work that I’d done with this Virident card (configuration C):

innodb_io_capacity = 30000
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages = none
innodb_max_dirty_pages_pct = 60

And then finally, a fourth test (configuration D) which combined the parameter changes from tests B and C. The graph below shows the sysbench throughput (tps) for these four configurations:
vcache_trx_params
As we can see, all of the configuration options produce numbers that, in the absence of outliers, are roughly identical, but it’s configuration C (shown in the graph as the blue line – SSD config) which shows the most consistent performance. The others all have assorted performance drops scattered throughout the graph. We see the exact same pattern when looking at transaction latency; the baseline numbers are roughly identical for all four configurations, but configuration C avoids the spikes and produces a very constant and predictable result.
vcache_response_params

vCache vs. FlashCache – the basics

Once I’d determined that configuration C appeared to produce the best results, I moved on to reviewing FlashCache performance versus that of vCache. For purposes of comparison, I also included a “no cache” test run using the base HDD MySQL configuration. Given the apparent differences in time-based flushing in vCache and FlashCache, both cache devices were set up so that time-based flushing was disabled. Also, both devices were set up such that all IO would be cached (i.e., no special treatment of sequential writes) and with a 50% dirty page threshold. Again, for comparison purposes, I also include the numbers from the vCache test where the time-based flushing is enabled.
vcache_fcache_trx_params
As we’d expect, the HDD-only solution barely registered on the graph. With a buffer pool that’s much smaller than the working set, the no-cache approach is fairly crippled and ineffectual. FlashCache does substantially better, coming in at an average of around 600 tps, but vCache is about 3x better. The interesting item here is that vCache with time-based flushing enabled actually produces better and more consistent performance than vCache without time-based flushing, but even at its worst, the vCache test without time-based flushing still outperforms FlashCache by over 2x, on average.

Looking just at sysbench reads, vCache with time-based flushing consistently hit about 27000 per second, whereas without time-based flushing it averaged about 12500. FlashCache came in around 7500 or so. Sysbench writes came in just under 8000 for vCache + time-based flushing, around 6000 for vCache without time-based flushing, and somewhere around 2500 for FlashCache.
vcache_fcache_read_write

We can take a look at some vmstat data to see what’s actually happening on the system during all these various tests. Clockwise from the top left in the next graph, we have “no cache”, “FlashCache”, “vCache with no time-based flushing”, and “vCache with time-based flushing.” As the images demonstrate, the no-cache system is being crushed by IO wait. FlashCache and vCache both show improvements, but it’s not until we get to vCache with the time-based flushing that we see some nice, predictable, constant performance.
cpu-usage-all

So why is it the case that vCache with time-based flushing appears to outperform all the rest? My hypothesis here is that time-based flushing allows the backing store to be written to at a more constant and, potentially, submaximal, rate compared to dirty-page-threshold flushing, which kicks in at a given level and then attempts to flush as quickly as possible to bring the dirty pages back within acceptable bounds. This is, however, only a hypothesis.

vCache vs. FlashCache – dirty page threshold

Finally, we examine the impact of a couple of different dirty-page ratios on device performance, since this is the only parameter which can be reliably varied between the two in the same way. The following graph shows sysbench OLTP performance for FlashCache vs. vCache with a 10% dirty threshold versus the same metrics at a 50% dirty threshold. Time-based flushing has been disabled. In this case, both systems produce better performance when the dirty-page threshold is set to 50%, but once again, vCache at 10% outperforms FlashCache at 10%.

vcache-dirty_trx_params

The one interesting item here is that vCache actually appears to get *better* over time. I’m not entirely sure why that’s the case, or at what point the performance would level off, since these tests were all run for only 2 hours. Still, I think the overall results speak for themselves: even with a vCache volume where the dirty ratio is only 10%, such as might be the case where a deployment has a massive data set size in relation to both the working set and the cache device size, the numbers are encouraging.

Conclusion

Overall, I think the graphs speak for themselves. When the working set outstrips the available buffer pool memory but still fits into the cache device, vCache shines. Compared to a deployment with no SSD cache whatsoever, FlashCache still does quite well, massively outperforming the HDD-only setup, but it doesn’t even really come close to the numbers obtained with vCache. There may be ways to adjust the FlashCache configuration to produce better or more consistent results, or results that are more in line with the numbers put up by vCache, but when we consider that overall usability was one of the evaluation points and combine that with the fact that the best vCache performance results were obtained with the default vCache configuration, I think vCache can be declared the clear winner.

Base MySQL & Benchmark Configuration

All benchmarks were conducted with the following:

sysbench --num-threads=32 --test=tests/db/oltp.lua --oltp_tables_count=32 \
--oltp-table-size=10000000 --rand-init=on --report-interval=1 --rand-type=pareto \
--forced-shutdown=1 --max-time=7200 --max-requests=0 --percentile=95 \
--mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-table-engine=innodb \
--oltp-read-only=off run

The base MySQL configuration (configuration A) appears below:

#####fixed innodb options 
innodb_file_format = barracuda 
innodb_buffer_pool_size = 4G 
innodb_file_per_table = true 
innodb_data_file_path = ibdata1:100M
innodb_flush_method = O_DIRECT 
innodb_log_buffer_size = 128M 
innodb_flush_log_at_trx_commit = 1 
innodb_log_file_size = 1G 
innodb_log_files_in_group = 2 
innodb_purge_threads = 1 
innodb_fast_shutdown = 1 
#not innodb options (fixed) 
back_log = 50 
wait_timeout = 120 
max_connections = 5000 
max_prepared_stmt_count=500000 
max_connect_errors = 10 
table_open_cache = 10240 
max_allowed_packet = 16M 
binlog_cache_size = 16M 
max_heap_table_size = 64M 
sort_buffer_size = 4M 
join_buffer_size = 4M 
thread_cache_size = 1000 
query_cache_size = 0 
query_cache_type = 0 
ft_min_word_len = 4 
thread_stack = 192K 
tmp_table_size = 64M 
server-id = 101
key_buffer_size = 8M 
read_buffer_size = 1M 
read_rnd_buffer_size = 4M 
bulk_insert_buffer_size = 8M 
myisam_sort_buffer_size = 8M 
myisam_max_sort_file_size = 10G 
myisam_repair_threads = 1 
myisam_recover 

Virident vCache vs. FlashCache: Part 1

This is part one of a two-part series.

Over the past few weeks I have been looking at a preview release of Virident’s vCache software, which is a kernel module and set of utilities designed to provide functionality similar to that of FlashCache. In particular, Virident engaged Percona to do a usability and feature-set comparison between vCache and FlashCache and also to conduct some benchmarks for the use case where the MySQL working set is significantly larger than the InnoDB buffer pool (thus leading to a lot of buffer pool disk reads) but still small enough to fit into the cache device. In this post and the next, I’ll present some of those results.

Disclosure: The research and testing for this post series were sponsored by Virident.

Usability is, to some extent, a subjective call, as I may have preferences for or against a certain mode of operation that others may not share, so readers may have a different opinion than mine, but on this point I call it an overall draw between vCache and FlashCache.

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a “cache everything” mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, a manual cache flush with vCache is a blocking operation. With FlashCache, echoing “1” to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it’s actually finished. I think both kinds of flushing are useful in different situations, and I’d like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I’d take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Stay tuned for part two of this series, wherein we’ll take a look at some benchmarks. There’s no razor-thin margin of victory for either side here: vCache outperforms FlashCache by a landslide.

Testing the Micron P320h

The Micron P320h SSD is an SLC-based PCIe solid-state storage device which claims to provide the highest read throughput of any server-grade SSD, and at Micron’s request, I recently took some time to put the card through its paces, and the numbers are indeed quite impressive.

For reference, the benchmarks for this device were performed primarily on a Dell R710 with 192GB of RAM and two Xeon E5-2660 processors that yield a total of 32 virtual cores.  This is the same machine which was used in my previous benchmark run.  A small handful of additional tests were also performed using the Cisco UCS C250. The operating system in use was CentOS 6.3, and for the sysbench fileIO tests, the EXT4 filesystem was used.  The card itself is the 700GB model.

So let’s take a look at the data.

With the sysbench fileIO test in asynchronous mode, read performance is an extremely steady 3202MiB/sec with almost no deviation. Write performance is also both very strong and very steady, coming in at a bit over 1730MiB/sec with a standard deviation of a bit less than 13MiB/sec.

realssd-asyncIO

When we calculate in the fact that the block size in use here is 16KiB, these numbers equate to over 110,000 write IOPS and almost 205,000 read IOPS.
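
The conversion from throughput to IOPS is simple arithmetic, sketched below in Python. The helper name `throughput_to_iops` is an assumption made for illustration; the input figures are the ones quoted above.

```python
KIB_PER_MIB = 1024

def throughput_to_iops(mib_per_sec, block_kib):
    """Convert throughput in MiB/sec to I/O operations per second."""
    return mib_per_sec * KIB_PER_MIB / block_kib

# Throughput figures from the asynchronous test, with 16KiB blocks.
read_iops = throughput_to_iops(3202, 16)
write_iops = throughput_to_iops(1730, 16)
print(f"read: {read_iops:,.0f} IOPS, write: {write_iops:,.0f} IOPS")
```

Each MiB holds 64 blocks of 16KiB, so the IOPS figure is simply the MiB/sec figure multiplied by 64.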

When we switch over to synchronous IO, we find that the card is quite capable of matching the asynchronous performance:

syncIO-throughput

Synchronous read reaches peak capacity somewhere between 32 and 64 threads, and synchronous write tops out somewhere between 64 and 128 threads. The latency numbers are equally impressive; the next two graphs show 95th and 99th-percentile response time, but there really isn’t much difference between the two.

syncIO-latency

At 64 read threads, we reach peak performance with latency of roughly 0.5 milliseconds; and at 128 write threads we have maximum throughput with latency just over 3ms.

How well does it perform with MySQL?  Exact results vary, depending upon the usual factors (read/write ratio, working set size, buffer pool size, etc.) but overall the card is extremely quick and handily outperforms the other cards that it was tested against. For example, in the graph below we compare the performance of the P320h on a standard TPCC-MySQL test to the original FusionIO and the Intel i910 with assorted buffer pool sizes:

tpcc-mysql-devicecompare

And in this graph we look at the card’s performance on sysbench OLTP:

sysbench-oltp-ext4xfs

It is worth noting here that EXT4 outperforms XFS by a fairly significant margin. The approximate raw numbers, in tabular format, are:

Buffer pool   EXT4    XFS
13GiB BP      22000    7500
25GiB BP      17000    9000
50GiB BP      21000   11000
75GiB BP      25000   15000
100GiB BP     31000   19000
125GiB BP     36000   25000

In the final analysis, there may or may not be faster cards out there, but the Micron P320h is the fastest one that I have personally seen to date.

Testing the Virident FlashMax II

Approximately 11 months ago, Vadim reported some test results from the Virident FlashMax 1400M, an MLC PCIe SSD device. Since that time, Virident has released the FlashMAX II, which promises both increased capacity and increased performance over the previous model. In this post, we present some benchmark results comparing this new model to its predecessor, and we find that indeed, the FlashMax II is a significant upgrade.

For reference, all of the FlashMax II benchmarks were performed on a Dell R710 with 192GB of RAM. This is a dual-socket Xeon E5-2660 machine with 16 physical and 32 virtual cores. (I had originally planned to use the Cisco UCS C250 that is often used for our test runs, but that machine ran into some unrelated hardware difficulties and was ultimately unavailable.)  The operating system in use was CentOS 6.3, and the filesystem used for the test was XFS, mounted with both the noatime and nodiratime options.  The card was physically formatted back to factory default settings in between the synchronous and asynchronous test suites. Note that factory default settings for the FlashMax II will cause it to be formatted in “maxcapacity” mode rather than “maxperformance” mode (maxperformance reserves some additional space internally to provide better write performance). In “maxcapacity” mode, the device tested provides approximately 2200GB of space. In “maxperformance” mode, it’s a bit less than 1900GB.

Without further ado, then, here are the numbers.

First, asynchronous random writes:

async-rndwr-warmup-lg

There is a warmup period of around 18 minutes or so, and after about 45 minutes the performance stabilizes and remains effectively constant, as shown by the next graph.

async-rndwr-8-64-lg

Once the write performance reaches equilibrium, it does so at just under 780MiB/sec, which is approximately 40% higher than the 550MiB/sec exhibited by the FlashMax 1400M.

Asynchronous random read is up next:

async-rndrd-128-lg

The behavior of the FlashMax II is very similar to that of the FlashMax 1400M in terms of predictable performance; the standard deviation on the asynchronous random read throughput measurement is only 5.7MiB/sec.  However, the overall read throughput is over 1000MiB/sec better with the FlashMax II: we see a read throughput of approximately 2580MiB/sec vs. 1450MiB/sec with the previous generation of hardware, an improvement of roughly 80%.

Finally, we take a look at synchronous random read.

sync-rndrd-128-lg

At 256 threads, read throughput tops out at 2090MiB/sec, which is about 20% less than the asynchronous results; given the small bump in throughput going from 128 to 256 threads and the doubling of latency that was also introduced there, this is likely about as good as it is going to get.

For comparison, the FlashMax 1400M synchronous random read test stopped after 64 threads, reaching a synchronous random read throughput of 1345MiB/sec and a 95th-percentile latency of 1.49ms.  With those same 64 threads, the FlashMax II reaches 1883MiB/sec with a 95th-percentile latency of 1.105ms.  This represents approximately 40% more throughput at roughly 25% lower latency.
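
The percentages quoted in this post can be rechecked with a few lines of Python. This is just a worked calculation over the figures given in the text; the helper name `pct_change` is an assumption made for illustration.

```python
def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100

# Figures quoted in the text (MiB/sec for throughput, ms for latency).
async_write = pct_change(780, 550)      # FlashMax II vs. 1400M, async write
async_read = pct_change(2580, 1450)     # FlashMax II vs. 1400M, async read
sync_read_64 = pct_change(1883, 1345)   # sync read throughput at 64 threads
latency_64 = pct_change(1.105, 1.49)    # 95th-percentile latency at 64 threads
sync_vs_async = pct_change(2090, 2580)  # FlashMax II sync read vs. async read

for name, value in [("async write", async_write), ("async read", async_read),
                    ("sync read @64", sync_read_64), ("latency @64", latency_64),
                    ("sync vs. async read", sync_vs_async)]:
    print(f"{name}: {value:+.1f}%")
```

A negative value means the new figure is lower than the old one, which is the desired direction for latency.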

In every area tested, the FlashMax II outperforms the original FlashMax 1400M by a significant margin, and can be considered a worthy successor.

On SSDs – Lifespans, Health Measurement and RAID

Solid-state drives (SSDs) have made it big, finding their way not only into desktop computing but also into mission-critical servers. SSDs have proved to be a breakthrough in IO performance and leave HDDs far behind in terms of random IO performance. Random IO is what most database administrators are concerned about, as it makes up 90% of the IO pattern seen on database servers like MySQL. I have found the Intel 520-series and Intel 910-series to be quite popular, and they do give very good numbers in terms of random IOPS. However, it’s not just performance that you should be concerned about; failure prediction and health gauges are also very important, as loss of data is a big NO-NO. There is a great deal of misconception about the endurance of SSDs, mostly because they are compared to rotating disks even when measuring endurance. However, there is a big difference in how SSDs and HDDs work, and that has a direct impact on SSD endurance.

I will mostly be talking about MLC SSDs. Let’s start off with an SSD primer.

SSD Primer

The smallest unit of SSD storage that can be read or written is a page, which is typically 4KB or 8KB in size. These pages are organized into blocks, which are typically between 256KB and 1MB in size. SSDs have no mechanical parts and no heads, so there are no seeks as with conventional rotating disks. Reads simply involve reading pages from the SSD; it’s the writes that are trickier. Once you have written to a page on an SSD, you cannot simply overwrite it with new data in the same way you can on an HDD. Instead, you must erase the contents and then write again. However, an SSD can only erase at the block level, not the page level. This means that the SSD must relocate any valid data in the block before the block can be erased and have new data written to it. To summarize, writes mean erase+write. Nowadays, SSD controllers are intelligent and do erasures in the background so that the latency of the write operation is not affected. These background erasures are typically done within a process known as garbage collection. You can imagine that if these erasures were not done in the background, writes would be too slow.

Of course, every SSD has a lifespan after which it can be considered unusable. Let’s see what factors matter here.

SSD Lifespans

The lifespan of the blocks that make up an SSD is really the number of times erasures and writes can be performed on those blocks. The lifespan is measured in erase/write cycles. Typically, enterprise-grade MLC SSDs have a lifespan of about 30000 erase/write cycles, while consumer-grade MLC SSDs have a lifespan of 5000 to 10000 erase/write cycles. This makes it clear that the lifespan of an SSD depends on how much it is written to. If you have a write-intensive workload, then you should expect the SSD to fail much more quickly than it would under a read-heavy workload. This is by design.
To offset this effect of writes reducing the life of an SSD, engineers use two techniques: wear-levelling and over-provisioning. Wear-levelling works by making sure that all the blocks in an SSD are erased and written to in an evenly distributed fashion, so that some blocks do not die more quickly than others. Over-provisioning is another technique that increases SSD endurance. This is accomplished by having a large population of blocks over which to distribute erases and writes (a bigger-capacity SSD), and by providing a large spare area. Many SSD models over-provision the space; for example, an 80GB SSD could have 10GB of over-provisioned space, so that while it is actually 90GB in size, it is reported as an 80GB SSD. While this over-provisioning is done by the SSD manufacturers, it can also be done by not utilising the entire SSD, for example by partitioning only about 75% to 80% of the SSD and leaving the rest as raw space that is not visible to the OS/filesystem. So while over-provisioning takes away some part of the disk capacity, it gives back in terms of increased endurance and performance.
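
To put rough numbers on the two ideas above, here is a back-of-the-envelope sketch in Python. It assumes perfect wear-levelling, and the daily write volume and write amplification factor are hypothetical inputs that do not come from any measurement in this post; treat the output as an order-of-magnitude estimate only.

```python
def overprovisioning_pct(raw_gb, visible_gb):
    """Spare capacity as a percentage of the user-visible capacity."""
    return (raw_gb - visible_gb) / visible_gb * 100

def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=1.0):
    """Rough lifetime, assuming perfect wear-levelling across all blocks.

    write_amplification is the ratio of flash writes to host writes; it is
    an assumed value here, as it depends on the workload and the controller.
    """
    total_writable_gb = capacity_gb * pe_cycles / write_amplification
    return total_writable_gb / host_writes_gb_per_day / 365

# The 80GB drive with 10GB of spare area from the example above.
print(f"over-provisioning: {overprovisioning_pct(90, 80):.1f}%")

# Hypothetical workload: 500GB written per day, write amplification of 2.
for label, cycles in [("consumer MLC", 5000), ("enterprise MLC", 30000)]:
    years = estimated_lifetime_years(80, cycles, 500, write_amplification=2.0)
    print(f"{label}: ~{years:.1f} years")
```

The estimate scales linearly with the erase/write cycle rating, which is why the same workload wears out a consumer-grade drive several times faster than an enterprise-grade one.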

Now comes the important part of the post that I would like to discuss.

Health Measurement and failure predictability

As you may have noticed after reading the above, it’s all the more important to be able to predict when an SSD will fail and to be able to see health-related information about it. Yet I haven’t found much written about how to gauge the health of an SSD. RAID controllers used with SSDs tend to be very limited in the amount of information they provide that could help predict when an SSD might fail. However, most SSDs provide a lot of information via S.M.A.R.T., and this can be leveraged to good effect.
Let’s consider the example of Intel SSDs. These SSDs have two S.M.A.R.T. attributes that can be leveraged to predict when the SSD will fail. These attributes are:

  • Available_Reservd_Space: This attribute reports the number of reserve blocks remaining. The value of the attribute starts at 100, which means that the reserved space is 100 percent available. The threshold value for this attribute is 10 which means 10 percent availability, which indicates that the drive is close to its end of life.
  • Media_Wearout_Indicator: This attribute reports the number of erase/write cycles the NAND media has performed. The value of the attribute decreases from 100 to 1, as the average erase cycle count increases from 0 to the maximum rated cycles. Once the value of this attribute reaches 1, the number will not decrease, although it is likely that significant additional wear can be put on the device. A value of 1 should be thought of as the threshold value for this attribute.

Using the smartctl tool (part of the smartmontools package), we can very easily read the values of these attributes and then use them to predict failures. For example, for SATA SSDs attached to an LSI MegaRAID controller, we could read the values of those attributes using the following bash snippet:

# Column 4 of smartctl's attribute table is the normalized VALUE.
Available_Reservd_Space_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | awk '/Available_Reservd_Space/ {print $4}')
Media_Wearout_Indicator_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | awk '/Media_Wearout_Indicator/ {print $4}')

The above information can then be used in different ways: we could raise an alert if a value is nearing its threshold, or measure how quickly the values decrease and use that rate of decrease to estimate when the drive could fail.
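
As a rough illustration of the rate-of-decrease idea, the following Python sketch linearly extrapolates when an attribute would reach its threshold. It assumes you have already collected dated readings of the normalized value (for example with the snippet above); the function name and the sample readings are hypothetical, and real wear is not guaranteed to be linear.

```python
from datetime import date, timedelta

def estimate_threshold_date(readings, threshold):
    """Linearly extrapolate when a decreasing SMART value hits its threshold.

    readings: list of (date, normalized_value) tuples, oldest first.
    Returns None if the value has not decreased over the observed period.
    """
    (first_day, first_val), (last_day, last_val) = readings[0], readings[-1]
    elapsed_days = (last_day - first_day).days
    dropped = first_val - last_val
    if elapsed_days <= 0 or dropped <= 0:
        return None
    days_per_point = elapsed_days / dropped
    remaining = max(last_val - threshold, 0)
    return last_day + timedelta(days=round(remaining * days_per_point))

# Hypothetical Media_Wearout_Indicator samples (threshold of 1, per the text).
history = [(date(2013, 1, 1), 100), (date(2013, 4, 11), 90)]
print(estimate_threshold_date(history, threshold=1))
```

Only the first and last readings are used here to keep the sketch short; a fit over all readings would be less sensitive to a single noisy sample.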

SSDs and RAID levels

RAID has typically been used with HDDs for data protection via redundancy and for increased performance, and it has found its use with SSDs as well. It’s common to see RAID level 5 or 6 being used with SSDs on mixed read/write workloads, because the write penalty seen when using these levels with rotating disks is much smaller with SSDs: there is no disk seek involved, so the read-modify-write cycle typically involved with parity-based RAID levels does not cause much of a performance hit. On the other hand, striping and mirroring improve SSD read performance a lot, and redundant arrays using SSDs deliver far better performance than HDD arrays.
But what about data protection? Do the parity-based RAID levels and mirroring provide the level of data protection for SSDs that they are thought to? I am skeptical about that because, as I have mentioned above, the endurance of an SSD depends a lot on how much it has been written to. In parity-based RAID configurations, a lot of extra writes are generated because of parity changes, and these of course decrease the lifespan of the SSDs. Similarly, in the case of mirroring, I am not sure it can provide any benefit against SSD wear-out if both SSDs in the mirror are the same age. Why? Because in mirroring, both SSDs in the array receive the same amount of writes, and hence their lifespans decrease at the same rate.
I think that some drastic changes are needed in how we think about data protection and RAID levels, because to me, parity-based or mirrored configurations are not going to provide any extra data protection in cases where the SSDs used are of similar ages. It might actually be a good idea to periodically replace drives with younger ones so as to make sure that all the drives do not age together.
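
The concern about drives ageing together can be illustrated with a simplified write-count model in Python. This is a sketch under stated assumptions, not a description of any particular controller: it assumes small random writes spread evenly over the array, rotating parity, and no full-stripe writes or write caching. The function name is hypothetical.

```python
def per_drive_writes(level, drives, logical_writes):
    """Approximate device-level writes each drive receives.

    Assumes small random writes spread evenly over the array, with rotating
    parity and no full-stripe writes or controller caching (a simplification).
    """
    device_writes_per_logical = {
        "raid1": drives,  # every mirror member receives a copy
        "raid5": 2,       # data block + one parity block
        "raid6": 3,       # data block + two parity blocks
    }[level]
    return logical_writes * device_writes_per_logical / drives

# One million logical writes against three hypothetical arrays.
for level, drives in [("raid1", 2), ("raid5", 4), ("raid6", 6)]:
    writes = per_drive_writes(level, drives, 1_000_000)
    print(f"{level} x{drives}: {writes:,.0f} writes per drive")
```

In this model every member of an array receives the same number of writes, so drives of the same age and model would be expected to approach their wear limits at about the same time.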

I would like to know what my readers think!

Intel SSD 910 in tpcc-mysql benchmark

I continue my benchmarks of the Intel SSD 910; the raw IO results are available in my previous experiment. Now I want to test this card under a MySQL workload to see if it is suitable for use with MySQL.

  • Benchmark date: Sep-2012
  • Benchmark goal: Test Intel SSD 910 under tpcc-mysql workload and compare with baseline Fusion-io ioDrive card
  • Hardware specification
    • Server: Dell PowerEdge R710
    • CPU: 2x Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
    • Memory: 192GB
    • Storage: Fusion-io ioDrive 640GB, Intel SSD 910 (software RAID over 2x200GB devices)
    • Filesystem: ext4
  • Software
    • OS: Ubuntu 12.04.1
    • MySQL Version: Percona Server 5.5.27-28.1
  • Benchmark specification
    • Benchmark name: tpcc-mysql
    • Scale factor: 2500W (~250GB of data)
    • Benchmark length: 2h, but the result is taken only for last 1h to remove warm-up phase
  • Parameters to vary: we vary innodb_buffer_pool_size (13, 25, 50, 75GB) to get different memory/data ratios, and we test on two storage devices: Fusion-io ioDrive and Intel SSD 910
  • Results
    Here is a graph of throughput, sampled every 10 sec:

    Jitter graph:

    And as a final result, I take the total number of transactions over 1h:

    BP size   Fusion-io   Intel SSD 910   Ratio (fio/i910)
    13 GB        397157          352750   1.13
    25 GB        724011          497769   1.45
    50 GB       1466559         1124223   1.30
    75 GB       2464135         1939415   1.27
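
The ratio column can be reproduced directly from the transaction totals. The short Python sketch below does so; the dictionary layout and function name are assumptions made for illustration.

```python
# Total transactions over the measured hour, taken from the table above:
# buffer pool size in GB -> (Fusion-io, Intel SSD 910).
results = {
    13: (397157, 352750),
    25: (724011, 497769),
    50: (1466559, 1124223),
    75: (2464135, 1939415),
}

def ratios(table):
    """Fusion-io / Intel SSD 910 ratio for each buffer pool size."""
    return {bp: round(fio / i910, 2) for bp, (fio, i910) in table.items()}

for bp, ratio in ratios(results).items():
    print(f"{bp} GB: {ratio:.2f}")
```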

    Conclusion

    In conclusion, I see that the Intel SSD 910 handles a MySQL workload quite well; I did not face any problems working with this card.
    The stability of the results is about the same as with the Fusion-io card. The Fusion-io card completes roughly 30% more transactions on average, but
    that is expected at this price level. I think the Intel SSD 910 is suitable for use with MySQL / Percona Server.

    Link to raw results and stats
    Raw results, config, OS and MySQL metrics are available from Benchmarks Launchpad.

    Testing Intel® SSD 910

    Intel entered the PCIe SSD market with the Intel SSD 910 card. Given the slogan “The ultimate data center SSD”, I assume Intel is targeting server-grade hardware rather than the consumer level.
    I’ve got one of these cards in our lab. I should say it is very price competitive compared with other enterprise-level PCIe vendors. For a 400GB card I paid $2100, which works out to $5.25/GB. Of course, I’ve got some performance numbers I’d like to share.

    But before that, a few words on the card internals. Intel uses separate 200GB modules, so a 400GB card is visible as 2 x 200GB devices in the operating system, and an 800GB card is visible as 4 separate devices. After that you can set up software raid0, raid1 or raid10, whatever you prefer.

    For my tests I used a single 200GB device and a pair combined in software raid0 (Duo).

    For raw IO performance I follow the scripts I used for other reviews, such as Testing Intel SSD 520.

    First results are for asynchronous writes:

    The result averages 150 MiB/sec for a single device and 250 MiB/sec for Duo.
    I find this interesting, as on the SATA-based Intel 520 I was able to get 300 MiB/sec.

    Now asynchronous reads:

    The result line is quite stable at 270 MiB/sec for a single device and 530 MiB/sec for Duo.
    In the same workload, the Intel 520 reached 370 MiB/sec.

    Now we are getting to synchronous reads, to see how many threads we need to reach peak throughput and check corresponding response times:

    Throughput:

    Response time:

    I would say that for a single device the throughput peaks at 8 threads with a 95% response time of 0.68ms, and for Duo at 16 threads with 0.84ms.

    In conclusion, I can say that I have mixed feelings after this experiment. On the one hand, the performance results are definitely lower than those of alternative PCIe cards available on the market, but on the other hand, the price is very attractive.

    I am going to run more MySQL-based benchmarks to see how the card compares to the alternatives under a database workload.