Category Archives: benchmarks

Virident vCache vs. FlashCache: Part 2

This is the second part in a two-part series comparing Virident’s vCache to FlashCache. The first part was focused on usability and feature comparison; in this post, we’ll look at some sysbench test results.

Disclosure: The research and testing conducted for this post were sponsored by Virident.

First, some background information. All tests were conducted on Percona’s Cisco UCS C250 test machine, and both the vCache and FlashCache tests used the same 2.2TB Virident FlashMAX II as the cache storage device. EXT4 is the filesystem, and CentOS 6.4 the operating system, although the pre-release modules I received from Virident required the use of the CentOS 6.2 kernel, 2.6.32-220, so that was the kernel in use for all of the benchmarks on both systems. The benchmark tool used was sysbench 0.5 and the version of MySQL used was Percona Server 5.5.30-rel30.1-465. Each test was allowed to run for 7200 seconds, and the first 3600 seconds were discarded as warmup time; the remaining 3600 seconds were averaged into 10-second intervals. All tests were conducted with approximately 78GiB of data (32 tables, 10M rows each) and a 4GiB buffer pool. The cache devices were flushed to disk immediately prior to and immediately following each test run.

With that out of the way, let’s look at some numbers.

vCache vs. vCache – MySQL parameter testing

The first test was designed to look solely at vCache performance under some different sets of MySQL configuration parameters. For example, given that the front-end device is a very fast PCIe SSD, would it make more sense to configure MySQL as if it were using SSD storage or to just use an optimized HDD storage configuration? After creating a vCache device with the default configuration, I started with a baseline HDD configuration for MySQL (configuration A, listed at the bottom of this post) and then tried three additional sets of experiments. First, the baseline configuration plus:

innodb_read_io_threads = 16
innodb_write_io_threads = 16

We call this configuration B. The next one contained four SSD-specific optimizations based partially on some earlier work that I’d done with this Virident card (configuration C):

innodb_io_capacity = 30000
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages=none
innodb_max_dirty_pages_pct = 60

And then finally, a fourth test (configuration D) which combined the parameter changes from tests B and C. The graph below shows the sysbench throughput (tps) for these four configurations:
vcache_trx_params
As we can see, all of the configuration options produce numbers that, in the absence of outliers, are roughly identical, but it’s configuration C (shown in the graph as the blue line – SSD config) which shows the most consistent performance. The others all have assorted performance drops scattered throughout the graph. We see the exact same pattern when looking at transaction latency; the baseline numbers are roughly identical for all four configurations, but configuration C avoids the spikes and produces a very constant and predictable result.
vcache_response_params

vCache vs. FlashCache – the basics

Once I’d determined that configuration C appeared to produce the most optimal results, I moved on to reviewing FlashCache performance versus that of vCache, and I also included a “no cache” test run as well using the base HDD MySQL configuration for purposes of comparison. Given the apparent differences in time-based flushing in vCache and FlashCache, both cache devices were set up so that time-based flushing was disabled. Also, both devices were set up such that all IO would be cached (i.e., no special treatment of sequential writes) and with a 50% dirty page threshold. Again, for comparison purposes, I also include the numbers from the vCache test where the time-based flushing is enabled.
vcache_fcache_trx_params
As we’d expect, the HDD-only solution barely registered on the graph. With a buffer pool that’s much smaller than the working set, the no-cache approach is fairly crippled and ineffectual. FlashCache does substantially better, coming in at an average of around 600 tps, but vCache is about 3x better. The interesting item here is that vCache with time-based flushing enabled actually produces better and more consistent performance than vCache without time-based flushing, but even at its worst, the vCache test without time-based flushing still outperforms FlashCache by over 2x, on average.

Looking just at sysbench reads, vCache with time-based flushing consistently hit about 27000 per second, whereas without time-based flushing it averaged about 12500. FlashCache came in around 7500 or so. Sysbench writes came in just under 8000 for vCache + time-based flushing, around 6000 for vCache without time-based flushing, and somewhere around 2500 for FlashCache.
vcache_fcache_read_write

We can take a look at some vmstat data to see what’s actually happening on the system during all these various tests. Clockwise from the top left in the next graph, we have “no cache”, “FlashCache”, “vCache with no time-based flushing”, and “vCache with time-based flushing.” As the images demonstrate, the no-cache system is being crushed by IO wait. FlashCache and vCache both show improvements, but it’s not until we get to vCache with the time-based flushing that we see some nice, predictable, constant performance.
cpu-usage-all

So why is it the case that vCache with time-based flushing appears to outperform all the rest? My hypothesis here is that time-based flushing allows the backing store to be written to at a more constant and, potentially, submaximal, rate compared to dirty-page-threshold flushing, which kicks in at a given level and then attempts to flush as quickly as possible to bring the dirty pages back within acceptable bounds. This is, however, only a hypothesis.

vCache vs. FlashCache – dirty page threshold

Finally, we examine the impact of a couple of different dirty-page ratios on device performance, since this is the only parameter which can be reliably varied between the two in the same way. The following graph shows sysbench OLTP performance for FlashCache vs. vCache with a 10% dirty threshold versus the same metrics at a 50% dirty threshold. Time-based flushing has been disabled. In this case, both systems produce better performance when the dirty-page threshold is set to 50%, but once again, vCache at 10% outperforms FlashCache at 10%.

vcache-dirty_trx_params

The one interesting item here is that vCache actually appears to get *better* over time; I’m not entirely sure why that’s the case or at what point the performance is going to level off since these tests were all run for 2 hours anyway, but I think the overall results still speak for themselves, and even with a vCache volume where the dirty ratio is only 10%, such as might be the case where a deployment has a massive data set size in relation to both the working set and the cache device size, the numbers are encouraging.

Conclusion

Overall, the I think the graphs speak for themselves. When the working set outstrips the available buffer pool memory but still fits into the cache device, vCache shines. Compared to a deployment with no SSD cache whatsoever, FlashCache still does quite well, massively outperforming the HDD-only setup, but it doesn’t even really come close to the numbers obtained with vCache. There may be ways to adjust the FlashCache configuration to produce better or more consistent results, or results that are more inline with the numbers put up by vCache, but when we consider that overall usability was one of the evaluation points and combine that with the fact that the best vCache performance results were obtained with the default vCache configuration, I think vCache can be declared the clear winner.

Base MySQL & Benchmark Configuration

All benchmarks were conducted with the following:

sysbench ­­--num­-threads=32 ­­--test=tests/db/oltp.lua ­­--oltp_tables_count=32 \
--oltp­-table­-size=10000000 ­­--rand­-init=on ­­--report­-interval=1 ­­--rand­-type=pareto \
--forced­-shutdown=1 ­­--max­-time=7200 ­­--max­-requests=0 ­­--percentile=95 ­­\
--mysql­-user=root --mysql­-socket=/tmp/mysql.sock ­­--mysql­-table­-engine=innodb ­­\
--oltp­-read­-only=off run

The base MySQL configuration (configuration A) appears below:

#####fixed innodb options 
innodb_file_format = barracuda 
innodb_buffer_pool_size = 4G 
innodb_file_per_table = true 
innodb_data_file_path = ibdata1:100M
innodb_flush_method = O_DIRECT 
innodb_log_buffer_size = 128M 
innodb_flush_log_at_trx_commit = 1 
innodb_log_file_size = 1G 
innodb_log_files_in_group = 2 
innodb_purge_threads = 1 
innodb_fast_shutdown = 1 
#not innodb options (fixed) 
back_log = 50 
wait_timeout = 120 
max_connections = 5000 
max_prepared_stmt_count=500000 
max_connect_errors = 10 
table_open_cache = 10240 
max_allowed_packet = 16M 
binlog_cache_size = 16M 
max_heap_table_size = 64M 
sort_buffer_size = 4M 
join_buffer_size = 4M 
thread_cache_size = 1000 
query_cache_size = 0 
query_cache_type = 0 
ft_min_word_len = 4 
thread_stack = 192K 
tmp_table_size = 64M 
server­id = 101 
key_buffer_size = 8M 
read_buffer_size = 1M 
read_rnd_buffer_size = 4M 
bulk_insert_buffer_size = 8M 
myisam_sort_buffer_size = 8M 
myisam_max_sort_file_size = 10G 
myisam_repair_threads = 1 
myisam_recover 

Virident vCache vs. FlashCache: Part 1

This is part one of a two part series.

Over the past few weeks I have been looking at a preview release of Virident’s vCache software, which is a kernel module and set of utilities designed to provide functionality similar to that of FlashCache. In particular, Virident engaged Percona to do a usability and feature-set comparison between vCache and FlashCache and also to conduct some benchmarks for the use case where the MySQL working set is significantly larger than the InnoDB buffer pool (thus leading to a lot of buffer pool disk reads) but still small enough to fit into the cache device. In this post and the next, I’ll present some of those results.

Disclosure: The research and testing for this post series was sponsored by Virident.

Usability is, to some extent, a subjective call, as I may have preferences for or against a certain mode of operation that others may not share, so readers may have a different opinion than mine, but on this point I call it an overall draw between vCache and FlashCache.

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a “cache everything” mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, when doing a manual cache flush with vCache, this is a blocking operation. With FlashCache, echoing “1″ to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it’s actually finished. I think both kinds of flushing are useful in different situations, and I’d like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I’ll take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Stay tuned for part two of this series, wherein we’ll take a look at some benchmarks. There’s no razor-thin margin of victory for either side here: vCache outperforms FlashCache by a landslide.

Testing the Micron P320h

The Micron P320h SSD is an SLC-based PCIe solid-state storage device which claims to provide the highest read throughput of any server-grade SSD, and at Micron’s request, I recently took some time to put the card through its paces, and the numbers are indeed quite impressive.

For reference, the benchmarks for this device were performed primarily on a Dell R710 with 192GB of RAM and two Xeon E5-2660 processors that yield a total of 32 virtual cores.  This is the same machine which was used in my previous benchmark run.  A small handful of additional tests were also performed using the Cisco UCS C250. The operating system in use was CentOS 6.3, and for the sysbench fileIO tests, the EXT4 filesystem was used.  The card itself is the 700GB model.

So let’s take a look at the data.

With the sysbench fileIO test in asynchronous mode, read performance is an extremely steady 3202MiB/sec with almost no deviation. Write performance is also both very strong and very steady, coming in at a bit over 1730MiB/sec with a standard deviation of a bit less than 13MiB/sec.

realssd-asyncIO

When we calculate in the fact that the block size in use here is 16KiB, these numbers equate to over 110,000 write IOPS and almost 205,000 read IOPS.

When we switch over to synchronous IO, we find that the card is quite capable of matching the asynchronous performance:

syncIO-throughput

Synchronous read reaches peak capacity somewhere between 32 and 64 threads, and synchronous write tops out somewhere between 64 and 128 threads. The latency numbers are equally impressive; the next two graphs show 95th and 99th-percentile response time, but there really isn’t much difference between the two.

syncIO-latency

At 64 read threads, we reach peak performance with latency of roughly 0.5 milliseconds; and at 128 write threads we have maximum throughput with latency just over 3ms.

How well does it perform with MySQL?  Exact results vary, depending upon the usual factors (read/write ratio, working set size, buffer pool size, etc.) but overall the card is extremely quick and handily outperforms the other cards that it was tested against. For example, in the graph below we compare the performance of the P320h on a standard TPCC-MySQL test to the original FusionIO and the Intel i910 with assorted buffer pool sizes:

tpcc-mysql-devicecompare

And in this graph we look at the card’s performance on sysbench OLTP:

sysbench-oltp-ext4xfs

It is worth noting here that EXT4 outperforms XFS by a fairly significant margin. The approximate raw numbers, in tabular format, are:

- EXT4 XFS
13GiB BP 22000 7500
25GiB BP 17000 9000
50GiB BP 21000 11000
75GiB BP 25000 15000
100GiB BP 31000 19000
125GiB BP 36000 25000

In the final analysis, there may or may not be faster cards out there, but the Micron P320h is the fastest one that I have personally seen to date.

Testing the Virident FlashMax II

Approximately 11 months ago, Vadim reported some test results from the Virident FlashMax 1400M, an MLC PCIe SSD device. Since that time, Virident has released the FlashMAX II, which promises both increased capacity and increased performance over the previous model. In this post, we present some benchmark results comparing this new model to its predecessor, and we find that indeed, the FlashMax II is a significant upgrade.

For reference, all of the FlashMax II benchmarks were performed on a Dell R710 with 192GB of RAM. This is a dual-socket Xeon E5-2660 machine with 16 physical and 32 virtual cores. (I had originally planned to use the Cisco UCS C250 that is often used for our test runs, but that machine ran into some unrelated hardware difficulties and was ultimately unavailable.)  The operating system in use was CentOS 6.3, and the filesystem used for the test was XFS, mounted with both the noatime,nodiratime options.  The card was physically formatted back to factory default settings in between the synchronous and asynchronous test suites. Note that factory default settings for the FlashMax II will cause it to be formatted in “maxcapacity” mode rather than “maxperformance” mode (maxperformance reserves some additional space internally to provide better write performance). In “maxcapacity” mode, the device tested provides approximately 2200GB of space. In “maxperformance” mode, it’s a bit less than 1900GB.

Without further ado, then, here are the numbers.

First, asynchronous random writes:

async-rndwr-warmup-lg

There is a warmup period of around 18 minutes or so, and after about 45 minutes the performance stabilizes and remains effectively constant, as shown by the next graph.

async-rndwr-8-64-lg

Once the write performance reaches equilibrium, it does so at just under 780MiB/sec, which is approximately 40% higher than the 550MiB/sec exhibited by the FlashMax 1400M.

Asynchronous random read is up next:

async-rndrd-128-lg

The behavior of the FlashMax II is very similar to that of the FlashMax 1400M in terms of predictable performance; the standard deviation on the asynchronous random read throughput measurement is only 5.7MiB/sec.  However, the overall read throughput is over 1000MiB/sec better with the FlashMax II: we see a read throughput of approximately 2580MiB/sec vs. 1450MiB/sec with the previous generation of hardware, an improvement of roughly 80%.

Finally, we take a look at synchronous random read.

sync-rndrd-128-lgAt 256 threads, read throughput tops out at 2090MiB/sec, which is about 20% less than the asynchronous results; given the small bump in throughput going from 128 to 256 threads and the doubling of latency that was also introduced there, this is likely about as good as it is going to get.

For comparison, the FlashMax 1400M synchronous random read test stopped after 64 threads, reaching a synchronous random read throughput of 1345MiB/sec and a 95th-percentile latency of 1.49ms.  With those same 64 threads, the FlashMax II reaches 1883MiB/sec with a 95th-percentile latency of 1.105ms.  This represents approximately 40% more throughput, 25% faster.

In every area tested, the FlashMax II outperforms the original FlashMax 1400M by a significant margin, and can be considered a worthy successor.

Intel SSD 910 in tpcc-mysql benchmark

I continue my benchmarks of Intel SSD 910, the raw IO results are available in my previous experiment. Now I want to test this card under MySQL workload to see if the card is suitable to use with MySQL.

  • Benchmark date: Sep-2012
  • Benchmark goal: Test Intel SSD 910 under tpcc-mysql workload and compare with baseline Fusion-io ioDrive card
  • Hardware specification
    • Server: Dell PowerEdge R710
    • CPU: 2x Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
    • Memory: 192GB
    • Storage: Fusion-io ioDrive 640GB, Intel SSD 910 (software RAID over 2x200GB devices)
    • Filesystem: ext4
  • Software
    • OS: Ubuntu 12.04.1
    • MySQL Version: Percona Server 5.5.27-28.1
  • Benchmark specification
    • Benchmark name: tpcc-mysql
    • Scale factor: 2500W (~250GB of data)
    • Benchmark length: 2h, but the result is taken only for last 1h to remove warm-up phase
  • Parameters to vary: we vary innodb_buffer_pool_size: 13, 25, 50, 75GB to have different memory/data ration. And we test it on two storages: Fusion-io ioDrive and Intel SSD 910
  • Results
    There is graph of Throughput taken every 10 sec:

    Jitter graph:

    Or to have final results I take total amount of transactions for 1h:

    BP size Fusion-io Intel SSD 910 Ratio (fio/i910)
    13 GB 397157 352750 1.13
    25 GB 724011 497769 1.45
    50 GB 1466559 1124223 1.30
    75 GB 2464135 1939415 1.27

    Conclusion

    In conclusion I see that Intel SSD 910 handles MySQL workload quite well, I did not face any problem working with this card.
    Level of stability of results is about the same as with Fusion-io card. The performance of Intel SSD 910 is about ~30% worse, but
    it is expected for this price level. I think Intel SSD 910 is suitable to use with MySQL / Percona Server.

    Link to raw results and stats
    Raw results, config, OS and MySQL metrics are available from Benchmarks Launchpad.

    Testing Fusion-io ioDrive – now with driver 3.1

    In my previous post with results for Fusion-io ioDrive we saw some instability in results, I was pointed that it may be fixed in new drivers VSL 3.1.1. I am not sure if this driver is available for everyone – if you are interested, please contact your Fusion-io support representative. I installed new drivers and firmware, and in fact, the result improved.
    Continue reading