MLC SSD card lifetime and write amplification

As MLC-based SSD cards are gaining popularity, there is also growing concern about how long they can survive. As we know, an MLC NAND module can handle 5,000-10,000 erase cycles, after which it becomes unusable. So obviously an SSD card based on MLC NAND has a limited lifetime. There are a lot of misconceptions and misunderstandings about how long such a card can last, so I want to show some calculations to shed light on this question.
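As a rough illustration of the kind of arithmetic involved, here is a minimal back-of-envelope sketch. It is not the calculation from the full post; it assumes perfect wear leveling and a constant write amplification factor, and the capacity, daily write volume, and write amplification values are hypothetical inputs chosen only for the example.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification):
    """Rough endurance estimate, assuming perfect wear leveling.

    Every cell is assumed to survive `pe_cycles` erase cycles, and every GB
    written by the host is assumed to cost `write_amplification` GB of NAND
    writes.
    """
    if host_writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and write amplification must be positive")
    total_nand_writes_gb = capacity_gb * pe_cycles
    host_write_budget_gb = total_nand_writes_gb / write_amplification
    days = host_write_budget_gb / host_writes_gb_per_day
    return days / 365.0


if __name__ == "__main__":
    # Hypothetical card: 160 GB, 500 GB of host writes per day, WA factor 2.
    for cycles in (5000, 10000):
        years = estimated_lifetime_years(160, cycles, 500, 2.0)
        print(f"{cycles} erase cycles: about {years:.1f} years")
```

The point of the sketch is only that the estimate scales linearly with capacity and erase cycles, and inversely with daily writes and write amplification.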

Posted in Uncategorized | 4 Comments

Review of Virident FlashMAX MLC cards

I have been following Virident for a long time (e.g. http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/). They have great PCIe Flash cards based on SLC NAND.
I always thought that Virident needed to come up with an MLC card, and I am happy to see they have finally done so.

At Virident’s request, I performed an evaluation of their MLC card to assess how it handles MySQL workload. As I am very satisfied with the results, I wish to share my findings in this post.

Posted in benchmarks, MLC | Leave a comment

Multiple MySQL instances on Fusion-io ioDrive

It is known that MySQL, due to internal limitations, is not able to utilize all the CPU and I/O resources available on modern hardware.
The idea is to run multiple instances of MySQL to get better performance from a Fusion-io ioDrive card.

The full report is available in PDF.


Posted in Uncategorized | 3 Comments

Intel 320 SSD write performance – contd.

I wrote about Intel 320 SSD write performance before, but I was not satisfied with those results.

Somehow, each time I tested the Intel 320 SSD I got different write performance, which made me look into this in detail.

So let’s run the experiment as in the previous post: sysbench fileio random write on different file sizes, from 10GiB to 140GiB in 10GiB steps. I use the ext4 filesystem, and I reformat the filesystem before each increase in file size.

The results are pretty much as in the previous post; the throughput drops as we increase the file size:

However, this is where the interesting stuff begins. When we run the same iterations again, the results look like this:

As you can see, the second time the throughput is much worse, even for medium-size files. Beyond 50GiB, throughput falls below 40MiB/sec. And this is despite the fact that I reformat the filesystem before each run.

This leads me to the conclusion that write performance on the Intel 320 SSD decreases over time, and is actually quite unpredictable at any given point in time. A filesystem format does not help; only the secure erase procedure returns the drive to its initial state. For reference, here are the commands for this procedure:

hdparm --user-master u --security-set-pass Eins /dev/sd$i
hdparm --user-master u --security-erase Eins /dev/sd$i

After discussing this problem with engineers who work with Intel 320 SSD drives, I was advised to use artificial space provisioning of about 20%. Basically, we create a partition which takes only 80% of the space.
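The sizing behind this advice can be sketched as below. This is only an illustration: the helper name, the 1 MiB alignment, and the 160 GB drive size (the model mentioned in the earlier random write post) are assumptions, not part of the advice I was given.

```python
MIB = 1024 * 1024


def provisioned_partition_bytes(drive_bytes, reserve_percent, align_bytes=MIB):
    """Size of a partition that leaves `reserve_percent` of the drive unused.

    The result is rounded down to a multiple of `align_bytes` so that the
    partition end stays aligned (1 MiB alignment is an assumption here).
    """
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    usable = drive_bytes * (100 - reserve_percent) // 100
    return usable // align_bytes * align_bytes


if __name__ == "__main__":
    drive = 160 * 10**9  # assumed 160 GB drive
    size = provisioned_partition_bytes(drive, 20)
    print(f"partition size: {size} bytes ({size / 10**9:.1f} GB)")
```

The unpartitioned remainder is never written by the filesystem, so the drive can use it as extra spare area.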

So let’s try this. The experiment is the same as before, with the difference that I use a 120G partition, and the maximum file size is 110GiB.

You can see that throughput in the first iteration is basically the same as with the full drive, but the second iteration performs much better. Throughput never drops below 40MiB/sec, and stays at about the 50MiB/sec level.

So I think this advice to use space provisioning is worth considering if you want some kind of protection and want to maintain throughput at a certain level.

As always, you can find the raw results and the scripts used on our Benchmarks Launchpad.



Posted in benchmarks, MLC | 6 Comments

Intel 320 SSD read performance

While PCI-e Flash cards show great performance, I am often asked about alternatives, as the price of PCI-e cards is still significant and not affordable for small companies and startups.

The Intel 320 SSD appears to be a popular drive with a quite acceptable price.
I wrote about the write performance of these drives, and now let’s take a look at a random read workload.

I used a Cisco UCS C250 as the base hardware, comparing:

  • regular RAID10 over 8 SAS 2.5" disks
  • a single Intel 320 SSD directly attached to a HighPoint RocketRAID 2300
  • two Intel 320 SSDs in hardware RAID0 mode, attached to an LSI SAS9211-4i controller

To simulate the workload I used sysbench’s fileio random reads. Scripts and raw results are available on Launchpad.

Let’s see throughput results:

Throughput, MiB/sec (more is better)
| Threads | Intel 320 | Intel 320, 2-drive stripe | RAID10 | Ratio Intel 320 / RAID10 | Ratio Intel 320 stripe / RAID10 |
|---|---|---|---|---|---|
| 1 | 30.27 | 31.18 | 3.75 | 8.07 | 8.31 |
| 2 | 55.18 | 60.49 | 6.98 | 7.91 | 8.67 |
| 4 | 95.13 | 112.85 | 12.10 | 7.86 | 9.33 |
| 8 | 143.58 | 191.64 | 19.05 | 7.54 | 10.06 |
| 16 | 174.75 | 277.70 | 26.70 | 6.54 | 10.40 |
| 32 | 174.60 | 351.84 | 32.90 | 5.31 | 10.69 |

And response times:

95% response time, ms (less is better)
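| Threads | Intel 320 | Intel 320, 2-drive stripe | RAID10 | Ratio RAID10 / Intel 320 | Ratio RAID10 / Intel 320 stripe |
|---|---|---|---|---|---|
| 1 | 0.53 | 0.56 | 6.13 | 11.57 | 10.95 |
| 2 | 0.72 | 0.59 | 7.27 | 10.10 | 12.32 |
| 4 | 0.89 | 0.74 | 10.07 | 11.31 | 13.61 |
| 8 | 1.24 | 0.95 | 15.63 | 12.60 | 16.45 |
| 16 | 1.76 | 1.38 | 25.52 | 14.50 | 18.49 |
| 32 | 3.33 | 2.15 | 47.35 | 14.22 | 22.02 |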

In conclusion, this drive provides great read performance. Compared with the RAID10 array, a single drive provides 5-8x better throughput and 10-14x better response time. Striping two drives raises the advantage to 8-10x in throughput and 10-22x in response time.
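The ratio columns in the tables above can be reproduced with a few lines of Python. This is just a sketch over the numbers already shown; the dictionary and function names are my own, and the ratios are rounded to two decimals as in the tables.

```python
# Rows copied from the tables above:
# threads -> (Intel 320, Intel 320 2-drive stripe, RAID10)
THROUGHPUT_MIB_S = {
    1: (30.27, 31.18, 3.75),
    2: (55.18, 60.49, 6.98),
    4: (95.13, 112.85, 12.10),
    8: (143.58, 191.64, 19.05),
    16: (174.75, 277.70, 26.70),
    32: (174.60, 351.84, 32.90),
}
RESPONSE_95_MS = {
    1: (0.53, 0.56, 6.13),
    2: (0.72, 0.59, 7.27),
    4: (0.89, 0.74, 10.07),
    8: (1.24, 0.95, 15.63),
    16: (1.76, 1.38, 25.52),
    32: (3.33, 2.15, 47.35),
}


def throughput_ratios(row):
    """SSD throughput divided by RAID10 throughput (higher is better)."""
    ssd, stripe, raid = row
    return round(ssd / raid, 2), round(stripe / raid, 2)


def response_ratios(row):
    """RAID10 response time divided by SSD response time (higher is better)."""
    ssd, stripe, raid = row
    return round(raid / ssd, 2), round(raid / stripe, 2)


if __name__ == "__main__":
    for threads in THROUGHPUT_MIB_S:
        print(threads,
              throughput_ratios(THROUGHPUT_MIB_S[threads]),
              response_ratios(RESPONSE_95_MS[threads]))
```

Note that the throughput ratio is SSD over RAID10 while the response time ratio is RAID10 over SSD, so that in both cases a bigger number means a bigger advantage for the SSD.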

While there are questions about write performance (see my previous post), I think this card is very suitable for read-intensive tasks, where you can expect significant improvements.




Posted in benchmarks, MLC | 3 Comments

FusionIO 320GB MLC random write performance

I was advised that new drivers and new firmware for FusionIO cards improve performance and stability, and that I should revisit the results I got about a year ago.

Using the same methodology and the same box as for the Intel 320 SSD, I ran random write benchmarks on a FusionIO 320GB MLC card (I no longer have the card I had a year ago on hand).

Information about the system and the FusionIO drives, as well as the raw results, is on Benchmarks Launchpad.

The first graph shows the timeline for different file sizes. The benchmark starts just after formatting the card and filesystem and runs for about 1 hour, measuring throughput every 10 seconds.

It is interesting to see the same pattern as for the Intel 320 SSD: the throughput starts at its maximum, then drops and, after a brief peak, stabilizes.

1500 sec seems to be enough to get a stable line for all file sizes. If we take the slice of data after 1500 sec and build a summary space -> throughput graph, it looks like this:
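Taking the slice after the warmup period amounts to something like the following sketch. My actual graphs were built with R/ggplot2; this Python version, its function name, and the sample values in it are only illustrative and are not measured data.

```python
from statistics import mean


def steady_state_mean(samples, warmup_sec):
    """Average throughput over samples taken at or after `warmup_sec`.

    `samples` is an iterable of (elapsed_seconds, throughput_mib_per_sec)
    pairs, e.g. one pair per 10-second reporting interval.
    """
    steady = [mib for elapsed, mib in samples if elapsed >= warmup_sec]
    if not steady:
        raise ValueError("no samples after the warmup period")
    return mean(steady)


if __name__ == "__main__":
    # Made-up samples for one file size: fast start, then a stable level.
    samples = [(0, 500.0), (10, 450.0), (1500, 120.0), (1510, 110.0)]
    print(steady_state_mean(samples, warmup_sec=1500))
```

Running this once per file size gives one point per size, which is what the summary graph plots.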

So we still have a steadily declining line. The throughput drops from 500MiB/sec at peak to 110-120MiB/sec at full capacity.

And to have some fun with R/ggplot2 graphs, let’s build a graph comparing FusionIO and the Intel 320 SSD (with the results from the previous post).


Posted in benchmarks, MLC, ssd | 3 Comments

Intel 320 SSD random write performance

While I like the performance provided by PCI-E cards like FusionIO or Virident tachIOn, I am often asked about SATA drive alternatives, as the price of PCI-E cards is often a barrier, especially for startups. There is a wide range of SATA drives on the market, and it is hard to pick one, but Intel SSDs are probably among the most popular, and I’ve got a pair of Intel 320 SSD 160GB drives to play with.

For me, probably the most interesting characteristic of an SSD is random write throughput as a function of file size, as it is known that write throughput declines as you use more space.

 

In this post I will test (using sysbench fileio) a single Intel 320 SSD with different file sizes (from 10 to 140GiB, in 10GiB steps). The filesystem is XFS and the I/O block size is 16KiB.

I posted all scripts and results on our Launchpad project.

I used the following methodology for testing: format XFS, then run a 1-hour random write test, measuring throughput every 10 seconds.

The results are a bit tricky to analyze, as throughput behaves in this way (for a 100GiB file size):

You can see that just after the format, throughput starts at 80MiB/sec, then drops to 10MiB/sec, and after about half an hour stabilizes at the 30MiB/sec level.

We can build the same graph (time -> throughput) for all the file sizes we have:

where you can see that throughput drops from 100MiB/sec for a 10GiB file to 15MiB/sec for a 140GiB file.

For reference, I added the result from a similar benchmark for RAID10 over 8 regular spinning SAS 15K disks, which is around 23MiB/sec.

From the graph we see that all results have stabilized after 2500 sec, and if we take the slice of data after 2500 sec, the summary graph (size -> throughput) looks like this:

This graph makes it much easier to see what the throughput is for a given file size.

E.g. for a 70GiB file we have 40MiB/sec, and for a 120GiB file it is 20MiB/sec.

Some conclusions from these results:

  • Intel 320 SSD performance is affected by the amount of used space. The more space is used, the worse the performance
  • Throughput may drop very sharply, e.g. from 10GiB to 20GiB it drops by 20%
  • When you run a benchmark on your own, take into account the time needed to get a stabilized result. It may take over half an hour in some cases

Finally, I want to give credit to the R project and ggplot2, which are very helpful for graphical analysis of data.


Posted in benchmarks, ssd | 15 Comments

MySQL 5.5.8 and Percona Server on Fast Flash card (Virident tachIOn)

Crossposted from MySQLPerformanceBlog.

This is to follow up on my previous post and show the results for MySQL 5.5.8 and Percona Server on the fastest hardware I have in our lab: a Cisco UCS C250 server with 384GB of RAM, powered by a Virident tachIOn 400GB SLC card.

To see different I/O patterns, I used different innodb_buffer_pool_size settings: 13G, 52G, and 144G on a tpcc-mysql workload with 1000W (around 100GB of data). This combination of buffer pool sizes gives us different data/memory ratios (for 13G – an I/O intensive workload, for 52G – half of the data fits into the buffer pool, for 144G – the data all fits into memory). For the cases when the data fits into memory, it is especially important to have big transactional log files, as in these cases the main I/O pressure comes from checkpoint activity, and the smaller the log size, the more I/O per second InnoDB needs to perform.

So let me point out the optimizations I used for Percona Server:

  • innodb_log_file_size=4G (innodb_log_files_in_group=2)
  • innodb_flush_neighbor_pages=0
  • innodb_adaptive_checkpoint=keep_average
  • innodb_read_ahead=none

For MySQL 5.5.8, I used:

  • innodb_log_file_size=2000M (innodb_log_files_in_group=2), as the maximal available setting
  • innodb_buffer_pool_instances=8 (for a 13GB buffer pool); 16 (for 52 and 144GB buffer pools), as it seems that in this configuration this setting provides the best throughput
  • innodb_io_capacity=20000; unlike in the FusionIO case, this value gives better results for MySQL 5.5.8.

For both servers I used:

  • innodb_flush_log_at_trx_commit=2
  • ibdata1 and innodb_log_files located on separate RAID10 partitions, InnoDB datafiles on the Virident tachIOn 400G card

The raw results, config, and script are in our Benchmarks Wiki.
Here are the graphs:

13G innodb_buffer_pool_size:

In this case, both servers show a straight line, and it seems having 8 innodb_buffer_pool_instances was helpful.

52G innodb_buffer_pool_size:

144G innodb_buffer_pool_size:

The final graph shows the difference between different settings of innodb_io_capacity for MySQL 5.5.8.

Small innodb_io_capacity values are really bad, while 20000 allows us to get a more stable line.

In summary, if we take the average NOTPM for the final 30 minutes of the runs (to avoid the warmup stage), we get the following results:

  • 13GB: MySQL 5.5.8 – 23,513 NOTPM, Percona Server – 30,436 NOTPM, advantage: 1.29x
  • 52GB: MySQL 5.5.8 – 71,774 NOTPM, Percona Server – 88,792 NOTPM, advantage: 1.23x
  • 144GB: MySQL 5.5.8 – 78,091 NOTPM, Percona Server – 109,631 NOTPM, advantage: 1.4x

This is actually the first case where I’ve seen NOTPM greater than 100,000 for a tpcc-mysql workload with 1000W.

The main factors that allow us to get a 1.4x improvement in Percona Server are:

  • Big log files. The total size of the logs is 8G (innodb_log_file_size=4G, innodb_log_files_in_group=2)
  • Disabling flushing of neighborhood pages: innodb_flush_neighbor_pages=0
  • New adaptive checkpointing algorithm innodb_adaptive_checkpoint=keep_average
  • Disabled read-ahead logic: innodb_read_ahead=none
  • Buffer pool scalability fixes (different from innodb_buffer_pool_instances)

We recognize that hardware like the Cisco UCS C250 and the Virident tachIOn card may not be for the mass market yet, but it is a good choice if you are looking for high MySQL performance, and we tune Percona Server to get the most from such hardware. Actually, from my benchmarks, I see that the Virident card is not fully loaded, and we may benefit from running two separate instances of MySQL on a single card. This is a topic for another round.

(Edited by: Fred Linhoss)

Posted in benchmarks, SLC | Leave a comment

Write performance on Virident tachIOn card

One of the biggest problems with solid state drives is that write performance may drop significantly with decreasing free space. I wrote about this before (http://www.ssdperformanceblog.com/2010/07/free-space-and-write-performance/), using a
FusionIO 320GB Duo card as the example. In that case, when space utilization increased from 100GB to 200GB, the write performance
dropped 2.6 times.

In this regard, Virident claims that tachIOn cards provide “Sustained, predictable random IOPS – Best in the Industry”. Virident generously provided me with a 400GB model, and I ran the benchmark using the same methodology as in my experiment with FusionIO, which was striped for performance. Also, using my script, Virident made runs on tachIOn 200GB and 800GB model cards and shared the numbers with me (to be clear, I can certify only the numbers for the 400GB card, but I have no reason to question the numbers for the 200GB and 800GB cards, as they correspond to my results).

The benchmark was done on a Cisco UCS C250 box, and the raw results are on the Benchmarks Wiki.



Visually, the drop is not as drastic as it was in the case using FusionIO, but let’s get some numbers.
I am going to take the throughput numbers at the points where the available space of the card is 1/3, 1/2, and 2/3 filled, as well as at the point where the card is full. Then I will compute the ratio of the throughput at the 1/3 point to the throughput at each of those points.

**For the 400GB tachIOn card:**

| Size | Throughput, MiB/sec | Ratio |
|---|---|---|
| 130 | 959.17 | |
| 200 | 849.58 | 1.13 |
| 260 | 685.18 | 1.40 |
| 360 | 417.33 | 2.29 |

That is, at the 2/3 point, the 400GB card is slower by 29% than at the 1/3 point, and at full capacity it is slower by about 56%.
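The ratio column and the "slower by" percentages are two views of the same numbers: the ratio is baseline / value, and the slowdown is 1 - value / baseline. A small sketch over the table above (the function name is my own):

```python
# (size, throughput in MiB/sec) rows from the 400GB tachIOn table above.
POINTS = [(130, 959.17), (200, 849.58), (260, 685.18), (360, 417.33)]


def slowdown(baseline, value):
    """Return (ratio, percent_slower) of `value` relative to `baseline`."""
    ratio = baseline / value
    percent_slower = (1 - value / baseline) * 100
    return ratio, percent_slower


if __name__ == "__main__":
    baseline = POINTS[0][1]  # throughput at the 1/3 point
    for size, throughput in POINTS[1:]:
        ratio, pct = slowdown(baseline, throughput)
        print(f"size {size}: ratio {ratio:.2f}, slower by {pct:.1f}%")
```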

Observations from looking at the graph:

* You can also see the card never goes below 400MB/sec, even when working at full capacity. This characteristic (i.e., high throughput at full capacity) is very important to know if you are looking to use an SSD card as a cache layer (say, with FlashCache), as, usually for caching, you will try to fill all available space.
* The ratio between the 1/3 capacity point and full capacity point is much smaller compared with FusionIO Duo (without additional spare capacity reservation).
* Also, looking at the graphs for Virident and comparing them with the graphs for FusionIO, one might be tempted to say that Virident just has a lot of space reserved internally which is not exposed to the end user, and this is what they use to guarantee a high level of performance. I checked with Virident and they tell me that this is not the case. Actually, from the diagnostic info on the Wiki page you can see: tachIOn0: 0x8000000000 bytes (512 GB), which I assume is the total installed memory. Regardless, it is not a point to worry about. For pricing, Virident defines GB as the capacity available to end users. So a competitive $/GB level is maintained, and it does not matter how much space is reserved internally.

Now it would be interesting to compare these results with the results for FusionIO. As the cards have different capacities, I made a graph which shows throughput vs. percentage of used capacity for both cards, FusionIO 320GB Duo SLC and Virident 400GB SLC:

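| Util % | Duo 320GB | tachIOn 400GB | tachIOn as % of Duo |
|---|---|---|---|
| 20% | 1,095 | 990 | 90% |
| 30% | 1,006 | 980 | 97% |
| 40% | 825 | 964 | 117% |
| 50% | 613 | 872 | 142% |
| 60% | 397 | 783 | 197% |
| 70% | 308 | 669 | 217% |
| 80% | 237 | 611 | 258% |
| 90% | 117 | 502 | 429% |
| 100% | 99 | 417 | 421% |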

In conclusion:
* On a single Virident card I see random write throughput close to or over 1GB/sec with low space usage, which is comparable with the throughput I got on a striped FusionIO card. I assume Virident maintains a good level of parallelism internally.
* The Virident card maintains a very good throughput level when close to full capacity, which means you do not need to worry (or can worry less) about space reservation or formatting the card with less space.

Posted in benchmarks, SLC | 1 Comment

Free space and write performance

In the previous post, On Benchmarks on SSD, a commenter touched on another interesting point: available free space significantly affects write performance on an SSD card. The reason is, again, the garbage collector, which operates more efficiently the more free space you have. To read more about the garbage collector and the write problem, you can check the Write amplification wiki page.

To see how performance drops with decreasing free space, let’s run the sysbench fileio random write benchmark with different file sizes.

For the test I took a FusionIO 320 GB SLC PCIe DUO™ ioDrive card, with software striping between the two cards, and here is a graph of how throughput depends on available free space (the bigger the file, the less free space):

You can see the system specification and the scripts used on the Benchmark Wiki.

On the graph you can see two lines (yes, there are two lines, even though they are almost identical).
The first line is the result when the FusionIO is formatted to use full capacity, and the second line is for the case when I use additional space reservation (25% in this case, that is, 240GB available). There is no difference in this case; however, additional over-provisioning protects you from overusing space, and keeps performance at the corresponding level.

It is clear that the maximal throughput strongly depends on available free space.
With 100GiB utilization we have 933.60 MiB/sec,
with 150GiB (half of capacity) 613.48 MiB/sec, and
with 200GiB it drops to 354.37 MiB/sec, which is 2.6x less than with 100GiB.

So, returning to the question of how to run a proper benchmark: the result depends significantly on what percentage of the card's space is used. The results for a 100GiB file on a 160 GB card will be different from the results for a 100GiB file on a 320 GB card.
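To make the comparison concrete, here is a small sketch of how one might compute the utilization percentage. It assumes the units are taken literally (file size in GiB, advertised card capacity in decimal GB); the function name is my own.

```python
GIB = 2**30   # binary gigabyte, as reported by the benchmark file size
GB = 10**9    # decimal gigabyte, as used for advertised card capacity


def utilization_percent(file_gib, card_gb):
    """Share of the card's advertised capacity taken by the test file."""
    if card_gb <= 0:
        raise ValueError("card capacity must be positive")
    return 100 * (file_gib * GIB) / (card_gb * GB)


if __name__ == "__main__":
    for card_gb in (160, 320):
        pct = utilization_percent(100, card_gb)
        print(f"100GiB file on a {card_gb} GB card: {pct:.1f}% used")
```

Reporting results against this percentage rather than against the absolute file size makes runs on cards of different capacity comparable.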

Besides free space, performance also depends on the garbage collector algorithm itself, and cards from different manufacturers will show different results. Some upcoming cards offer high performance at high space utilization as a competitive advantage, and I am going to run the same analysis on different cards.

Posted in benchmarks, MLC | 2 Comments