Have you ever EOLd an SSD?

I’ve heard and read various things about how SSDs die. But how do they actually die? The failure process seems like it should be more methodical than with spindle drives, given fewer factors: if you have X many of the same SSD, none of them suffer manufacturing defects, and you repeat the same series of operations on them, they should all die around the same time.

If that’s correct, then what happens to SSDs in RAID? A mirror configuration isn’t necessarily going to result in the exact same sequences of write operations – which means that, at some point, your drives might become out of sync in terms of volume sizing. Stripe, at least, has some chance of distributing the wear.

But has anyone actually deliberately EOLd SSDs in various configurations to see whether they die with dignity (and manageable degradation rates)? Or are people in for a shock when one day they find their SSD RAID array is down to 70% capacity and is going to be dead in the water before replacements can arrive?



A good question, to which I’ve yet to see an answer.

I’ve seen quite a few reports of people who are using them as their boot drives finding that they will suddenly stop booting. However, if you connect the drive to another computer you can still read & write to it. A little bit strange!

Guess time will tell.

@alpr I saw that discussion – posted by some first time submitter called “kfsone” :)

please forgive me, i was sure there was something familiar about that post, but my selective reading did not reveal that for me until now. i was under the impression the author was timothy, funny i did not see your name there until you mentioned it. guess some parts of my brain took a time out when stumbling upon this.

to make it up to you, i’ve spent all my mod points (10, did it twice) in your thread. almost 500 replies and some very good and insightful posts, not bad for a first time submitter.

so long Slpr

ROFL! You didn’t have to do that :) To be honest, I was a bit disappointed with the way I phrased it in that post. I was really aiming for feedback on EOLing industrial/corporate drives vs desktop drives. I’m wondering how things are going to go for folks like CCCP a few years from now – will they have warning or will the drives suddenly start losing disk space or mirror parity :)

So I’ve been doing a lot of work lately with SSDs. All enterprise type stuff, both SAS and SATA. i personally haven’t had any long enough to fail them, but in talking with the OEMs, the way these drives fail isn’t graceful per se but it is predictable with the right software in place (i use SANtools). cells get flaky and/or outright fail, making reading problematic and writing obviously impossible.

One thing is certain though, people grossly overestimate the volume of writes they do. Making an SSD fail due to cell wear actually takes some effort.
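To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Every figure in it (capacity, rated program/erase cycles, write amplification, daily write volume) is an illustrative assumption, not a spec for any real drive, and it assumes wear is spread perfectly evenly across cells.

```python
def years_to_wear_out(capacity_gb, pe_cycles, write_amplification,
                      host_writes_gb_per_day):
    """Rough years until rated cell endurance is used up.

    Assumes ideal wear levelling; real drives will differ.
    """
    # Total data the flash itself can absorb before cells hit their rating.
    total_flash_writes_gb = capacity_gb * pe_cycles
    # The host sees less than that, because each host write costs
    # write_amplification times as much flash write.
    total_host_writes_gb = total_flash_writes_gb / write_amplification
    return total_host_writes_gb / host_writes_gb_per_day / 365


# Hypothetical drive and workload, chosen only for illustration.
years = years_to_wear_out(
    capacity_gb=400,
    pe_cycles=3000,
    write_amplification=2.0,
    host_writes_gb_per_day=50,
)
print(f"about {years:.1f} years of writing before rated wear-out")
```

Plugging in your own measured daily write volume is the interesting part: most people find the result is much longer than they expected.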

That’s actually part of the reason for my concern, Mad; there are clearly some (plural) number of misunderstandings going on, and I can’t find anyone who, as a consumer, has forced and thus observed a normal EOL of an enterprise unit. It’s mostly theory and third hand.

I’m just a little sceptical when so many people look at a new tech as essentially unsinkable and so are entirely willing to view any interest in a count of lifeboats as irrelevant :)

well, so, i just ran into the only real failure scenario with SSDs on friday. I have a large nexenta install, 720TB raw spinning disk and 35 x 400GB + 9 x 800GB STEC SSDs. along with 8 x 8GB write cache but those don’t really count for space.

anyway, i just put in the 9 x 800GB drives and something wasn’t right. the rest of the system was acting strangely. finally found the problem. one of the new 800GB drives was throwing hardware errors. hardware errors within the SAS/SES ‘chain’ (really more like layer 2 ethernet for disks/jbods/hbas) can cause a lot of ‘wtf?’. pulled that drive and the rest of the system almost instantly started behaving again.

in terms of failures that is really the only running error you’re going to find. the drive is either DOA (never had a DOA from STEC), alive but throwing errors for some reason, or alive and working fine.

after that it is your job to watch the firmware stats and trust that your vendor isn’t bullshitting you when they say you can write ‘X’ to the drive before it fails.

if your original question stems from a need/requirement for work, meaning someone else is paying for it, then get a hold of the guys at SANtools. their toolset is extremely reasonably priced; however, it is enough of an expense that even power SOHO users probably can’t justify it. you can do all kinds of really low level monitoring with SANtools: catch errors the OS never knows about, watch the volume of data written to SSDs, change firmware settings to increase performance, etc.

another thing is even the best SSD vendors won’t tell you, or don’t know, how much 100% random data can be written. the volume of data required to fail a cell pretty much dictates that tests be run via mostly (or 100%) sequential throughput, or testing would potentially go from more than a month of 24/7 writing to possibly more than a year for some of the larger SSDs. like STEC’s XE line of drives, you can do 30 full writes per day every day for 5 years. That test starts to run awhile when you’re talking 800GB drives. The 6Gbps link only allows so much data per day, right?
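For a sense of scale, here is a rough calculation in Python using the figures above (800GB, 30 full writes per day, 5 years). The 600 MB/s link rate is an assumption: it is the usual payload ceiling of a 6Gbps SAS/SATA link after 8b/10b encoding, and it ignores protocol overhead and whatever the drive can actually sustain, so treat the result as a lower bound on test time.

```python
def endurance_test_days(capacity_gb, drive_writes_per_day, rated_years,
                        link_mb_per_s):
    """Days of nonstop writing at link speed to reach the rated write volume."""
    # Total write volume the vendor rating implies, in GB.
    rated_total_gb = capacity_gb * drive_writes_per_day * 365 * rated_years
    # Most the link can carry in 24 hours, in GB (1 GB = 1000 MB).
    link_gb_per_day = link_mb_per_s * 86_400 / 1000
    return rated_total_gb, link_gb_per_day, rated_total_gb / link_gb_per_day


rated_gb, per_day_gb, days = endurance_test_days(
    capacity_gb=800,
    drive_writes_per_day=30,
    rated_years=5,
    link_mb_per_s=600,  # assumed ceiling for a 6Gbps link
)
print(f"rated volume: {rated_gb / 1e6:.1f} PB")
print(f"link ceiling: {per_day_gb / 1000:.2f} TB/day")
print(f"time to write it all: {days:.0f} days")
```

Under those assumptions the rated volume works out to 43.8 PB, which takes well over two years of saturating the link even with purely sequential writes.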

FWIW: I’m not remotely connected to the hardware layer in my position at Bliz – this was born purely out of backlogged curiosity catch up =)

The last paragraph touches on why I think it’s a point of concern – unless people have proofed this stuff on smaller drives, I wonder what degree of certainty people have over failure monitoring on the big stuff :)

It comes down to what is being monitored and what you are seeing. For example, the SANtools software can report recovered errors that the O/S never learns about or reveals, because the details are stored inside the HDD/SSD and the data was recovered internally, so there is nothing to report. But say you bought 8 disks and after a day’s use one of them has thousands of these errors and the others have only a few, or zero.

The writing is on the wall that the one device that threw so many recovered errors in early life is going to be the problem child that needs to go back for a warranty replacement.
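The "one disk out of eight" comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not how SANtools works: the counts are made-up numbers typed in by hand, the function name is hypothetical, and the `factor` and `floor` thresholds are arbitrary assumptions. How you obtain the recovered-error counts depends entirely on your monitoring tool.

```python
from statistics import median


def flag_problem_drives(recovered_errors, factor=10, floor=100):
    """Return drives whose recovered-error count dwarfs the group's median.

    recovered_errors maps a drive name to its recovered-error count.
    A drive is flagged when its count exceeds both `floor` and
    `factor` times the median of the group.
    """
    typical = median(recovered_errors.values())
    threshold = max(floor, factor * typical)
    return sorted(name for name, count in recovered_errors.items()
                  if count > threshold)


# Hypothetical counts after a day of use on eight identical disks.
counts = {
    "disk0": 0, "disk1": 3, "disk2": 1, "disk3": 0,
    "disk4": 2, "disk5": 4210, "disk6": 0, "disk7": 5,
}
print(flag_problem_drives(counts))
```

Comparing against the median of the batch, rather than a fixed limit alone, keeps one noisy drive from shifting what counts as normal for its siblings.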

(Full disclosure: I am the president of SANtools and write most of the code. P.S. One of the reasons this software gets so much juicy data is because we have developer NDAs with most of the vendors, including STEC, so we have access to info that will never be revealed in any open source product.)
