When it comes to server hardware failures, I've seen nearly all of them in our own infrastructure. In the past 16 years of running AnandTech, virtually every component that could fail has failed, with the exception of CPUs: motherboards, power supplies, memory and, of course, hard drives.

By far the most frequent failure in our infrastructure had to be mechanical drives. Within the first year after the launch of Intel's X25-M in 2008, I had transitioned all of my testbeds to solid state drives. The combination of performance and reliability was what I needed. Most of my testbeds were CPU bound, so I didn't necessarily need a ton of IO performance - but having the headroom offered by a good SSD meant that I could get more consistent CPU performance results between runs. The reliability side was simple to understand - with a good SSD, I wouldn't have to worry about my drive dying unexpectedly. Living in fear of a testbed hard drive dying over the weekend before a big launch was a thing of the past. 

When it came to rearchitecting the AnandTech server farm, these very same reasons for going the SSD route on all of our testbeds (and personal systems) were just as applicable to the servers that ran AnandTech.

Our infrastructure is split up between front end application servers and back end database servers. With the exception of the boxes that serve our images, most of our front end app servers don't really stress IO all that much. The three 12-core virtualized servers at the front end would normally be fine with some hard drives; however, we decided to go with mainstream SSDs to lower the risk of a random mechanical failure. We didn't need the endurance of an enterprise drive in these machines since they weren't being written to all that frequently, but we needed reliable drives. Although quite old by today's standards, we settled on 160GB Intel X25-M G2s but partitioned the drives down to 120GB, leaving the remaining 40GB as extra spare area, in order to ensure they'd have a very long lifespan.
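
As a rough illustration of the arithmetic behind partitioning a 160GB drive down to 120GB, the sketch below computes how much capacity is left unpartitioned. The function name and the two ways of expressing the spare (as a share of raw capacity and as a share of the partitioned capacity) are assumptions made for illustration only. It works purely from the nominal capacities quoted above and ignores whatever spare area the drive already reserves internally.

```python
def over_provisioning(raw_gb: float, partitioned_gb: float) -> dict:
    """Return the capacity left unpartitioned, in GB and as percentages.

    Hypothetical helper: it only does arithmetic on nominal capacities
    and does not query or configure a real drive.
    """
    if not 0 < partitioned_gb <= raw_gb:
        raise ValueError("partitioned_gb must be positive and no larger than raw_gb")
    spare_gb = raw_gb - partitioned_gb
    return {
        "spare_gb": spare_gb,
        "spare_pct_of_raw": 100 * spare_gb / raw_gb,
        "spare_pct_of_partitioned": 100 * spare_gb / partitioned_gb,
    }


# Capacities quoted in the article: a 160GB X25-M G2 partitioned down to 120GB.
result = over_provisioning(160, 120)
print(f"Unpartitioned: {result['spare_gb']:.0f} GB")
print(f"Share of raw capacity: {result['spare_pct_of_raw']:.1f}%")
print(f"Relative to partitioned capacity: {result['spare_pct_of_partitioned']:.1f}%")
```

With these numbers, 40GB (a quarter of the nominal capacity) is never written to by the file system.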

Where performance matters more is in our back end database servers. We run a combination of MS SQL and MySQL, and our DB workloads are particularly IO intensive. In the old environment we had around a dozen mechanical drives in various RAID configurations powering all of the databases that ran the site. To put performance in perspective, I grabbed our old Forum Database server and took a look at the external SAS RAID array we had created. Until last year, the Forums were powered by a combination of 6 x Seagate Barracuda ES.2s and 4 x Seagate Cheetah 10K.7s. 

For the new Forums DB we moved to 6 x 64GB Intel X25-Es. Again, old by modern standards, but a huge leap above what we had before. To put the performance gains in perspective, I ran some of our enterprise IO benchmarks on the old array and the new array. We had split the DB workload across the Barracuda ES.2 array (6-drive RAID-10) and the Cheetah array (4-drive RAID-5); to keep things simple, however, I just created a 4-drive RAID-0 using the Cheetahs, which should give us more than a good indication of the peak performance of the old hardware:

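AnandTech Forums DB IO Performance Comparison - 2013 vs 2007

|                                          | MS SQL - Update Daily Stats | MS SQL - Weekly Stats Maintenance | Oracle Swingbench |
|------------------------------------------|-----------------------------|-----------------------------------|-------------------|
| Old Forums DB Array (4 x 10K RPM RAID-0) | 146.1 MB/s                  | 162.9 MB/s                        | 2.8 MB/s          |
| New Forums DB Array (6 x X25-E RAID-10)  | 394.4 MB/s                  | 450.5 MB/s                        | 55.8 MB/s         |
| Performance Increase                     | 2.7x                        | 2.77x                             | 19.9x             |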

The two SQL tests are actually from our own environment, so the performance gains are quite applicable. The advantage here is only around 2.7x. In reality the gains can be even greater, but we don't have good traces of our live DB load - just some of our most IO intensive tasks on the DB servers. The final benchmark however does give us some indication of what a more random enterprise workload can enjoy with a move to SSDs from a hard drive array. Here the performance of our new array is nearly 20x the old HDD array.
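
The ratios in the last row of the table can be reproduced directly from the throughput figures. The following is a minimal sketch; the dictionary and function names are chosen only for illustration, and the numbers are copied from the table above.

```python
# Throughput in MB/s, copied from the comparison table.
old_mbps = {
    "MS SQL - Update Daily Stats": 146.1,
    "MS SQL - Weekly Stats Maintenance": 162.9,
    "Oracle Swingbench": 2.8,
}
new_mbps = {
    "MS SQL - Update Daily Stats": 394.4,
    "MS SQL - Weekly Stats Maintenance": 450.5,
    "Oracle Swingbench": 55.8,
}


def speedups(old: dict, new: dict) -> dict:
    """Return new/old throughput ratio for every benchmark present in `old`."""
    return {name: new[name] / old[name] for name in old}


ratios = speedups(old_mbps, new_mbps)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The two SQL traces land close to 2.7x, while the more random Swingbench workload comes out at nearly 20x, matching the table.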

Note that there's another simplification that comes along with our move to SSDs: we rely completely on Intel's software RAID. There are no third party RAID controllers, no extra firmware/drivers to manage and validate, and there's no external chassis needed to get more spindles. We went from a 4U HP DL585 server with a 2U Promise Vtrak J310s chassis and 10 hard drives, down to a 2U server with 6 SSDs - and came out ahead in the performance department. Later this week I'll talk about power savings, which ended up being a much bigger deal.

This is just the tip of the iceberg. In our specific configuration we went from old hard drives to old SSDs. With even greater demands you could easily go to truly modern enterprise SSDs or even PCIe based solutions. Using a combination of consumer and enterprise drives isn't a bad idea if you want to transition to an all-SSD architecture. Deploying reliable consumer drives in place of lightly used hard drives is a way to cut down the number of moving parts in your network, while moving to higher performing/higher endurance enterprise SSDs can deliver significant performance benefits as well.

57 Comments

  • canthearu - Wednesday, March 13, 2013

    You are entirely wrong with your approach to hard drives.

    In the world of IT, if you start having any trouble with any hard drive or SSD, you pull it and replace it right away, and you either rebuild the array or restore from backups.

    The cost in time, and the risk of failure, means that you generally don't attempt to restore a hard drive to working condition, or recover data from a storage device, as anything except a very last resort. If you are forced into doing this, it also means your IT procedures and policies have failed.
  • nexox - Wednesday, March 13, 2013

    With well-tested SSDs, running validated firmware, in a proper environment, you're going to see very, very few unexpected failures. Most surprise SSD failures are caused by firmware bugs that are generally tickled by low-quality disk controllers and/or unexpected power loss. If you have high-quality SAS controllers (not necessarily a hardware RAID adapter) and you run plenty of power failure tests against a drive, you can be fairly sure it will last you quite a long time in production.

    And physical durability is actually important - high-density servers that run many spinning drives typically have vibration issues that can significantly shorten drive life span. SSDs are unaffected by vibration, and they don't cause more vibration that can affect any other spinning disks that you may have in your chassis / rack.
  • Michael McNamara - Wednesday, March 13, 2013

    Those performance numbers are very impressive... how are you backing up all that data? I'm guessing you have a disk to disk solution.

    Cheers!
  • Hrel - Wednesday, March 13, 2013

    You talk about going to SSDs, but what about storage space? You said you use 6 SSDs, but even that's not much space. What do you do to store data? SSDs still aren't any good for that.
  • brshoemak - Wednesday, March 13, 2013

    The previous article that detailed the new hardware indicated that the image hosting duties were being shifted to a cloud computing company. Images and multimedia usually account for a sizable chunk of a website. The remaining content would not take up THAT much space by comparison.

    I would be interested to know how much data makes up the main site versus the amount of data for the forums.
  • andrewbuchanan - Wednesday, March 13, 2013

    Did you choose ASP.NET because of ease of development, output caching, something else? How do you find Windows vs Linux from a reliability standpoint?

    I have a mixture of both at work and have for the most part been happy with both. Lots of forums bash Windows web servers but I've found them to work very well (no worse than LAMP), so I'm curious how your experience has been.
  • InternetGeek - Thursday, March 14, 2013

    Anand,

    Do you think you would see any benefits from moving to a cloud infrastructure? Azure recently started offering full SSD environments, and I believe it would be simple to migrate to such a setup.

    What do you think could be the trade-offs?
  • Dradien - Friday, March 15, 2013

    I've been coming to this site since I think 2002 or so. Consistently one of the best sites on the internet. Going beyond which video card/CPU/SSD/etc. is better, you seem to delve deep into the hows and whys of stuff performing the way it does, and it's nice to see that.

    That and you seem like a chill guy Anand, talking to your commenters and audience, seems like you honestly love and care about what you do, and have a passion for it, and I can respect that.
  • rschmitty - Wednesday, March 20, 2013

    "we settled on 160GB Intel X25-M G2s but partitioned the drives down to 120GB in order to ensure they'd have a very long lifespan"

    Could someone elaborate on that? How does partitioning an SSD give you a longer lifespan?
  • Hrobertgar - Wednesday, March 20, 2013

    I assumed the word 'partition' referred to having one 120GB 'partition' and a 40GB reserve, something AnandTech has mentioned in many SSD reviews (I think usually in terms of I/O consistency), as opposed to having one 120GB partition and a second 40GB partition.
