A Brand New Architecture

To understand how the S3700 is different, we need to revisit how SSDs work. I've done this several times over the years so I'll keep it as succinct as possible here. SSDs are made up of a bunch of NAND packages, each with 1 - 8 NAND die per package, with each die made of multiple planes, blocks and finally pages.

NAND is solid-state, non-volatile memory (data is retained even when power is removed, courtesy of some awesome physics). There are no moving parts, and accesses are very memory-like, which delivers great sequential and random IO performance. The downside is that NAND has some very strict guidelines dictating how it is written to and erased.

The first thing to know about NAND flash is that you can only write to the same NAND cell a finite number of times. The total amount of charge stored in a NAND cell is counted in dozens of electrons. The tunneling process that places the electrons on the floating gate (thus storing data) weakens the silicon oxide insulation layer that keeps the charge there. Over time, that layer degrades to the point where the cell can no longer store data, and it has to be marked as bad/unusable.

The second principle of dealing with NAND is that you can only write to NAND at the page level. In modern drives that's a granularity of 8KB.

The final piece of the puzzle, and the component that makes all of this a pain to deal with, is that you can only erase NAND at the block level, which for Intel's 25nm NAND is 256 pages (2048KB).
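To put numbers on the mismatch between write and erase granularity, here is a quick back-of-the-envelope sketch. The page and block sizes come from the text above; the helper name blocks_touched is made up for illustration.

```python
# Figures from the text: 8KB pages, 256 pages per erase block (Intel 25nm NAND).
PAGE_SIZE_KB = 8
PAGES_PER_BLOCK = 256

# Smallest unit that can be erased, in KB.
block_size_kb = PAGE_SIZE_KB * PAGES_PER_BLOCK


def blocks_touched(write_kb: int) -> int:
    """Minimum number of erase blocks a write of this size spans.

    Hypothetical helper: assumes the write starts on a block boundary.
    """
    pages = -(-write_kb // PAGE_SIZE_KB)        # ceiling division
    return -(-pages // PAGES_PER_BLOCK)


print(block_size_kb)          # 2048
print(blocks_touched(4))      # 1
print(blocks_touched(2049))   # 2
```

Even a 4KB write consumes a full 8KB page, and reclaiming that one page later means erasing all 256 pages in its block.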

Modern SSDs present themselves just like hard drives do, as a linear array of logical block addresses. The OS sends an address and command to the SSD, and the controller translates that address to a physical location in NAND.
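As a rough illustration of that translation step, the sketch below models a toy controller. Everything in it is an assumption made for the example: the ToyFTL class, the one-LBA-per-page simplification, and the tiny 4 x 4 geometry. It is not how any particular controller is implemented.

```python
class ToyFTL:
    """Toy translation layer: one logical address maps to one NAND page."""

    def __init__(self, num_blocks: int, pages_per_block: int):
        self.pages_per_block = pages_per_block
        self.total_pages = num_blocks * pages_per_block
        self.l2p = {}          # logical address -> physical page number
        self.invalid = set()   # physical pages holding stale data
        self.next_free = 0     # next never-written physical page

    def write(self, lba: int) -> int:
        if self.next_free >= self.total_pages:
            raise RuntimeError("no clean pages left; a real drive would garbage collect")
        old = self.l2p.get(lba)
        if old is not None:
            self.invalid.add(old)   # NAND pages cannot be overwritten in place
        self.l2p[lba] = self.next_free
        self.next_free += 1
        return self.l2p[lba]

    def read(self, lba: int) -> tuple[int, int]:
        """Translate a logical address to (block, page within block)."""
        return divmod(self.l2p[lba], self.pages_per_block)


ftl = ToyFTL(num_blocks=4, pages_per_block=4)
ftl.write(100)
ftl.write(7)
ftl.write(100)          # the rewrite lands on a fresh page
print(ftl.read(100))    # (0, 2)
print(ftl.invalid)      # {0}
```

The OS only ever sees address 100; the controller quietly moved it from page 0 to page 2 and left stale data behind.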

When writing to an SSD, the SSD controller must balance its desire for performance (striping writes across as many parallel NAND die as possible) with the goal of preserving NAND lifespan by writing to all cells evenly (wear leveling).
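Both goals can be sketched in a few lines. The two functions below are hypothetical and deliberately naive: round-robin striping stands in for the performance side, and picking the least-erased free block stands in for wear leveling. Real controllers weigh the two against each other in ways this does not capture.

```python
def stripe(num_pages: int, num_die: int) -> list[list[int]]:
    """Spread consecutive pages of one transfer across die, round-robin."""
    die = [[] for _ in range(num_die)]
    for page in range(num_pages):
        die[page % num_die].append(page)
    return die


def least_worn(erase_counts: dict[int, int]) -> int:
    """Pick the free block that has been erased the fewest times."""
    return min(erase_counts, key=erase_counts.get)


print(stripe(num_pages=8, num_die=4))            # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(least_worn({10: 120, 11: 95, 12: 300}))    # 11
```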

As writes come in, new pages are allocated from a pool of free blocks. As the process of erasing a NAND cell reduces endurance, a good SSD controller will prefer allocating an empty page for new data over erasing an old block. Eventually the controller will run out of clean/empty pages to write to and will have to recycle an old block filled (sometimes only partially) with invalid data to keep operating. This process can reduce overall performance and increase wear on the NAND.
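Here is a minimal sketch of that recycling step. It assumes a greedy policy (recycle the block with the most invalid pages), which is a common textbook choice and not something the text attributes to any specific drive. The page states and function names are invented for the example.

```python
def pick_victim(blocks: list[list[str]]) -> int:
    """Return the index of the block with the most invalid pages (greedy policy)."""
    return max(range(len(blocks)), key=lambda i: blocks[i].count("invalid"))


def recycle(blocks: list[list[str]], victim: int, spare: int) -> int:
    """Move valid pages from victim into spare, then erase victim.

    Returns the number of extra page writes the move cost.
    """
    moved = blocks[victim].count("valid")
    free_slots = [i for i, state in enumerate(blocks[spare]) if state == "free"]
    if moved > len(free_slots):
        raise ValueError("spare block cannot hold the victim's valid pages")
    for slot in free_slots[:moved]:
        blocks[spare][slot] = "valid"
    blocks[victim] = ["free"] * len(blocks[victim])   # erase happens per block
    return moved


blocks = [
    ["valid", "invalid", "invalid", "invalid"],
    ["valid", "valid", "invalid", "valid"],
    ["free", "free", "free", "free"],
]
victim = pick_victim(blocks)
extra_writes = recycle(blocks, victim, spare=2)
print(victim, extra_writes)   # 0 1
print(blocks[0])              # ['free', 'free', 'free', 'free']
print(blocks[2])              # ['valid', 'free', 'free', 'free']
```

The one extra page write is the cost the paragraph above refers to: data the host never asked to rewrite gets written again, which hurts both performance and endurance.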

When writing sequential data to an SSD it's easy to optimize for performance. Transfers can be broken up and striped across all available NAND die. Reading the data back is perfectly optimized for high performance as well. It's random IO that causes a problem for performance. Writes to random LBA locations are combined and sent out as burst traffic to look sequential; however, the mapping of those LBAs to physical NAND locations can leave the drive in a very fragmented state. With enough random data fragmented on a drive, all write performance will suffer as the controller will no longer be able to quickly allocate large contiguous blocks of free pages across all NAND die.

SSD in a fragmented state: white blocks represent free pages, Xes represent invalid data, colored blocks are valid data

Modern SSD controllers will attempt to defragment themselves either while the drive is in use, or during periods of idle time (hence the phrase idle garbage collection). Adequate defragmentation is necessary to maintain a drive's performance even after it has been used for a while. The best controllers do a great job of defragmenting themselves as they work, while the worst allow internal fragmentation to get out of hand.

With that recap out of the way, let's talk about how Intel's first and second generation SSD controllers worked.

The Indirection Table

There never was a true Intel X25-M G3; the third generation controller went missing after briefly appearing on Intel roadmaps. Instead we got mild revisions of the X25-M G2's controller with new features enabled through firmware. This old controller was used in the Intel SSD 320 and more recently in the Intel SSD 710.

One notable characteristic of this old controller was that it never required a large external DRAM (the early drives got by with 16 - 64MB). Intel was proud of the fact that it stored no user data in DRAM, which I always assumed kept the size requirements down. It turns out there was another reason.

All controllers have to map logical block addresses to physical locations in NAND. This map is stored on the NAND itself (and wear leveled so it actually moves locations), but it's cached in DRAM for fast access. Intel calls this map its indirection table.

In the old drives, the indirection table was a binary tree. A binary tree is a data structure made up of nodes and branches where each node can have at most two children.



An example of an LBA-tracking binary tree. Intel's implementation is obviously far more complex. This tree can get huge.

The old indirection table grew in size as the drive was written to. Each node would keep track of a handful of data, including the logical block address and the physical NAND location that the block mapped to. The mapping wasn't 1:1, so many nodes would refer to a starting LBA in addition to an offset, allowing a single node to refer to a range of physical locations.

As write requests came in, sequential data was stored as LBA + offset per node in the binary tree. Non-sequential data created a new node, growing the tree, and increasing lookup time. The tree remained balanced (for low-overhead searches, comp sci majors will remember that there's a direct relationship between the height of a binary tree and how long it takes to perform inserts/lookups on the tree), so the creation of new nodes could sometimes be very time intensive.
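To make the LBA + offset idea concrete, here is a small sketch of a range-tracking binary tree. It is purely illustrative: Node, ExtentTree and their fields are invented names, the tree is left unbalanced for brevity (the text says Intel's was kept balanced), and it assumes each LBA is written only once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    start_lba: int                  # first logical address covered by this node
    phys: int                       # physical page that start_lba maps to
    length: int = 1                 # number of consecutive pages covered
    left: Optional["Node"] = None
    right: Optional["Node"] = None


class ExtentTree:
    """Unbalanced binary search tree keyed on start_lba (toy model)."""

    def __init__(self):
        self.root = None
        self.nodes = 0

    def insert(self, lba: int, phys: int) -> None:
        if self.root is None:
            self.root, self.nodes = Node(lba, phys), 1
            return
        cur = self.root
        while True:
            # Sequential write: extends this node logically and physically.
            if lba == cur.start_lba + cur.length and phys == cur.phys + cur.length:
                cur.length += 1
                return
            side = "left" if lba < cur.start_lba else "right"
            child = getattr(cur, side)
            if child is None:           # non-sequential write: the tree grows
                setattr(cur, side, Node(lba, phys))
                self.nodes += 1
                return
            cur = child

    def lookup(self, lba: int) -> tuple[int, int]:
        """Return (physical page, nodes visited) for a logical address."""
        cur, hops = self.root, 0
        while cur is not None:
            hops += 1
            if cur.start_lba <= lba < cur.start_lba + cur.length:
                return cur.phys + (lba - cur.start_lba), hops
            cur = cur.left if lba < cur.start_lba else cur.right
        raise KeyError(lba)


tree = ExtentTree()
for i in range(8):                      # sequential run: stays a single node
    tree.insert(i, 100 + i)
for lba, phys in [(50, 200), (20, 201), (90, 202)]:   # random writes
    tree.insert(lba, phys)

print(tree.nodes)         # 4
print(tree.lookup(5))     # (105, 1)
print(tree.lookup(20))    # (201, 3)
```

Eight sequential pages cost one node, while three random writes cost three more, and the lookup for LBA 20 already has to walk three nodes deep.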

Given the very small DRAM that Intel wanted on its drives (to help keep costs as low as possible) and the increasing lookup times from managing an ever expanding tree, Intel would regularly defragment/compress the tree. With enough data in the tree you could actually begin compressing various nodes in the tree down into a single node. For example, we might have two separate nodes in the tree that refer to sequential physical locations, which can be combined into a single node with location + offset. The tree defrag/compression process would contribute to high latency with random IO.
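The compression step can be sketched as a merge over the nodes in LBA order. This is an assumption-laden toy, not Intel's algorithm: nodes are flattened to (start_lba, phys, length) tuples and compress is a made-up name.

```python
def compress(extents):
    """Merge neighbours that are contiguous in both LBA and physical space.

    extents: iterable of (start_lba, phys, length) tuples.
    """
    merged = []
    for start, phys, length in sorted(extents):
        if merged:
            p_start, p_phys, p_len = merged[-1]
            if start == p_start + p_len and phys == p_phys + p_len:
                merged[-1] = (p_start, p_phys, p_len + length)
                continue
        merged.append((start, phys, length))
    return merged


nodes = [(0, 100, 4), (4, 104, 2), (10, 300, 1), (6, 106, 2)]
print(compress(nodes))   # [(0, 100, 8), (10, 300, 1)]
```

Four nodes collapse to two, but the pass has to visit every node to get there, which hints at why doing this in the middle of random IO could add latency.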

There was another problem, however. The physical NAND had to be defragmented on a regular basis to keep pages contiguous and avoid a random sprinkling of pages on each block (this can negatively impact sequential IO performance if you go to write a large block of data and it either has to be split up amongst multiple randomly distributed blocks, or you have to erase and rewrite a bunch of blocks to make room for the new data). Once NAND was defragmented, the logical to physical mapping tree had to be updated to reflect the new mapping, and the two jobs could conflict. The tree could have just finished compressing itself, only for the NAND to defrag itself and force a recompression/reorganization of the tree. The fact that both the mapping tree and the physical NAND had to be defragmented, and that doing one could create more work for the other, contributed to some potentially high latencies in the old design.
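A toy example shows how moving data in NAND can undo the tree's work. As before, this is only a sketch: mapping entries are modeled as (start_lba, phys, length) tuples, and relocate is a hypothetical helper, not anything from Intel's firmware.

```python
def relocate(extents, lba, new_phys):
    """Move one logical page to a new physical page, splitting its extent."""
    out = []
    for start, phys, length in extents:
        if start <= lba < start + length:
            off = lba - start
            if off > 0:                       # pages before the moved one
                out.append((start, phys, off))
            out.append((lba, new_phys, 1))    # the moved page itself
            if off + 1 < length:              # pages after the moved one
                out.append((lba + 1, phys + off + 1, length - off - 1))
        else:
            out.append((start, phys, length))
    return out


table = [(0, 100, 8)]                          # one fully compressed node
print(relocate(table, lba=3, new_phys=500))
# [(0, 100, 3), (3, 500, 1), (4, 104, 4)]
```

One page move turns a single compressed entry into three, so the table has to be reorganized again right after it was tidied up.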

The old Intel controller had to defragment both the indirection table and the physical NAND space, and the two processes could conflict, which would create some unexpectedly high latency IO from time to time. On average, Intel was able to keep this under control, but when given the opportunity to start from scratch one major goal was to eliminate this cause of latency.


  • DukeN - Monday, November 05, 2012 - link

    Now please give us some results with benchmarks relevant to enterprise users (eg RAID performance, wear levelling vs other enterprise drives).
  • chrone - Monday, November 05, 2012 - link

    finally getting more consistent performance over time. nice writing Anand, as always! :)
  • edlee - Monday, November 05, 2012 - link

    on paper this is very nice, but i am not having any issues with current crop of ssds.

    how about intel helps design a new sata standard that supports more than 6Gbps, like 50Gbps, so its futureproof, and can put a deathblow to thunderbolt.
  • Conficio - Monday, November 05, 2012 - link

    You realize that
    * Thunderbolt is an Intel technology. So they are not looking to kill it
    * That thunderbolt can route your entire PCI bus across physical locations (6 m now, with optical cables ~100 m [if memory serves me])
    * That said you want SSD interfaces going directly to the PCI bus (not invent another intermediate bus that is built for a technology (spinning disks))
    * That direct PCI interfaces for SSDs is where things are going
  • dananski - Monday, November 05, 2012 - link

    "...direct PCI interfaces for SSDs is where things are going"

    I would like to see this become more common. There's 8Gb/s of spare PCI-E bandwidth on one slot on my machine at the moment.

    But what if SSDs advance faster than even PCI-E? I wonder if they could bring the interface even closer to home by allowing NAND chips to plug into memory-like slots on the motherboard (yay easy upgrade path), with the controller integrated into the CPU? The controller should be relatively unobtrusive - how much die area would it take at 22nm? And could some of the operations run efficiently on the main CPU to cut down that die area overhead some more?
  • JohnWinterburn - Monday, November 05, 2012 - link

    As much as fusion IO et al would like it to, direct PCI interfaces are certainly not where it's going for this market.

    You can't replace them easily when they break (as some are always going to when you have enough), you can't fit that many in a box, you have to rely on a single manufacturer and you're then tied into their software.

    None of that's going to change any time soon, so PCI interfaced SSDs will be small scale or for specific projects.
  • ogreslayer - Monday, November 05, 2012 - link

    That is what SATA express and SFF-8639 will be for and was announced a while ago.


    Maybe not 50Gbps but at 4GB/s and providing 32Gbps it isn't a small jump. Even the 2Gbps gen3 connection isn't something to sneeze at.
  • iwod - Monday, November 05, 2012 - link

    I still fail to understand why we need SATA express and SFF-8639, when one could have ruled them all. The main difference between SATA and SAS is SATA being Half Duplex and SAS being Full Duplex. But the underlying PCI-Express protocol is Full Duplex by design, so why make another SATA express and not just use SFF-8639 ?

    And I hope we start with PCI-E 3.0 too, by the time these things arrive there is no point of using the older and slower PCI-E 2.0
  • Kevin G - Monday, November 05, 2012 - link

    Look into SATA-Express. It essentially uses two PCI-E 2.0 lanes for data transfer (16 Gbit/s with 32 Gbit/s when the spec migrates over to PCI-E 3.0). There is some backwards compatibility with SATA too.

    Though SATA-Express will likely coexist with Thunderbolt. SATA Express is aimed as an internal storage solution whereas Thunderbolt is aimed toward external peripherals (where storage is just one aspect).
  • Kevin G - Monday, November 05, 2012 - link

    I'm curious about the raw depth of ECC in this device. ECC on the internal SRAM is pretty much expected for enterprise grade equipment nowadays. ECC on the DRAM is also expected but I'm wondering how it is implemented. Chances are that the drive doesn't house 9 DRAM chips for traditional 72 bit wide ECC protected bus. ECC on the NAND could be implemented at the block level (576 bit blocks with 512 bit data + 64 bit ECC) but that'd require some custom NAND chips.

    As for the indirect tables, I suspect that the need to be able to hold the entire table in DRAM stems from the idea of table having to optimize the copy in NAND. Optimizing here can likely be done without the massive DRAM cache but I suspect that the optimization process would require too many read/writes to the point it'd be detrimental to the drives life span.
