HPE Memory RAS: Excels at Being Average

A recent HPE blog stating that memory errors are not the end of the world was meant to reassure clients that they should accept regular and unplanned platform disruptions. In reality, what HPE ends up saying is that there is little difference between its own servers and those of the other commercial Intel server vendors, as they all range from below average to average at best.  It just so happens that this specific blog was written by the HPE Server Memory Product Manager, who might be forgiven for painting this dire picture only to then present the best alternative: yes, HPE SmartMemory. *shock*

To HPE’s credit, they have quite a bit of documentation discussing server Reliability, Availability & Serviceability (RAS) features, specifically about their memory subsystem. They are fairly forthright about the strengths and weaknesses of their entry, mid-range and high-end servers. Sadly though, at every level their message is full of qualifiers, limitations and restrictions, and the consumer is left to wade through and understand all of the requirements.

An HPE whitepaper from February 2016 titled “How memory RAS technologies can enhance the uptime of HPE ProLiant servers” paints a starkly different picture than the blog. The whitepaper states on page 2, in the 2nd paragraph of the introductory summary section, “It might surprise you to know that memory device failures are far and away the most frequent type of failure for scale-up servers.” The accompanying graph shows memory failing at up to 2X the rate of the next closest part when the memory is configured with a protection scheme no better than SDDC+1.  Another graph immediately follows this one, showing that when memory is configured with a protection scheme of DDDC+1, memory failures decrease by 85%. That is pretty good, yet the 85% figure in the whitepaper does not jibe with the blog, which states that with HPE SmartMemory, memory errors are reduced by 99.9998% (yes, that is 5 x 9’s).  I call out this discrepancy because right after claiming 5×9’s they point the reader to the very whitepaper I am citing here.
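To put the gap between the two figures in perspective, here is a rough back-of-the-envelope sketch. It assumes both percentages are reductions measured against a comparable baseline, which neither HPE source confirms, so treat the result as illustrative only. The variable and function names are mine.

```python
from fractions import Fraction

# Reduction figures quoted in the two HPE sources.
# Assumption: both are measured against a comparable baseline.
WHITEPAPER_REDUCTION = Fraction(85, 100)        # DDDC+1, per the whitepaper
BLOG_REDUCTION = Fraction(999998, 1000000)      # HPE SmartMemory, per the blog


def residual_fraction(reduction):
    """Fraction of the original failures that remain after the reduction."""
    return 1 - reduction


whitepaper_residual = residual_fraction(WHITEPAPER_REDUCTION)  # 15 in 100
blog_residual = residual_fraction(BLOG_REDUCTION)              # 2 in 1,000,000

# How many times more failures the whitepaper figure leaves behind.
gap = whitepaper_residual / blog_residual

print(f"Whitepaper leaves {float(whitepaper_residual):.4%} of failures")
print(f"Blog leaves       {float(blog_residual):.4%} of failures")
print(f"Ratio between the two claims: {int(gap):,}x")
```

Under that assumption the two claims differ by a factor of 75,000 in the failures left behind, which is far too large to be a rounding difference.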

This blog is not meant to define all of the different terms used; you will have to do some of that work. However, it is worth noting that all of the wonderful features touted in the HPE blog, in the HPE whitepaper and in many other sources come with many qualifiers, limitations and restrictions, such as:

  1. E5 chips do not support DDDC or DDDC+1
  2. E5 chips only support SDDC or SDDC + rank sparing
  3. Memory sparing consumes (wastes) either 25% or 12.5% of installed capacity
  4. EX chips support SDDC, SDDC + rank sparing, SDDC+1 and DDDC+1
  5. But DDDC+1 is supported ONLY with x4 DIMMs, not with x8 DIMMs
  6. DDDC+1 requires x4 DIMMs
  7. Advanced ECC is an option used across 2 DIMMs but can only fill 2 of 3 DIMM slots per channel
  8. Memory Mirroring is the most expensive in terms of cost & performance
  9. Memory Mirroring wastes 1/2 of the DIMM slots for the mirror – not usable
  10. Memory Mirroring only allows you to fill 2 of 3 DIMM slots per channel
  11. Memory Mirroring has a potential performance impact for WRITES
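One way to keep the chip and DIMM restrictions above straight is to write them down as a lookup table. The sketch below encodes only items 1, 2, 4, 5 and 6 from the list; the names `SUPPORTED_MODES`, `MODES_REQUIRING_X4` and `is_supported` are my own and do not come from any HPE or Intel tool.

```python
# Protection modes per processor family, as listed above (items 1, 2 and 4).
SUPPORTED_MODES = {
    "E5": {"SDDC", "SDDC + rank sparing"},
    "EX": {"SDDC", "SDDC + rank sparing", "SDDC+1", "DDDC+1"},
}

# Modes that only work with x4 DIMMs (items 5 and 6).
MODES_REQUIRING_X4 = {"DDDC+1"}


def is_supported(cpu_family, mode, dimm_width):
    """Return True if the list above allows this combination."""
    if mode not in SUPPORTED_MODES.get(cpu_family, set()):
        return False
    if mode in MODES_REQUIRING_X4 and dimm_width != "x4":
        return False
    return True


for combo in [("E5", "DDDC+1", "x4"), ("EX", "DDDC+1", "x8"), ("EX", "DDDC+1", "x4")]:
    print(combo, "->", is_supported(*combo))
```

Only the last combination passes, which is the point: the best protection scheme is available on just one chip family and one DIMM type.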

Let’s be clear, consumers have 3 primary options to configure memory on any of the Intel servers.

  1. Performance mode, which delivers the highest bandwidth with the lowest reliability features. Not an ideal option for in-memory workloads despite the appeal of maximizing bandwidth.
  2. Lockstep mode, which is meant to strike a balance: slightly decreased bandwidth (the decrease can be up to 50%¹) in exchange for increased reliability over performance mode.  Probably the most common option selected.
  3. Memory Mirroring mode, which delivers the highest reliability at the expense of wasting 1/2 of the memory capacity as well as a slight performance decrease (remember, this mode can only use 2 of the 3 DIMM slots per channel, so you already lose 1/3 of the memory capacity).
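The capacity cost of these choices is easy to work out. The sketch below uses a hypothetical socket with 4 memory channels, 3 DIMM slots per channel and 32 GB DIMMs; those numbers are my assumptions for illustration, not an HPE configuration. It also assumes sparing can be used with all slots populated, which the sources quoted here do not confirm.

```python
SLOTS_PER_CHANNEL = 3  # assumption: 3 DIMM slots per channel, as in the text


def max_installed_gb(channels, dimm_gb):
    """Capacity with every DIMM slot populated."""
    return channels * SLOTS_PER_CHANNEL * dimm_gb


def mirrored_usable_gb(channels, dimm_gb):
    """Mirroring: only 2 of 3 slots per channel, and half of that is the mirror."""
    installed = channels * 2 * dimm_gb
    return installed / 2


def spared_usable_gb(installed_gb, spare_fraction):
    """Sparing: 25% or 12.5% of installed capacity is set aside."""
    return installed_gb * (1 - spare_fraction)


# Hypothetical example: 4 channels, 32 GB DIMMs.
full = max_installed_gb(4, 32)
print("All slots populated:   ", full, "GB")
print("Memory Mirroring:      ", mirrored_usable_gb(4, 32), "GB usable")
print("Rank sparing at 25%:   ", spared_usable_gb(full, 0.25), "GB usable")
print("Rank sparing at 12.5%: ", spared_usable_gb(full, 0.125), "GB usable")
```

In this example mirroring leaves 128 GB usable out of a possible 384 GB, or one third of what the slots could hold.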

What is HPE’s response to clients who want increased memory RAS, especially for in-memory workloads such as SAP HANA?  Buy more expensive E7-based servers to receive slightly higher memory RAS capability OR install more memory on the already RAS-deficient E5-based servers to make up for the capacity consumed by memory spare ranks.

Net-net is that HPE is pushing proprietary memory that is far more expensive than the industry-standard memory traditionally used with Intel servers, the very memory that earned those servers their reputation as the low-cost leader relative to traditional Enterprise-class systems like IBM POWER or SPARC. That is evident in the SAP HANA space, where the systems required to support these in-memory workloads tend to require more capacity: more cores to achieve the core-to-memory ratios and more sockets to achieve more memory capacity with its associated bandwidth.  Yet HPE remains true to form: regardless of the path taken, it comes with increased cost, limitations, restrictions and qualifications.

In contrast to the never-ending “Compromise” Intel options, IBM’s POWER8 servers use Enterprise memory that is “No Compromise”.  This buffered memory offers spare capacity, spare lanes, memory instruction replay, chipkill and an incredible DDDC+1+1, allowing for multiple DRAM failures before experiencing a system event.  The design point for POWER8 memory is simple: Not to fail!

As you consider platforms to host in-memory workloads such as SAP HANA or DB2 BLU, consider which basket you want to place all of your eggs into: a platform with a memory subsystem designed not to fail, or a platform with unending limitations as listed above. The choice should be easy – Choose POWER!