HPE Memory RAS; Excels at being Average

A recent HPE blog arguing that memory errors are not the end of the world was meant to reassure clients and get them to accept regular & unplanned platform disruptions. In reality, what HPE ends up saying is that there is little difference between its own servers and those of the other commercial Intel server vendors, as they all range from below average to average at best. It just so happens that this specific blog was written by the HPE Server Memory Product Manager, who might be forgiven for painting this dire picture only to then present the best alternative: yes, HPE SmartMemory. *shock*

To HPE’s credit, they have quite a bit of documentation discussing server Reliability, Availability & Serviceability (RAS) features, specifically about their memory subsystem. They are fairly forthright about the strengths and weaknesses of their entry, mid-range and high-end servers. Sadly though, at every level their message is full of qualifiers, limitations and restrictions, which the consumer has to wade through and understand.

An HPE whitepaper from February 2016 titled “How memory RAS technologies can enhance the uptime of HPE ProLiant servers” paints a starkly different picture than the blog. The whitepaper states on page 2, in the 2nd paragraph of the introductory summary section, “It might surprise you to know that memory device failures are far and away the most frequent type of failure for scale-up servers.” The paper puts memory failures at up to 2X the rate of the next closest part when the memory is configured with a protection scheme no better than SDDC+1. Another graph immediately follows, showing that a protection scheme of DDDC+1 decreases memory failures by 85%. That is pretty good, yet the 85% in the whitepaper does not jibe with the blog, which states that HPE SmartMemory reduces memory errors by 99.9998% (yes, that is 5 x 9’s). I call out this discrepancy because right after claiming 5 x 9’s they point the reader to the very whitepaper I am citing here.
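To put a number on how far apart those two claims are, here is a minimal Python sketch. It assumes, purely for illustration, that both percentages are reductions measured against the same baseline failure rate, which neither the blog nor the whitepaper confirms. The function name `residual` is my own.

```python
from fractions import Fraction

def residual(reduction_percent):
    """Share of failures left over after a stated percentage reduction."""
    return 1 - Fraction(reduction_percent) / 100

# Figures quoted above; treating them as comparable is an assumption.
whitepaper = residual("85")       # DDDC+1 per the whitepaper
blog = residual("99.9998")        # HPE SmartMemory per the blog

print("Whitepaper residual:", whitepaper)   # 3/20
print("Blog residual:", blog)               # 1/500000
print("Ratio:", whitepaper / blog)          # 75000
```

Under that assumption the blog claims 75,000 times fewer remaining failures than the whitepaper it cites.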

This blog is not meant to define all of the different terms used; you will have to do some of that work. However, it is worth noting that all of the wonderful features touted in the HPE blog, in the HPE whitepaper and in many other sources come with many qualifiers, limitations and restrictions, such as:

  1. E5 chips do not support DDDC or DDDC+1
  2. E5 chips only support SDDC or SDDC + rank sparing
  3. Memory sparing consumes (wastes) either 25% or 12.5% of installed capacity
  4. EX chips support SDDC, SDDC + rank sparing, SDDC+1 and DDDC+1
  5. But DDDC+1 is ONLY supported with x4 DIMMs, not x8 DIMMs
  6. Advanced ECC is an option used across 2 DIMMs but can only fill 2 of 3 DIMM slots per channel
  7. Memory Mirroring is the most expensive in terms of cost & performance
  8. Memory Mirroring wastes 1/2 of the DIMM slots for the mirror – not usable
  9. Memory Mirroring only allows you to fill 2 of 3 DIMM slots per channel
  10. Memory Mirroring has a potential performance impact for WRITES

Let’s be clear, consumers have 3 primary options to configure memory on any of the Intel servers.

  1. Performance mode which delivers the highest bandwidth with the lowest reliability features. Not an ideal option for in-memory workloads despite the appeal to maximize the bandwidth.
  2. Lockstep Mode meant to strike a balance of slightly decreased bandwidth (can be up to 50%¹) while increasing reliability over performance mode.  Probably the most common option selected.
  3. Memory Mirroring Mode delivers the highest reliability at the expense of wasting 1/2 the memory capacity as well as a slight performance decrease (remember, this mode can only use 2 of the 3 DIMM slots per channel so you already lose 1/3 of the memory capacity).
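The capacity cost of these choices is easy to work out. The Python sketch below uses only the figures quoted in this post (2 of 3 slots fillable under mirroring, half of the populated memory used as the mirror, 25% or 12.5% consumed by sparing). It assumes equal-size DIMMs and measures against a server with every slot populated. The function names are my own, not anything from HPE or Intel.

```python
from fractions import Fraction

def usable_fraction_mirroring(slots_per_channel=3, fillable_slots=2):
    """Usable share of a fully populated server when mirroring is enabled."""
    populated = Fraction(fillable_slots, slots_per_channel)  # only 2 of 3 slots
    return populated / 2                                     # half is the mirror

def usable_fraction_sparing(spare_share):
    """Usable share of installed memory when spare ranks are set aside."""
    return 1 - spare_share

print("Mirroring:", usable_fraction_mirroring())                   # 1/3
print("Sparing 25%:", usable_fraction_sparing(Fraction(1, 4)))     # 3/4
print("Sparing 12.5%:", usable_fraction_sparing(Fraction(1, 8)))   # 7/8
```

In other words, under these assumptions a mirrored server delivers one third of the memory its slots could otherwise hold.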

What is HPE’s response to clients who want increased memory RAS, especially for in-memory workloads such as SAP HANA? Buy more expensive E7 based servers to receive slightly higher memory RAS capability OR install more memory on the already RAS-deficient E5 based servers so that they have the capacity to utilize memory spare ranks.

Net-net, HPE is pushing proprietary memory that is far more expensive than the industry-standard memory traditionally used with Intel servers, the very memory that earned those servers their reputation as the low-cost leader relative to traditional Enterprise-class systems like IBM POWER or SPARC. That is evident in the SAP HANA space, as the systems required to support these in-memory workloads tend to require more capacity: more cores to achieve the core-to-memory ratios and more sockets to achieve more memory capacity with its associated bandwidth. Yet HPE remains true to form: regardless of the path taken, it comes with increased cost, limitations, restrictions and qualifications.

In contrast to the never-ending “Compromise” Intel options, IBM’s POWER8 servers use Enterprise memory that is “No Compromise”. This buffered memory offers spare capacity, spare lanes, memory instruction replay, chipkill and an incredible DDDC +1+1, allowing for multiple DRAM failures before experiencing a system event. The design point for POWER8 memory is simple: Not to fail!

As you consider platforms to host in-memory workloads such as SAP HANA or DB2 BLU, consider which basket you want to place all of your eggs into: a platform with a memory subsystem designed not to fail, or a platform with unending limitations as listed above. The choice should be easy – Choose POWER!

 

SAP HANA – could I have extra complexity please?

I just returned from IBM’s Systems Technical University conference held in Orlando, where I delivered presentations on 4 different topics.

  1. Benefits of SAP HANA on POWER vs Intel
  2. Why IBM POWER systems are datacenter leaders
  3. Only platform that controls Software Licensing
  4. Why DB2 beats Oracle on POWER (implied that it beats Intel).

With the SAP Sapphire conference last week in Orlando, there was a slew of announcements. Quick reminder for those uninitiated with SAP HANA: it is ONLY supported on Intel and POWER based systems running one OS, Linux, from either SUSE or Red Hat. With that, IBM POWER continues to deliver the best value.

What is the value offered with the POWER stack? Flexibility! It really is that simple.  If I had a mic on the plane as I write this, I would drop it. Conversely, what is the value offered going with an Intel stack? Compromise!

Some of the flexibility offered through IBM POWER systems: scale-up, scale-out, complete virtualization, grow, shrink, move, perform concurrent maintenance, and mix workloads, such as existing ECC workloads on AIX or IBM i with new HANA running Linux, all on the same server. All of this runs on the most resilient HANA platform available.

Why do I label Intel systems as “Compromise” solutions? It isn’t a competitive shot nor FUD. Listen, as a Client Executive and Executive Architect for a Channel Reseller, I am able to offer my clients solutions from multiple vendors, including IBM POWER and Intel based systems manufacturers. I’ve made the conscious decision though to promote IBM POWER over Intel. Why? Because I not only believe in the capabilities of the platform but, having worked with some of the largest companies in the world, I also regularly hear and see the impact that running Enterprise workloads on Intel based servers has on the business.

If you read my previous blog, I mention a client who just recently moved their Oracle workloads from POWER to Intel. Within months, they’ve had to buy over $5M in new licenses, going from a simple standalone and a few 2-node clusters (all on the same servers) to an 8-node VMware based Oracle RAC cluster. This environment is having daily stability issues, significantly impacting their business. Yes, their decision to standardize on a single platform has introduced complexity to the business, costing them money and resources (staff who are exhausted & lack the proper skills to manage the complexity) and impacting their end-users.

The “Compromise” I mention in hosting SAP HANA on Intel is that everything has an asterisk by it – in other words a limitation or restriction. Everything requires follow-up questions and research to ensure that what the business wants to do can be done. Here are some examples.
1) VMware vSphere 5.5 initially supported 1 VM per system, which has now been increased to 4 VM’s, but with many qualifications.
   a) Restricted to 2 & 4 socket Intel servers
      1) VM’s are limited to a socket
      2) A 2 socket server ONLY supports 2 VM’s; a 4 socket server supports 4 VM’s of 1 socket each
   b) Only E5_v2, E5_v3, E7_v2 and E7_v3 chips are supported – NO Broadwell
   c) Want to redeploy capacity for other uses? Appliances certified only for SoH or S4H cannot be used for other purposes such as BW
   d) Did I mention, those VM’s are also limited to 64 vCPU and 1 TB of memory each
   e) What if a VM needs more memory than what is attached to its socket? No problem, you have to add an additional socket and all of its memory – no sharing!
2) VMware vSphere 6.0 just recently went from 1 to 16 VM’s per system.
   a) VM’s are still limited to a socket or 1/2 socket.
   b) 1/2 socket isn’t as amazing as it sounds. Since vSphere supports 2, 4 & 8 socket servers, there can be 16 x 1/2 socket VM’s.
   c) What there cannot be is any combination of VM’s >1 socket with 1/2 socket assigned. In other words, a VM cannot have 1.5 or 3.5 sockets. Any VM resource requirement above 1 socket requires the addition of an entire socket. 1.5 sockets would be 2 sockets.
   d) Multi-node setups are NOT permitted …. at all!
   e) VM’s larger than 2 sockets cannot use Ivy Bridge based systems, only Haswell or Broadwell chips – but ONLY on 4-socket servers. Oh my gosh, this is making my head hurt!
   f) If using an 8-socket system, it only supports a single production VM using Haswell ONLY processors. NOT Ivy Bridge and NOT Broadwell!
   g) VM’s are limited to 128 vCPU and 4 TB of memory
3) VMware vSphere 6.5 with SAP HANA SPS 12 only supports Intel Broadwell based systems. What if your HANA Appliance is based on Ivy Bridge or Haswell processor technology? “Where is that Intel rep’s business card? Guess I’ll have to buy another one since I can’t upgrade these”
   a) VM’s using >4 sockets are currently NOT supported with these Broadwell chips
   b) Now, it gets better. I hope you are writing this down – for 2 OR 8 socket systems, the maximum VM size is 2 sockets. Only a 4 socket system supports 1 VM with 4 sockets.
   c) Same 1/2 socket restrictions as vSphere 6.0.
   d) Servers with >8 sockets do NOT permit the use of VMware
   e) If your VM requirements exceed 128 vCPU and 4 TB of memory, you must move it to a bare-metal system ….. Call me – I’ll put you on a POWER system where you can scale-up, scale-out without any of this mess
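
The half-socket rule from the vSphere 6.0 list above can be sketched in a few lines of Python. This is only my reading of the rule as stated here (a VM gets 1/2 socket, or else a whole number of sockets). It ignores the per-server and per-chip maximums also listed. The function name `sockets_to_allocate` is hypothetical and not a VMware or SAP API.

```python
import math
from fractions import Fraction

def sockets_to_allocate(required_sockets):
    """Round a VM's socket requirement up to a permitted size.

    Permitted sizes, as described in the list above: 1/2 socket,
    or a whole number of sockets. Nothing in between.
    """
    need = Fraction(required_sockets)
    if need <= 0:
        raise ValueError("a VM needs a positive amount of resource")
    if need <= Fraction(1, 2):
        return Fraction(1, 2)
    return Fraction(math.ceil(need))  # 1.5 becomes 2, 3.5 becomes 4

for need in ("0.5", "1", "1.5", "3.5"):
    print(need, "->", sockets_to_allocate(need))
```

A VM that needs 1.5 sockets worth of resource ends up paying for 2, which is exactly the kind of rounding that adds up across a landscape.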

Contrast all of these VMware + Intel limitations, restrictions, liabilities and qualifications, or simply said “Compromise” systems, with the IBM Power System.

POWER8 servers run the POWER Hypervisor called PowerVM. This Hypervisor and its suite of features deliver flexibility, allowing all physical, all virtual or a combination of physical & virtual resource usage on each system. Even where there are VM limits such as 4 on the low-end system, that 4 could really be 423 VM’s. I’m making a theoretical statement here to prove the point. Let’s use a 2 socket 24 core S824 server: 3 VM’s, each with 1 core (yes, I said core) for production usage, and the 4th VM is really a Shared Processor Pool with 21 cores. Those 21 cores support up to 20 VM’s per core, or 420 VM’s. Any non-production use is permitted.
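
For anyone who wants to check the 423 figure, here is the arithmetic as a small Python sketch. The numbers come straight from the S824 example above. The function and parameter names are mine, and the result is the same theoretical ceiling, not a sizing recommendation.

```python
def max_vms(total_cores, dedicated_vms, vms_per_shared_core=20):
    """Theoretical VM count: dedicated 1-core VMs plus a shared processor pool."""
    shared_cores = total_cores - dedicated_vms  # each dedicated VM takes 1 core
    return dedicated_vms + shared_cores * vms_per_shared_core

# 24-core S824: 3 production VMs with 1 core each, 21 cores in the shared pool
print(max_vms(total_cores=24, dedicated_vms=3))  # 423
```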

Each PowerVM VM supports up to 16 TB of memory and 144 cores.  VM size above 108 cores requires the use of SMT4 whereas <=108 cores permit SMT8.  Thus, 144 cores with SMT4 is 576 vCPU’s or 4.5X what Intel can do with 4X the memory footprint.  By the way, that 108 core VM would support 864 vCPU’s – just saying!  Note: I need to verify as the largest SMT8 VM may be 96 cores with only 768 vCPU.
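
The vCPU math above is simply cores multiplied by SMT threads per core. A quick Python sketch using the figures quoted in this post follows. The 128 vCPU and 4 TB values are the vSphere limits listed earlier, and the constant names are my own.

```python
def vcpus(cores, smt):
    """Hardware threads presented by a VM: cores times SMT threads per core."""
    return cores * smt

VSPHERE_VCPU_LIMIT = 128   # per-VM limit quoted earlier in this post
VSPHERE_MEM_TB = 4
POWERVM_MEM_TB = 16

smt4_vm = vcpus(144, 4)    # largest VM, SMT4
smt8_vm = vcpus(108, 8)    # largest SMT8 VM as stated above
smt8_alt = vcpus(96, 8)    # the alternative noted above as still to be verified

print(smt4_vm, smt8_vm, smt8_alt)            # 576 864 768
print(smt4_vm / VSPHERE_VCPU_LIMIT)          # 4.5
print(POWERVM_MEM_TB / VSPHERE_MEM_TB)       # 4.0
```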

Not only can we allocate physical cores to VM’s without being limited to 1/2 or full socket increments like Intel, but the granularity of POWER systems also allows for adjustments at the vCPU level.

PowerVM supports scale-out and scale-up.  Then again, if you have heard or read about the Pfizer story for scale-out BW, you might rethink a literal scale-out approach. Read IBM’s Alfred Freudenberger’s blog on this subject at https://saponpower.wordpress.com/2016/05/26/update-sap-hana-support-for-vmware-ibm-power-systems-and-new-customer-testimonials/

While on the subject of BWoH/B4H, PowerVM supports 6 TB per VM whereas vSphere 6.0 supports 3 TB, and the limitations increase from here.

Do you see why I choose to promote IBM Power vs Intel? When I walk into a client, the most valuable item I bring with me is my credibility. HANA on Intel is a constant train wreck with constant changes & gotcha’s. Clients currently running HANA on Intel or, better yet, running ECC on Intel have an option: move to a HANA 2.0 environment using SUSE 12 or Red Hat v7 Linux on POWER servers. Each server will host multiple VM’s with greater resiliency, providing the business the flexibility desired from the critical business system that likely touches every part of the business.