Impact of oversized virtual machines… my part

Oversized VMs are a challenge that ends up in your infrastructure, most of the time because of recommendations from application vendors. Their goal is to ensure that end users who purchase their software get the best possible experience and that the layers underneath the application won’t interfere with it. That goal is legitimate, but most of the time running the application inside a virtual infrastructure is not accounted for, and it should not be their concern to address either.

Frank Denneman has an excellent series on the topic of oversized VMs. The goal of the present article is not to review the technical details; there are plenty of articles out there that describe the memory management process in more detail than I could. The goal is to bring the focus on the level of collaboration between the different layers in the cloud era and to highlight some undesired and unexpected downsides of oversized VMs.

I was recently asked to review a virtual machine that showed some ballooning. Ballooning investigations start at the host level, as ballooning is a host response to a change in the memory state. Sure enough, other virtual machines on the host were also targeted by the reclamation process, as shown below:

This brings up another item that can play against you. The ballooning reclamation process chooses which VMs to target and assigns each one a balloon target based on active memory %. Virtual machines with lower levels of active memory will be targeted the most. There is a pretty good chance that the oversized virtual machines will be among your critical workloads, so they might end up being the most impacted by the balloon driver.

The next step was to review the memory activity on the host to identify which virtual machine(s) created the change in memory state that triggered the memory reclamation process. With all its caveats, this is where Active Memory can become extremely useful. Active Memory does not provide any final answer due to its nature, but it can surely put you on the right track during your investigation. Looking at the active memory for all virtual machines on that host, the graph clearly showed that a single VM had a significant and sustained increase in memory activity:

The chain of events is suddenly rather self-explanatory:

  1. A single VM with an average Active Memory under 1GB suddenly started to actively use all of its 50GB.
  2. ESX needed to allocate memory pages that it might not have had due to overcommitment.
  3. The overall host memory state changed accordingly.
  4. Consequently, reclamation through ballooning got triggered on multiple VMs on the same host.
  5. End users noticed performance degradation.
  6. vCenter/DRS, keeping an eye on the cluster, initiated some migrations.
  7. The balloon driver deflated.
  8. …back to normal
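
The first link in that chain, a VM whose active memory jumps and then stays high, is the kind of pattern a small script can flag once the per-VM samples have been exported from the monitoring tool. The sketch below is only an illustration: the function name, the thresholds (a 5x jump held for the last three samples), and the sample data are all assumptions of mine, not something vCenter provides.

```python
from statistics import mean


def find_sustained_jumps(samples_mb, baseline_points=6, factor=5.0, sustain_points=3):
    """Return the VMs whose active memory jumped well above baseline and stayed there.

    samples_mb maps a VM name to its active-memory samples in MB, oldest first.
    The thresholds are arbitrary illustration values, not ESX or vCenter settings.
    """
    flagged = []
    for vm, series in samples_mb.items():
        if len(series) < baseline_points + sustain_points:
            continue  # not enough history to judge this VM
        baseline = mean(series[:baseline_points])
        recent = series[-sustain_points:]
        # "Sustained" here means every one of the most recent samples is above the bar.
        if baseline > 0 and all(s >= factor * baseline for s in recent):
            flagged.append(vm)
    return flagged


# Hypothetical samples loosely mirroring the scenario: one VM idles under 1GB
# of active memory, then suddenly touches close to its full 50GB.
active_mb = {
    "vm-app01": [800, 900, 850, 900, 950, 900, 48000, 50000, 49500],
    "vm-web01": [2000, 2100, 1900, 2000, 2050, 2000, 2100, 1950, 2000],
}

print(find_sustained_jumps(active_mb))
```

With this made-up data only vm-app01 is flagged. As with Active Memory itself, the result is a pointer for where to look next, not an answer.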

As the engineer, should I ensure that those variations can occur without impacting the other VMs? Yes, one of my primary objectives is to ensure a great end-user experience. This could be a potential candidate for memory reservation.

Unless I ask the question: what was that memory used for? With all the monitoring tools I have access to inside my “layer”, I still have no insight into what this memory is being used for. As this is a Linux VM, I requested a SAR report to review.

Two graphs caught my attention. The first one, MEMUSED, clearly shows that from an OS standpoint almost all the memory is being used (93%).

The second graph, CACHED, answered the question…

What was the memory being used for? Over the time period covered by the report, 46GB of RAM was used as cache for IO reads. When we noticed the problem at the virtual layer, the cache level dropped to 40GB. By design, Linux will use unclaimed memory as cache (pretty smart if you ask me). As processes request memory pages, the kernel discards some older elements from the cache and grants the memory pages to the processes.
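
To see the same split without waiting for a SAR report, the kernel exposes the numbers in /proc/meminfo. Here is a minimal sketch that separates cache from memory held by processes, the way the older "-/+ buffers/cache" line of the free command did. The helper names and the sample values are mine; the sample is hypothetical and only chosen to resemble the figures above (50GB total, roughly 46GB of cache).

```python
# Hypothetical /proc/meminfo excerpt (values in kB), not taken from the real VM.
SAMPLE_MEMINFO = """\
MemTotal:       52428800 kB
MemFree:         3670016 kB
Buffers:          262144 kB
Cached:         47972352 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_breakdown(info):
    """Split memory into free, cache (Buffers + Cached), and everything else.

    Simplification: Cached also counts shared memory, which the kernel cannot
    simply drop, so treat the "process" figure as a lower bound.
    """
    total = info["MemTotal"]
    free = info["MemFree"]
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "total_kb": total,
        "free_kb": free,
        "cache_kb": cache,
        "process_kb": total - free - cache,
    }


breakdown = memory_breakdown(parse_meminfo(SAMPLE_MEMINFO))
for key, kb in breakdown.items():
    print(f"{key}: {kb / 1024**2:.2f} GiB")
```

On a live Linux system, replacing SAMPLE_MEMINFO with the contents of open("/proc/meminfo").read() gives the current picture.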

After looking at the backup schedule, I had the big picture. The application running on this VM is not using the memory it has access to. The backup agent started and requested some memory pages, which explains the 4GB dip in cache memory. The backup process generated lots of IO reads, and as Linux had a lot of unused memory, it leveraged it as cache. This is what created the cascade of events in the virtual infrastructure.

Before knowing what the memory was being used for, putting a memory reservation in place could have seemed appropriate, and it would be if the memory was used by a core application process. But putting a memory reservation in place for IO cache… no comment 😉

The interesting thing about what is described above is that everything worked as it should. No malfunction or failure occurred, yet the end user experienced a noticeable performance degradation.

Presenting sizing recommendations based on active memory can be tricky. Active Memory does not provide facts, as it is a statistical sampling estimate from ESX. Digging inside the OS and finding facts helps build a solid case. In the present scenario, Linux and the cache memory hold the answer. Each OS and application will have its own set of metrics matching different behaviors.
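
As a rough illustration of turning those OS-level facts into a number, the sketch below takes memory samples shaped like the output of sar -r and works out the peak usage once buffers and cache are excluded. Several things here are assumptions: that kbmemused includes buffers and cache (which matches the 93% seen in the report, though newer sysstat versions may compute it differently), the 25% headroom, the function name, and the sample values themselves.

```python
import math


def suggest_memory_gib(samples, headroom=0.25):
    """Suggest a VM memory size from sar-like samples (all values in kB).

    Takes the peak of kbmemused minus buffers and cache, adds a headroom
    fraction, and rounds up to a whole GiB. The headroom is an arbitrary
    illustration value, not a recommendation.
    """
    peak_kb = max(
        s["kbmemused"] - s["kbbuffers"] - s["kbcached"] for s in samples
    )
    needed_kb = peak_kb * (1 + headroom)
    return math.ceil(needed_kb / (1024 * 1024))


# Hypothetical samples: a quiet period, then a backup window where processes
# take more memory and the cache shrinks to make room.
samples = [
    {"kbmemused": 48758784, "kbbuffers": 262144, "kbcached": 47972352},
    {"kbmemused": 48758784, "kbbuffers": 262144, "kbcached": 41680896},
]

print(suggest_memory_gib(samples), "GiB")
```

With these made-up samples the suggestion comes out at 9 GiB for a VM configured with 50GB. A real recommendation would need a much longer observation window and a conversation with the application owner about how much cache the workload actually benefits from.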

This is one example that brings the focus on the level of collaboration required across all layers. At the end of the line, all layers provide a single service.