A Common IT Man's Blog: September 2014

When running MongoDB on a Linux server with multiple physical processors (as most servers have these days), you will see the following warning in your log file on startup:

2014-09-04T13:13:35.245-0400 [initandlisten] ** WARNING: You are running on a NUMA machine.
2014-09-04T13:13:35.245-0400 [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
2014-09-04T13:13:35.245-0400 [initandlisten] **              numactl --interleave=all mongod [other options]

MongoDB has performed this check for NUMA-based systems on startup since version 2.0, because running MongoDB (or, for that matter, any database application) on NUMA hardware with the default memory policy can cause poor performance due to excessive swapping.

There is no ideal solution to this problem yet, but a good workaround is to switch off zone reclaim and set a memory interleave policy before starting the mongod process. Here are the three steps, all of which are needed to fully work around the NUMA behavior:

Switch off zone reclaim:
echo 0 > /proc/sys/vm/zone_reclaim_mode
Flush Linux buffer caches just before starting mongod:
sysctl -q -w vm.drop_caches=3
Set interleave policy for mongod process when starting it:
numactl --interleave=all <$MONGO_HOME>/mongod <options>
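The three steps above can be combined in a small wrapper. This is only a sketch: the function names `run` and `start_mongod_numa` and the `DRY_RUN` switch are made up for illustration, and the mongod path and options in the usage comment are placeholders. The first two steps need root. The first step uses `sysctl -w vm.zone_reclaim_mode=0`, which writes the same kernel setting as the `echo` into /proc shown above.

```shell
#!/bin/sh
# Sketch of a start-up wrapper for mongod on a NUMA machine.
# Names (run, start_mongod_numa, DRY_RUN) are illustrative, not part of MongoDB.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: start_mongod_numa /path/to/mongod [mongod options]
start_mongod_numa() {
    mongod_bin=$1
    shift
    # 1. Switch off zone reclaim.
    run sysctl -q -w vm.zone_reclaim_mode=0 || return 1
    # 2. Flush the Linux buffer caches just before starting mongod.
    run sysctl -q -w vm.drop_caches=3 || return 1
    # 3. Start mongod with an interleaved memory policy.
    run numactl --interleave=all "$mongod_bin" "$@"
}

# Example (placeholder path and options):
#   start_mongod_numa /opt/mongodb/bin/mongod --config /etc/mongod.conf
```

Setting `DRY_RUN=1` before calling the function prints the three commands without touching the system, which is a cheap way to check the wrapper before using it. To keep zone reclaim off across reboots, the line `vm.zone_reclaim_mode = 0` can also be added to /etc/sysctl.conf.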

This should solve your problem, but if you want to understand what the issue is and what's really going on, read on!

Older multi-processor systems were designed to have Uniform Memory Access (UMA). This essentially means that all memory is common to all processors and is accessed via a common bus, so memory latency and performance are the same for every processor. This is also referred to as Symmetric Multi-Processing (SMP).

Newer multi-processor systems are designed to have Non-Uniform Memory Access (NUMA). Each processor has a local bank of memory that it can access very quickly and with low latency. It can also access the memory banks of other processors, though with poorer performance and higher latency. This is sometimes also referred to as Cache-Coherent NUMA (ccNUMA).

Linux is aware that it is running on a NUMA system and does the following:

Probes the system to figure out the physical layout
Attaches each memory module to its corresponding processor to create a node
Creates a map (table/matrix) of the cost or weight of inter-node communication between all the nodes
Assigns each thread a preferred node to run on and allocates memory to it, preferably from that same node. If memory is not available on the same node, it is allocated from the available node with the lowest cost
Uses zone reclaim to reclaim memory when a node runs out of memory
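To see the node layout and cost map that Linux built, you can read the per-node `distance` files the kernel exposes under /sys/devices/system/node (`numactl --hardware` prints similar information). The function below is a minimal sketch; its name and the optional directory argument are made up here, and the argument exists only so the function can be pointed somewhere other than the real sysfs path.

```shell
#!/bin/sh
# Print one line per NUMA node with that node's row of the distance matrix.
# print_numa_distances is an illustrative helper, not a standard tool.
print_numa_distances() {
    base=${1:-/sys/devices/system/node}
    found=0
    for dir in "$base"/node[0-9]*; do
        # Skip the unexpanded glob and nodes without a readable distance file.
        [ -r "$dir/distance" ] || continue
        found=1
        printf '%s: %s\n' "${dir##*/}" "$(cat "$dir/distance")"
    done
    if [ "$found" -ne 1 ]; then
        echo "no NUMA node information under $base" >&2
        return 1
    fi
}

# Example:
#   print_numa_distances
```

Each output line lists the relative cost of reaching every node from the node named at the start of the line; the smallest value in a row is the node's distance to itself.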

Zone reclaim lets Linux sysadmins choose a more or less aggressive approach to reclaiming memory when a zone runs out of memory. If it is set to zero, which is the default on current kernels (older kernels could switch it on automatically at boot on some NUMA hardware), then no zone reclaim occurs and allocations are satisfied from other zones / nodes in the system.
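Before changing anything, it is worth checking what the setting currently is on your server. A minimal sketch follows; the function name is made up, and the optional file argument is there only so it can be tried against a file other than the real /proc entry.

```shell
#!/bin/sh
# Report whether zone reclaim is currently on or off.
# describe_zone_reclaim is an illustrative helper, not a standard tool.
describe_zone_reclaim() {
    file=${1:-/proc/sys/vm/zone_reclaim_mode}
    mode=$(cat "$file") || return 1
    if [ "$mode" -eq 0 ]; then
        echo "zone reclaim is off"
    else
        echo "zone reclaim is on (mode $mode)"
    fi
}

# Example:
#   describe_zone_reclaim
```

Any non-zero value means some form of zone reclaim is active, so the first step of the fix applies.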

Now that you have a simplistic view of what NUMA is, let's understand what our solution actually did.

We switched off zone reclaim. This reduces paging, because memory allocations can now be satisfied from other nodes when required.
We flushed Linux's buffer caches. This helps to ensure allocation fairness, even if the daemon is restarted while significant amounts of data are in the operating system buffer cache.
We set memory allocation to interleave for mongod. This allocates memory on a round-robin basis across all nodes, which spreads memory usage evenly and reduces paging.
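One way to check that interleaving took effect is to look at /proc/<pid>/numa_maps for the running mongod. Each line there describes one memory range, its policy (for example `interleave:0-1`) and fields of the form `N<node>=<pages>`. The sketch below adds up those page counts per node; the function name is made up, and the totals are rough because page sizes can differ between mappings.

```shell
#!/bin/sh
# Sum the N<node>=<pages> fields of a numa_maps file, one total per node.
# numa_pages_per_node is an illustrative helper, not a standard tool.
numa_pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' "$1" | sort
}

# Example (assumes exactly one mongod process is running):
#   numa_pages_per_node "/proc/$(pidof mongod)/numa_maps"
```

With the interleave policy in place, the per-node totals should come out roughly equal; a heavily lopsided result suggests mongod was started without numactl.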

For a more in-depth analysis and further reading, check out Jeremy Cole's blog posts:

The MySQL “swap insanity” problem and the effects of the NUMA architecture
A brief update on NUMA and MySQL
Zone Reclaim Mode (search for zone_reclaim on this page)


Friday, September 5, 2014

MongoDB warning for running on NUMA machine
