Using Linux Control Groups to Constrain Process Memory

Linux Control Groups (cgroups) are a nifty way to limit the amount of resource, such as CPU, memory, or IO throughput, that a process or group of processes may use. Frits Hoogland wrote a great blog demonstrating how to use it to constrain the I/O a particular process could use, and was the inspiration for this one. I have been doing some digging into the performance characteristics of OBIEE in certain conditions, including how it behaves under memory pressure. I'll write more about that in a future blog, but wanted to write this short blog to demonstrate how cgroups can be used to constrain the memory that a given Linux process can be allocated.

This was done on Amazon EC2 running an image imported originally from Oracle’s OBIEE SampleApp, built on Oracle Linux 6.5.

$ uname -a  
Linux demo.us.oracle.com 2.6.32-431.5.1.el6.x86_64 #1 SMP Tue Feb 11 11:09:04 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

First off, install the necessary package in order to use them, and start the service. Throughout this blog where I quote shell commands those prefixed with # are run as root and $ as non-root:

# yum install libcgroup  
# service cgconfig start

Create a cgroup (I’m shamelessly ripping off Frits’ code here, hence the same cgroup name ;-) ):

# cgcreate -g memory:/myGroup

You can use cgget to view the current limits, usage, & high watermarks of the cgroup:

# cgget -g memory:/myGroup|grep bytes  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 0  
memory.memsw.usage_in_bytes: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 0  
memory.usage_in_bytes: 0

For more information about the field meaning see the doc here.

To test out the cgroup ability to limit memory used by a process we’re going to use the tool stress, which can be used to generate CPU, memory, or IO load on a server. It’s great for testing what happens to a server under resource pressure, and also for testing memory allocation capabilities of a process which is what we’re using it for here.

We’re going to configure cgroups to add stress to the myGroup group whenever it runs

 $ cat /etc/cgrules.conf  
*:stress memory myGroup

[Re-]start the cg rules engine service:

# service cgred restart

Now we’ll use the watch command to re-issue the cgget command every second enabling us to watch cgroup’s metrics in realtime:

# watch --interval 1 cgget -g memory:/myGroup  
/myGroup:  
memory.memsw.failcnt: 0  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 0  
memory.memsw.usage_in_bytes: 0  
memory.oom_control: oom_kill_disable 0  
        under_oom 0  
memory.move_charge_at_immigrate: 0  
memory.swappiness: 60  
memory.use_hierarchy: 0  
memory.stat: cache 0  
        rss 0  
        mapped_file 0  
        pgpgin 0  
        pgpgout 0  
        swap 0  
        inactive_anon 0  
        active_anon 0  
        inactive_file 0  
        active_file 0  
        unevictable 0  
        hierarchical_memory_limit 9223372036854775807  
        hierarchical_memsw_limit 9223372036854775807  
        total_cache 0  
        total_rss 0  
        total_mapped_file 0  
        total_pgpgin 0  
        total_pgpgout 0  
        total_swap 0  
        total_inactive_anon 0  
        total_active_anon 0  
        total_inactive_file 0  
        total_active_file 0  
        total_unevictable 0  
memory.failcnt: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 0  
memory.usage_in_bytes: 0    

In a separate terminal (or even better, use screen!) run stress, telling it to grab 150MB of memory:

$ stress --vm-bytes 150M --vm-keep -m 1

Review the cgroup, and note that the usage fields have increased:

/myGroup:  
memory.memsw.failcnt: 0  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 157548544  
memory.memsw.usage_in_bytes: 157548544  
memory.oom_control: oom_kill_disable 0  
        under_oom 0  
memory.move_charge_at_immigrate: 0  
memory.swappiness: 60  
memory.use_hierarchy: 0  
memory.stat: cache 0  
        rss 157343744  
        mapped_file 0  
        pgpgin 38414  
        pgpgout 0  
        swap 0  
        inactive_anon 0  
        active_anon 157343744  
        inactive_file 0  
        active_file 0  
        unevictable 0  
        hierarchical_memory_limit 9223372036854775807  
        hierarchical_memsw_limit 9223372036854775807  
        total_cache 0  
        total_rss 157343744  
        total_mapped_file 0  
        total_pgpgin 38414  
        total_pgpgout 0  
        total_swap 0  
        total_inactive_anon 0  
        total_active_anon 157343744  
        total_inactive_file 0  
        total_active_file 0  
        total_unevictable 0  
memory.failcnt: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 157548544  
memory.usage_in_bytes: 157548544

Both memory.memsw.usage_in_bytes and memory.usage_in_bytes are 157548544 = 150.25MB

Having a look at the process stats for stress shows us:

$ ps -ef|grep stress  
oracle   15296  9023  0 11:57 pts/12   00:00:00 stress --vm-bytes 150M --vm-keep -m 1  
oracle   15297 15296 96 11:57 pts/12   00:06:23 stress --vm-bytes 150M --vm-keep -m 1  
oracle   20365 29403  0 12:04 pts/10   00:00:00 grep stress

$ cat /proc/15297/status

Name:   stress  
State:  R (running)  
[...]  
VmPeak:   160124 kB  
VmSize:   160124 kB  
VmLck:         0 kB  
VmHWM:    153860 kB  
VmRSS:    153860 kB  
VmData:   153652 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       328 kB  
VmSwap:        0 kB  
[...]

The man page for proc gives us more information about these fields, but of particular note are:

  • VmSize: Virtual memory size.
  • VmRSS: Resident set size.
  • VmSwap: Swapped-out virtual memory size by anonymous private pages

Our stress process has a VmSize of 156MB, VmRSS of 150MB, and zero swap.

Kill the stress process, and set a memory limit of 100MB for any process in this cgroup:

# cgset -r memory.limit_in_bytes=100m myGroup

Run cgset and you should see the see new limit. Note that at this stage we’re just setting memory.limit_in_bytes and leaving memory.memsw.limit_in_bytes unchanged.

# cgget -g memory:/myGroup|grep limit|grep bytes  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 104857600    

Let’s see what happens when we try to allocate the memory, observing the cgroup and process Virtual Memory process information at each point:

  • 15MB:
     $ stress --vm-bytes 15M --vm-keep -m 1  
    stress: info: [31942] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
    
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 15990784  
    memory.usage_in_bytes: 15990784
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep VmVmPeak:    21884 kB  
    VmSize:    21884 kB  
    VmLck:         0 kB  
    VmHWM:     15616 kB  
    VmRSS:     15616 kB  
    VmData:    15412 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:        60 kB  
    VmSwap:        0 kB     
    
  • 50MB:
     $ stress --vm-bytes 50M --vm-keep -m 1  
    stress: info: [32419] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
    
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 52748288  
    memory.usage_in_bytes: 52748288     
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:    57724 kB  
    VmSize:    57724 kB  
    VmLck:         0 kB  
    VmHWM:     51456 kB  
    VmRSS:     51456 kB  
    VmData:    51252 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:       128 kB  
    VmSwap:        0 kB
    
  • 100MB:
    $ stress --vm-bytes 100M --vm-keep -m 1  
    stress: info: [20379] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd        
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 105197568  
    memory.usage_in_bytes: 104738816
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:   108924 kB  
    VmSize:   108924 kB  
    VmLck:         0 kB  
    VmHWM:    102588 kB  
    VmRSS:    101448 kB  
    VmData:   102452 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:       232 kB  
    VmSwap:     1212 kB
    

Note that VmSwap has now gone above zero, despite the machine having plenty of usable memory:

# vmstat -s  
     16330912  total memory  
     14849864  used memory  
     10583040  active memory  
      3410892  inactive memory  
      1481048  free memory  
       149416  buffer memory  
      8204108  swap cache  
      6143992  total swap  
      1212184  used swap  
      4931808  free swap        

So it looks like the memory cap has kicked in and the stress process is being forced to get the additional memory that it needs from swap.

Let’s tighten the screw a bit further:

$ stress --vm-bytes 200M --vm-keep -m 1  
stress: info: [21945] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is now using 100MB of swap (since we’ve asked it to grab 200MB but cgroup is constraining it to 100MB real):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
VmPeak:   211324 kB  
VmSize:   211324 kB  
VmLck:         0 kB  
VmHWM:    102616 kB  
VmRSS:    102600 kB  
VmData:   204852 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       432 kB  
VmSwap:   102460 kB

The cgget command confirms that we’re using swap, as the memsw value shows:

# cgget -g memory:/myGroup|grep usage|grep -v max  
memory.memsw.usage_in_bytes: 209788928  
memory.usage_in_bytes: 104759296

So now what happens if we curtail the use of all memory, including swap? To do this we’ll set the memory.memsw.limit_in_bytes parameter. Note that running cgset whilst a task under the cgroup is executing seems to get ignored if it is below that currently in use (per the usage_in_bytes field). If it is above this then the change is instantaneous:

  • Current state
    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 9223372036854775807  
    memory.memsw.max_usage_in_bytes: 209915904  
    memory.memsw.usage_in_bytes: 209784832  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104775680
    
  • Set the limit below what is currently in use (150m limit vs 200m in use)
    # cgset -r memory.memsw.limit_in_bytes=150m myGroup
    
  • Check the limit – it remains unchanged
    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 9223372036854775807  
    memory.memsw.max_usage_in_bytes: 209993728  
    memory.memsw.usage_in_bytes: 209784832  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104751104
    
  • Set the limit above what is currently in use (250m limit vs 200m in use)
    # cgset -r memory.memsw.limit_in_bytes=250m myGroup
    
  • Check the limit - it’s taken effect
    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 262144000  
    memory.memsw.max_usage_in_bytes: 210006016  
    memory.memsw.usage_in_bytes: 209846272  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104816640
    

So now we’ve got limits in place of 100MB real memory and 250MB total (real + swap). What happens when we test that out?

$ stress --vm-bytes 245M --vm-keep -m 1  
stress: info: [25927] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is using 245MB total (VmData), of which 95MB is resident (VmRSS) and 150MB is swapped out (VmSwap)

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
VmPeak:   257404 kB  
VmSize:   257404 kB  
VmLck:         0 kB  
VmHWM:    102548 kB  
VmRSS:     97280 kB  
VmData:   250932 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       520 kB  
VmSwap:   153860 kB

The cgroup stats reflect this:

# cgget -g memory:/myGroup|grep bytes  
memory.memsw.limit_in_bytes: 262144000  
memory.memsw.max_usage_in_bytes: 257159168  
memory.memsw.usage_in_bytes: 257007616  
[...]  
memory.limit_in_bytes: 104857600  
memory.max_usage_in_bytes: 104857600  
memory.usage_in_bytes: 104849408

If we try to go above this absolute limit (memory.memsw.max_usage_in_bytes) then the cgroup kicks in a stops the process getting the memory, which in turn causes stress to fail:

$ stress --vm-bytes 250M --vm-keep -m 1  
stress: info: [27356] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd  
stress: FAIL: [27356] (415) <-- worker 27357 got signal 9  
stress: WARN: [27356] (417) now reaping child worker processes  
stress: FAIL: [27356] (451) failed run completed in 3s

This gives you an indication of how careful you need to be using this type of low-level process control. Most tools will not be happy if they are starved of resource, including memory, and may well behave in unstable ways.


Thanks to Frits Hoogland for reading a draft of this post and providing valuable feedback.