Overview of Linux Memory Management Concepts: Slabs

Jim Blakey

Note: In the Linux 2.6 kernels, a new cache manager called the slub allocator is available and may replace the slab allocator described here. Either can be used through kernel configuration parameters. This paper was written a while ago and does not contain the most current kernel information. I will write an update soon - jimb

Introduction to slabs and kernel cache

Kernel modules and drivers often need to allocate temporary storage for non-persistent structures and objects, such as inodes, task structures, and device structures. These objects are uniform in size and are allocated and released many times during the life of the kernel. In earlier Unix and Linux implementations, the usual mechanisms for creating and releasing these objects were the kmalloc() and kfree() kernel calls.

However, these used an allocation scheme that was optimized for allocating and releasing pages in multiples of the hardware page size. For the small transient objects often required by the kernel and drivers, these page allocation routines were horribly inefficient, leaving the individual kernel modules and drivers responsible for optimizing their own memory usage.

One solution was to create a global kernel caching allocator which manages individual caches of identical objects on behalf of kernel modules and drivers. Each module or driver can ask the kernel cache allocator to create a private cache of a specific object type. The cache allocator handles growing each cache as needed on behalf of the module, and more importantly, the cache allocator can release unused pages back to the free pool in times when memory is in a crunch. The cache allocator works with the rest of the memory system to maintain a balance between the memory needs of each driver or module and the system as a whole.

The Linux 2.4 kernel implements a caching memory allocator to hold caches (called slabs) of identical objects. This slab allocator is basically an implementation of the "Slab Allocator" as described in UNIX Internals: The New Frontiers by Uresh Vahalia, Prentice Hall, ISBN 0-13-101908-2, with further tweaks based on a The Slab Allocator: An Object-Caching Kernel Memory Allocator, Jeff Bonwick (Sun Microsystems). presented at: USENIX Summer 1994 Technical Conference.

The following is a partial list of caches maintained by the Slab Allocator on a typical Linux 2.4 system. This information comes from the /proc/slabinfo file. Note that most kernel modules have requested their own caches. The columns are cache name, active objects, total number of objects, object size, number of full or partial pages, total allocated pages, and pages per slab.


slabinfo - version: 1.1
kmem_cache            59     78    100    2    2    1
ip_fib_hash           10    113     32    1    1    1
ip_conntrack           0      0    384    0    0    1
urb_priv               0      0     64    0    0    1
clip_arp_cache         0      0    128    0    0    1
ip_mrt_cache           0      0     96    0    0    1
tcp_tw_bucket          0     30    128    0    1    1
tcp_bind_bucket        5    113     32    1    1    1
tcp_open_request       0      0     96    0    0    1
inet_peer_cache        0      0     64    0    0    1
ip_dst_cache          23    100    192    5    5    1
arp_cache              2     30    128    1    1    1
blkdev_requests      256    520     96    7   13    1
dnotify cache          0      0     20    0    0    1
file lock cache        2     42     92    1    1    1
fasync cache           1    202     16    1    1    1
uid_cache              4    113     32    1    1    1
skbuff_head_cache     93     96    160    4    4    1
sock                 115    126   1280   40   42    1
sigqueue               0     29    132    0    1    1
cdev_cache           156    177     64    3    3    1
bdev_cache            69    118     64    2    2    1
mnt_cache             13     40     96    1    1    1
inode_cache         5561   5580    416  619  620    1
dentry_cache        7599   7620    128  254  254    1
dquot                  0      0    128    0    0    1
filp                1249   1280     96   32   32    1
names_cache            0      8   4096    0    8    1
buffer_head        15303  16920     96  422  423    1
mm_struct             47     72    160    2    3    1
vm_area_struct      1954   2183     64   34   37    1
fs_cache              46     59     64    1    1    1
files_cache           46     54    416    6    6    1

             <snip>

Figure 1: Typical list of caches maintained by kernel

Slab Overview

A slab is a set of one or more contiguous pages of memory set aside by the slab allocator for an individual cache. This memory is further divided into equal segments the size of the object type that the cache is managing.

Figure 2: Two page slab with 6 objects

As an example, assume a file-system driver wishes to create a cache of inodes that it can pull from. Through the kmem_cache_create() call, the slab allocator will calculate the optimal number of pages (in powers of 2) required for each slab given the inode size and other parameters. A kmem_cache_t pointer to this new inode cache is returned to the file-system driver.

When the file-system driver needs a new inode, it calls kmem_cache_alloc() with the kmem_cache_t pointer. The slab allocator will attempt to find a free inode object within the slabs currently allocated to that cache. If there are no free objects, or no slabs, then the slab allocator will grow the cache by fetching a new slab from the free page memory and returning an inode object from that.

When the file-system driver is finished with the inode object, it calls kmem_cache_free() to release the inode. The slab allocator will then mark that object within the slab as free and available.

If all objects within a slab are free, the pages that make up the slab are available to be returned to the free page pool if memory becomes tight. If more inodes are required at a later time, the slab allocator will re-grow the cache by fetching more slabs from free page memory. All of this is completely transparent to the file-system driver.

Creation and management of slab pages

Each slab of pages has an associated slab management structure. The slab_t struct is defined in /usr/src/linux/mm/slab.c has the following format:


typedef struct slab_s {
        struct list_head        list;
        unsigned long           colouroff;
        void                    *s_mem;     /* including colour offset */
        unsigned int            inuse;      /* num of objs active in slab */
        kmem_bufctl_t           free;
} slab_t;

Where:

list is a generic linked list structure found in /usr/include/linux/list.h. This type of list structure is used through out the Linux kernel. It contains a prev and next which are used to track of where this slab is being used. The various lists the slab_t can be on is described in the next section.
colouroff is an offset within the slab where the slab_t and allocated objects begin. This is part of the cache coloring described a little later.
s_mem is a pointer to the first object in the slab. The cache objects are contiguous from this point. Any object in the slab can be referenced by (object number * object size) + s_mem
inuse is a counter of the number of objects currently in use.
free is an integer index of the current next free object within the slab.

The slab_t structure is immediately followed by an array of kmem_bufctl_ts. There is one of these for for each object in the slab. The purpose of the kmem_bufctl_t array is to keep track of allocated and free objects within the slab. Implementations of slab allocators often have these structs containing forward/backward pointers, reference counts, pointers to the slabs they're in, etc. However, the Linux 2.4 implementation of virtual memory uses the page struct to keep track of which caches and slabs cache objects belong to, so the kmem_bufctl_t can be very simple. In fact, it is a single integer index. This array, along with the slab_t>-free member in effect implement the cache heap.

Figure 3. Off slab slab_t structure with 8 kmem_bufctl_t structs

The above figure shows the bufctl in use after several slab objects have been allocated and deleted. The next free object to be allocated will be in slot 3. Then free will assume the value in slot 3's bufctl or the value 1. So slot 1 will get the following next object allocated. Etc.

A slab_t structure may reside in the slab it describes or it may be allocated as part of the kmem_cache cache, depending on the size of the objects in the slab. Figure 2 shows an on-slab slab_t structure, where figure 3 shows on off-slab slab_t. The assumption is that if the objects in the cache are large, it is better to place the slab_t off-slab for less fragmentation and better packing of the objects. The rule is if the object size is greater than 1/8th the page size, the slab_t is allocated off-slab

The number of objects per slab (and therefore the number of pages per slab) is calculated when the cache is initially created. The driver or kernel module that asks for the cache only supplies the size of the objects to be cached, and should never have to care about how many objects there are in the cache.

The basic algorithm for calculating the number of objects in a slab is: First see how many objects fit on one page. If the left over is less than 1/8th the total slab size, then the fragmentation (or wastage) is acceptable and we're done.

If there is too much left over empty space on the first try, we try again with a larger slab. Slabs are grown in multiples of powers of 2 of the page size (kept internally as gfporder). So the first try, gfporder is 0 for 1 page, second try gfporder is 1 for 2 pages, next try gets 4 pages, next gets 8 pages, etc.

The actual algorithm used is obviously a little more complex than what I've just described. There are a lot of other factors and limits to take into account, such as whether the slab_t is kept on the slab (so its size figures into the equation), cache coloring, L1 cache alignment of the objects to reduce cache hits, etc. But the general idea is to find a balance of the number of pages in the slab and the number of objects in the slab for most efficient use of the space available.

The cache

A cache is a group of one or more slabs of object type. The structure that maintains each cache is the kmem_cache_t. The following is a partial list of important fields within the kmem_cache_t structure. This is not a complete structure, as all SMP and statistics related fields were removed for brevity. The full structure is defined in /usr/src/linux/mm/slab.c.


struct kmem_cache_s {
        struct list_head        slabs_full;
        struct list_head        slabs_partial;
        struct list_head        slabs_free;
        unsigned int            objsize;
        unsigned int            flags;  /* constant flags */
        unsigned int            num;    /* # of objs per slab */
        spinlock_t              spinlock;
        ...
        unsigned int            gfporder;
        unsigned int            gfpflags;
        size_t                  colour;         /* cache colouring range */
        unsigned int            colour_off;     /* colour offset */
        unsigned int            colour_next;    /* cache colouring */
        ...
        kmem_cache_t            *slabp_cache;
        unsigned int            growing;
        ...
        void (*ctor)(void *, kmem_cache_t *, unsigned long);
        void (*dtor)(void *, kmem_cache_t *, unsigned long);
        char                    name[CACHE_NAMELEN];
        struct list_head        next;
        ...
};

Where: Where:

slabs_full, slabs_partial,and slabs_free are lists of slabs associated with this cache.
objsize is the size of the objects contained within the cache.
num is the number of cache objects per slab. This is calculated when the cache is created as a function of the size of each object and the best fit for a given number of memory pages per slab.
gfporder is the number of pages per slab as a power of two. For example, for 1 page/slab, gfporder is 0, for 2 pages/slab, gfporder is 1, for 4 pages, 2, for 8 pages, 3, etc... Pages are always allocated in multiples of powers of two.
colour, colour_off and colour_next are used to maintain the cache coloring offset for each slab. This will be discussed later.
*slabp_cache is a pointer to the kernel cache that is used for the slab_t, if the slab_t is maintained off-slab.
growing is a flag that the cache is growing so that the memory pages won't be de-allocated.
ctor and dtor are driver callback routines. These are constructor and destructor routines that the driver can supply that will be called when an object is allocated or released. This allows a driver to have a custom initialization or validation routine on object allocation, or to perform any cleanup before an object is released. These are also very useful for driver debugging.

name is a string describing the cache. This will be printed as part of the /proc/slabinfo file. See Slab Figure 1.
next is a list_struct pointer to the next cache in the kernel cache chain.

Figure 4. kmem_cache_t and its associated slabs

TODO: Add a couple of paragraphs describing use here

Cache Coloring of Slabs

Cache Coloring is a method to ensure that access to the slabs in kernel memory make the best use of the processor L1 cache. This is a performance tweak to try to ensure that we take as few cache hits as possible. Since slabs begin on page boundaries, it is likely that the objects within several different slab pages map to the same cache line, called 'false sharing'. This leads to less than optimal hardware cache performance. By offsetting each beginning of the first object within each slab by some fragment of the hardware cache line size, processor cache hits are reduced.

Figure 5. Cache Coloring within slabs

The `kmalloc()` Interface

Of course, there are times when a kernel module or driver needs to allocate memory for an object that doesn't fit one of the uniform types of the other caches, for example string buffers, one-off structures, temporary storage, etc. For those instances drivers and kernel modules use the kmalloc() and kfree() routines. The Linux kernel ties these calls into the slab allocator too.

On initialization, the kernel asks the slab allocator to create several caches of varying sizes for this purpose. Caches for generic objects of 32, 64, 128, 256, all the way to 131072 bytes are created for both the GFP_NORMAL and GFP_DMA zones of memory.

Figure 6: Use of the cache_sizes array

When a kernel module or driver needs memory, the cache_sizes array is searched to find the cache with the size appropriate to fit the requested object. For example, if a driver requests 166 bytes of GFP_NORMAL memory through kmalloc(), an object from the 256 byte cache would be returned.

When kfree() is called to release the object, the page the object resides in is calculated. Then the page struct for that page is referenced from mem_map (which was set up to point to our kmem_cache_t and slab_t pointers when the slab was allocated). Since we now have the slab and cache for the object, we an release it with __kmem_cache_free().

TODO: Add a summary paragraph here