Follow me on Twitter:

Understanding Linux Virtual Memory

Posted: February 15th, 2010 | Author: | Filed under: Linux / Unix | Tags: , , , , | No Comments »

Virtual memory is one of the most important, and accordingly confusing, piece of an operating system. Understanding the basics of virtual memory is a requisite to understanding operating system performance. Beyond the basics, a deeper understanding allows a systems administrator to interpret system profiling tools better, leading to quicker troubleshooting and better decisions.

The concept of virtual memory is generally taught as though it’s only used for extending the amount of physical RAM in a system. Indeed, paging to disk is important, but virtual memory is used by nearly  every aspect of an operating system.

In addition swapping, virtual memory is used to manage all pages of memory, which incidentally are required for file caching, process isolation, and even network communication. Anything that queues data, you can be assured, traverses the virtual memory system. Depending on a server’s role, virtual memory functionality may not be optimal. An administrator can dramatically improve overall system performance by adjusting certain virtual memory manager settings.

To optimally configure your Virtual Memory Manager (VMM), it’s necessary to understand how it does its job. We’re using Linux for example’s sake, but the concepts apply across the board, though some slight architectural differences will exist between the Unixes.

Nearly every VMM interaction involves the MMU, or Memory Management Unit, excluding the disk subsystem. The MMU allows the operating system to access memory through virtual addresses by using data structures to track these translations. Its main job is to translate these virtual addresses into physical addresses, so that the right section of RAM is accessed.

The Zoned Buddy Allocator interacts directly with the MMU, providing valid pages when the kernel asks for them. It also manages lists of pages and keeps track of different categories of memory addresses.

The Slab Allocator is another layer in front of the Buddy Allocator, and provides the ability to create cache of memory objects in memory. On x86 hardware, pages of memory must be allocated in 4KB blocks, but the Slab Allocator allows the kernel to store objects that are differently sized, and will manage and allocate real pages appropriately.

Finally, a few kernel tasks run to manage specific aspects of the VMM. The bdflush manages block device pages (disk IO), and kswapd handles swapping pages to disk. Pages of memory are either Free (available to allocate), Active (in use), or Inactive. Inactive pages of memory are either dirty or clean, depending on if it has been selected for removal yet or not. An inactive dirty page is no longer in use, but is not yet available for re-use. The operating system must scan for dirty pages, and decide to deallocate them. After they have been guaranteed sync’d to disk, an inactive page my be “clean,” or ready for re-use.

Tunable parameters may be adjusted in real-time via the proc fils system, but to persist across a reboot, /etc/sysctl.conf is the preferred method. Parameters can be entered in real-time via the sysctl command, and then recorded  in the configuration file for reboot persistence.

You can adjust everything from the interval pages are scanned to the amount of memory to reserve for pagecache use. Let’s see a few examples.

Often we’ll want to optimize a system for IO performance. A busy database server, for example, is generally only going to run the database, and it doesn’t matter if the user experience is good or not. If the system doesn’t require much memory for user applications decreasing the available bdflush tunables is beneficial. The specific parameters being adjusted are just too lengthy to explain here, but definitely look into them if you wish to adjust the values further. They are fully explained in vm.txt, usually located at: /usr/src/linux/Documenation/sysctl/vm.txt.

In general, a IO-heavy server will benefit from the following setting these values in sysctl.conf:
vm.bdflush=”100 5000 640 2560 150 30000 5000 1884 2”

The pagecache values control how much memory is used for pagecache. The amount of pagecache allowed translates directly to how many programs and open files can be held in memory.

The three tunable parameters with pagecache are:

  • Min: the minimum amount of memory reserved for pagecache
  • Borrow: the percentage of pages used in the process of reclaiming pages
  • Max: percentage at which kswapd will only page pagecache pages; once it falls below, it can swap out process pages again

On a file server, we’d want to increase the amount of pagecache available, so that data isn’t moved to disk as often. Using vm.pagecache=”10 50 100″ provides more caching, allowing larger and less frequent disk writes for file IO intensive work loads.

On a single-user machine, say your workstation, large number will keep pages in memory, allowing programs to execute faster. Once the upper limit is reached, however, you will start swapping constantly.

Conversely, a server with many users that frequently executes many different programs will not want high amounts of pagecache. The pagecache can easily eat up available memory if it’s too large, so something like vm.pagecache=”10 20 30” is a good compromise.

Finally, the swappiness and vm.overcommit parameters are also very powerful. The overcommit number can be used to allow more memory allocation than RAM exists, which allows you to overcommit the amount of pages. Programs that have a habit of trying to allocate many gigabytes of memory are a hassle, and frequently they don’t use nearly that much memory. Upping the overcommit factor will allow these allocations to happen, but if the application really does use all the RAM, you’ll be swapping like crazy in no time (or worse: running out of swap).

The swappiness concept is heavily debated. If you want to decrease the amount of swapping done by the system, just echo a small number of the range 0-100 into: /proc/sys/vm/swappiness. You don’t generally want to play with this, as it its more mysterious and non-deterministic than the advanced parameters described above. In general, you want applications to swap to avoid them using memory for no reason. Task-specific servers, where you know the amount of RAM and the application requirements, are best suited for swappiness tuning (using a low number to decrease swapping).

These parameters all require a bit of testing, but in the end, you can dramatically increase the performance of many types of servers. The common case of disappointing disk performance stands to gain the most: give the settings a try before going out and buying a faster disk array.

No Comments yet... be the first »