Chapter 7. Performance and Scalability Enhancements

Table of Contents

7.1. Overview
7.2. Faster Locking for Improved Scalability
7.3. Using Operating System Memory Allocators
7.4. Controlling InnoDB Insert Buffering
7.5. Controlling Adaptive Hash Indexing
7.6. Changes Regarding Thread Concurrency
7.7. Changes in the Read Ahead Algorithm
7.8. Multiple Background I/O Threads
7.9. Group Commit
7.10. Controlling the Master Thread I/O Rate
7.11. Controlling the Flushing Rate of Dirty Pages
7.12. Using the PAUSE instruction in InnoDB spin loops
7.13. Control over spin lock polling

7.1. Overview

InnoDB has always been highly efficient, and includes several unique architectural elements to assure high performance and scalability. However, modern operating system and hardware platforms now offer capabilities that did not exist when InnoDB was first developed. InnoDB Plugin 1.0.4 includes several new features that take better advantage of such advances as systems with multi-core processors and improved memory allocation systems. In addition, this release permits you to better control the use of some InnoDB internal subsystems to achieve the best performance with your workload.

InnoDB Plugin 1.0.4 includes new capabilities in these areas:

7.2. Faster Locking for Improved Scalability

In MySQL and InnoDB, multiple threads of execution access shared data structures. InnoDB synchronizes these accesses with its own implementation of mutexes and read/write locks. InnoDB has historically protected the internal state of a read/write lock with an InnoDB mutex. On Unix and Linux platforms, the internal state of an InnoDB mutex is protected by a Pthreads mutex, as in IEEE Std 1003.1c (POSIX.1c).

On many platforms, there is a more efficient way to implement mutexes and read/write locks. Atomic operations can often be used synchronize the actions of multiple threads more efficiently than Pthreads. Each operation to acquire or release a lock can be done in fewer CPU instructions, and thus result in less wasted time when threads are contending for access to shared data structures. This in turn means greater scalability on multi-core platforms.

Beginning with InnoDB Plugin 1.0.3, InnoDB implements mutexes and read/write locks with the built-in functions provided by the GNU Compiler Collection (GCC) for atomic memory access instead of using the Pthreads approach previously used. More specifically, an InnoDB Plugin that is compiled with GCC version 4.1.2 or later will use the atomic builtins instead of a pthread_mutex_t to implement InnoDB mutexes and read/write locks.

On 32-bit Microsoft Windows, InnoDB has implemented mutexes (but not read/write locks) with hand-written assembler instructions. Beginning with Microsoft Windows 2000, it is possible to use functions for Interlocked Variable Access that are similar to the built-in functions provided by GCC. Beginning with InnoDB Plugin 1.0.4, InnoDB makes use of the Interlocked functions on Windows. Unlike the old hand-written assembler code, the new implementation supports read/write locks and 64-bit platforms.

Solaris 10 introduced library functions for atomic operations. Beginning with InnoDB Plugin 1.0.4, when InnoDB is compiled on Solaris 10 with a compiler that does not support the built-in functions provided by the GNU Compiler Collection (GCC) for atomic memory access, the library functions will be used.

This change improves the scalability of InnoDB on multi-core systems. Note that the user does not have to set any particular parameter or option to take advantage of this new feature. This feature is enabled out-of-the-box on the platforms where it is supported. On platforms where the GCC, Windows, or Solaris functions for atomic memory access are not available, InnoDB will use the traditional Pthreads method of implementing mutexes and read/write locks.

When MySQL starts, InnoDB will write a message to the log file indicating whether atomic memory access will be used for mutexes, for mutexes and read/write locks, or neither. If suitable tools are used to build the InnoDB Plugin and the target CPU supports the atomic operations required, InnoDB will use the built-in functions for mutexing. If, in addition, the compare-and-swap operation can be used on thread identifiers (pthread_t), then InnoDB will use the instructions for read-write locks as well.

Note: If you are building from source, see Section 9.3.1, “Building the InnoDB Plugin on Linux or Unix” to ensure that your build process properly takes advantage of your platform capabilities.

7.3. Using Operating System Memory Allocators

When InnoDB was developed, the memory allocators supplied with operating systems and run-time libraries were often lacking in performance and scalability. At that time, there were no memory allocator libraries tuned for multi-core CPUs. Therefore, InnoDB implemented its own memory allocator in the mem subsystem. This allocator is guarded by a single mutex, which may become a bottleneck. InnoDB also implements a wrapper interface around the system allocator (malloc and free) that is likewise guarded by a single mutex.

Today, as multi-core systems have become more widely available, and as operating systems have matured, significant improvements have been made in the memory allocators provided with operating systems. New memory allocators perform better and are more scalable than they were in the past. The leading high-performance memory allocators include Hoard, libumem, mtmalloc, ptmalloc, tbbmalloc, and TCMalloc. Most workloads, especially those where memory is frequently allocated and released (such as multi-table joins) will benefit from using a more highly tuned memory allocator as opposed to the internal, InnoDB-specific memory allocator.

Beginning with InnoDB Plugin 1.0.3, you control whether InnoDB uses its own memory allocator or an allocator of the operating system, by setting the value of the new system configuration parameter innodb_use_sys_malloc in the MySQL option file (my.cnf or my.ini). If set to ON or 1 (the default), InnoDB will use the malloc and free functions of the underlying system rather than manage memory pools itself. This parameter is not dynamic, and takes effect only when the system is started. To continue to use the InnoDB memory allocator in InnoDB Plugin, you will have to set innodb_use_sys_malloc to 0.

Note that when the InnoDB memory allocator is disabled, InnoDB will ignore the value of the parameter innodb_additional_mem_pool_size. The InnoDB memory allocator uses an additional memory pool for satisfying allocation requests without having to fall back to the system memory allocator. When the InnoDB memory allocator is disabled, all such allocation requests will be fulfilled by the system memory allocator.

Furthermore, since InnoDB cannot track all memory use when the system memory allocator is used (innodb_use_sys_malloc is ON), the section “BUFFER POOL AND MEMORY” in the output of the SHOW ENGINE INNODB STATUS command will only include the buffer pool statistics in the “Total memory allocated”. Any memory allocated via the mem subsystem or via ut_malloc will be excluded.

On Unix-like systems that use dynamic linking, replacing the memory allocator may be as easy as making the environment variable LD_PRELOAD or LD_LIBRARY_PATH point to the dynamic library that implements the allocator. On other systems, some relinking may be necessary. Please refer to the documentation of the memory allocator library of your choice.

7.4. Controlling InnoDB Insert Buffering

When INSERTs are done to a table, often the values of indexed columns (particularly the values of secondary keys) are not in sorted order. This means that the inserts of such values into secondary B-tree indexes is “random”, and this can cause excessive i/o if the entire index does not fit in memory. InnoDB has an insert buffer that caches changes to secondary index entries when the relevant page is not in the buffer pool, thus avoiding I/O operations by not reading in the page from the disk. The buffered changes are written into a special insert buffer tree and are subsequently merged when the page is loaded to the buffer pool. The InnoDB main thread merges buffered changes when the server is nearly idle.

Usually, this process will result in fewer disk reads and writes, especially during bulk inserts. However, the insert buffer tree will occupy a part of the buffer pool. If the working set almost fits in the buffer pool, it may be useful to disable insert buffering. If the working set entirely fits in the buffer pool, insert buffering will not be used anyway, because the index would exist in memory.

Beginning with InnoDB Plugin 1.0.3, you can control whether InnoDB performs insert buffering with the system configuration parameter innodb_change_buffering. The allowed values of innodb_change_buffering are none (do not buffer any operations) and inserts (buffer insert operations, the default). You can set the value of this parameter in the MySQL option file (my.cnf or my.ini) or change it dynamically with the SET GLOBAL command, which requires the SUPER privilege. Changing the setting affects the buffering of new operations; the merging of already buffered entries is not affected.

7.5. Controlling Adaptive Hash Indexing

If a table fits almost entirely in main memory, the fastest way to perform queries on it is to use hash indexes rather than B-tree lookups. InnoDB monitors searches on each index defined for a table. If it notices that certain index values are being accessed frequently, it automatically builds an in-memory hash table for that index. Based on the pattern of searches that InnoDB observes, it will build a hash index using a prefix of the index key. The prefix of the key can be any length, and it may be that only a subset of the values in the B-tree will appear in the hash index. InnoDB builds hash indexes on demand for those pages of the index that are often accessed.

The adaptive hash index mechanism allows InnoDB to take advantage of large amounts of memory, something typically done only by database systems specifically designed for databases that reside entirely in memory. Normally, the automatic building and use of adaptive hash indexes will improve performance. However, sometimes, the read/write lock that guards access to the adaptive hash index may become a source of contention under heavy workloads, such as multiple concurrent joins.

You can monitor the use of the adaptive hash index and the contention for its use in the “SEMAPHORES” section of the output of the SHOW ENGINE INNODB STATUS command. If you see many threads waiting on an RW-latch created in btr0sea.c, then it might be useful to disable adaptive hash indexing.

The configuration parameter innodb_adaptive_hash_index can be set to disable or enable the adaptive hash index. See Section 8.3.4, “Dynamically Changing innodb_adaptive_hash_index for details.

7.6. Changes Regarding Thread Concurrency

InnoDB uses operating system threads to process requests from user transactions. (Transactions may issue many requests to InnoDB before they commit or roll back.) On today’s modern operating systems and servers with multi-core processors, where context switching is efficient, most workloads will run well without any limit on the number of concurrent threads. Thanks to several scalability improvements in InnoDB Plugin 1.0.3, and further changes in release 1.0.4, there should be less need to artificially limit the number of concurrently executing threads inside InnoDB.

However, for some situations, it may be helpful to minimize context switching between threads. InnoDB can use a number of techniques to limit the number of concurrently executing operating system threads (and thus the number of requests that are processed at any one time). When InnoDB receives a new request from a user session, if the number of threads concurrently executing is at a pre-defined limit, the new request will sleep for a short time before it tries again. A request that cannot be rescheduled after the sleep is put in a first-in/first-out queue and eventually will be processed. Threads waiting for locks are not counted in the number of concurrently executing threads.

The limit on the number of concurrent threads is given by the settable global variable innodb_thread_concurrency. Once the number of executing threads reaches this limit, additional threads will sleep for a number of microseconds, set by the system configuration parameter innodb_thread_sleep_delay, before being placed into the queue.

The default value for innodb_thread_concurrency and the implied default limit on the number of concurrent threads has been changed in various releases of MySQL and the InnoDB Plugin. Starting with InnoDB Plugin 1.0.3, the default value of innodb_thread_concurrency is 0, so that by default there is no limit on the number of concurrently executing threads, as shown in Table 7.1, “Changes to innodb_thread_concurrency.

Table 7.1. Changes to innodb_thread_concurrency

InnoDB VersionMySQL VersionDefault valueDefault limit of concurrent threadsValue to allow unlimited threads
Built-inEarlier than 5.1.1120No limit20 or higher
Built-in5.1.11 and newer880
InnoDB Plugin before 1.0.3(corresponding to Plugin)880
InnoDB Plugin 1.0.3 and newer(corresponding to Plugin)0No limit0

Note that InnoDB will cause threads to sleep only when the number of concurrent threads is limited. When there is no limit on the number of threads, all will contend equally to be scheduled. That is, if innodb_thread_concurrency is 0, the value of innodb_thread_sleep_delay is ignored.

When there is a limit on the number of threads, InnoDB reduces context switching overhead by permitting multiple requests made during the execution of a single SQL statement to enter InnoDB without observing the limit set by innodb_thread_concurrency. Since a SQL statement (such as a join) may comprise multiple row operations within InnoDB, InnoDB assigns “tickets” that allow a thread to be scheduled repeatedly with minimal overhead.

When starting to execute a new SQL statement, a thread will have no tickets, and it must observe innodb_thread_concurrency. Once the thread is entitled to enter InnoDB, it will be assigned a number of tickets that it can use for subsequently entering InnoDB. If the tickets run out, innodb_thread_concurrency will be observed again and further tickets will be assigned. The number of tickets to assign is specified by the global option innodb_concurrency_tickets, which is 500 by default. A thread that is waiting for a lock will be given one ticket once the lock becomes available.

The correct values of these variables are dependent on your environment and workload. You will need to try a range of different values to determine what value works for your applications. Before limiting the number of concurrently executing threads, you should review configuration options that may improve the performance of InnoDB on multi-core and multi-processor computers, such as innodb_use_sys_malloc and innodb_adaptive_hash_index.

7.7. Changes in the Read Ahead Algorithm

A read ahead request is an I/O request to prefetch multiple pages in the buffer cache asynchronously in anticipation that these pages will be needed in the near future. InnoDB has historically used two read ahead algorithms to improve I/O performance.

Random read ahead is done if a certain number of pages from the same extent (64 consecutive pages) are found in the buffer cache. In such cases, InnoDB asynchronously issues a request to prefetch the remaining pages of the extent. Random read ahead added unnecessary complexity to the InnoDB code and often resulted in performance degradation rather than improvement. Starting with InnoDB Plugin 1.0.4, this feature has been removed from InnoDB, and users should generally see equivalent or improved performance.

Linear read ahead is based on the access pattern of the pages in the buffer cache, not just their number. In releases before 1.0.4, if most pages belonging to some extent are accessed sequentially, InnoDB will issue an asynchronous prefetch request for the entire next extent when it reads in the last page of the current extent. Beginning with InnoDB Plugin 1.0.4, users can control when InnoDB performs a read ahead, by adjusting the number of sequential page accesses required to trigger an asynchronous read request using the new configuration parameter innodb_read_ahead_threshold.

If the number of pages read from an extent of 64 pages is greater or equal to innodb_read_ahead_threshold, InnoDB will initiate an asynchronous read ahead of the entire following extent. Thus, this parameter controls how sensitive InnoDB is to the pattern of page accesses within an extent in deciding whether to read the following extent asynchronously. The higher the value, the more strict will be the access pattern check. For example, if you set the value to 48, InnoDB will trigger a linear read ahead request only when 48 pages in the current extent have been accessed sequentially. If the value is 8, InnoDB would trigger an asynchronous read ahead even if as few as 8 pages in the extent were accessed sequentially.

The new configuration parameter innodb_read_ahead_threshold may be set to any value from 0-64. The default value is 56, meaning that an asynchronous read ahead is performed only when 56 of the 64 pages in the extent are accessed sequentially. You can set the value of this parameter in the MySQL option file (my.cnf or my.ini), or change it dynamically with the SET GLOBAL command, which requires the SUPER privilege.

7.8. Multiple Background I/O Threads

InnoDB uses background threads to service various types of I/O requests. Starting from InnoDB Plugin 1.0.4, the number of background threads tasked with servicing read and write I/O on data pages is configurable. In previous versions of InnoDB, there was only one thread each for read and write on non-Windows platforms. On Windows, the number of background threads was controlled by innodb_file_io_threads. The configuration parameter innodb_file_io_threads has been removed in InnoDB Plugin 1.0.4. If you try to set a value for this parameter, a warning will be written to the log file and the value will be ignored.

In place of innodb_file_io_threads, two new configuration parameters are introduced in the InnoDB Plugin 1.0.4, which are effective on all supported platforms. The two parameters innodb_read_io_threads and innodb_write_io_threads signify the number of background threads used for read and write requests respectively. You can set the value of these parameters in the MySQL option file (my.cnf or my.ini). These parameters cannot be changed dynamically. The default value for these parameters is 4 and the permissible values range from 1-64.

The purpose of this change is to make InnoDB more scalable on high end systems. Each background thread can handle up to 256 pending I/O requests. A major source of background I/O is the read ahead requests. InnoDB tries to balance the load of incoming requests in such way that most of the background threads share work equally. InnoDB also attempts to allocate read requests from the same extent to the same thread to increase the chances of coalescing the requests together. If you have a high end I/O subsystem and you see more than innodb_read_io_threads * 64 pending read requests in SHOW INNODB STATUS, then you may gain by increasing the value of innodb_read_io_threads.

7.9. Group Commit

InnoDB, like any other ACID compliant database engine, is required to flush the redo log of a transaction before it is committed. Historically InnoDB used group commit functionality to group multiple such flush requests together to avoid one flush for each commit. With group commit, InnoDB can issue a single write to the log file to effectuate the commit action for multiple user transactions that commit at about the same time, significantly improving throughput.

Group commit in InnoDB worked until MySQL 4.x. With the introduction of support for the distributed transactions and Two Phase Commit (2PC) in MySQL 5.0, group commit functionality inside InnoDB was broken.

Beginning with InnoDB Plugin 1.0.4, the group commit functionality inside InnoDB works with the Two Phase Commit protocol in MySQL. Re-enabling of the group commit functionality fully ensures that the ordering of commit in the MySQL binlog and the InnoDB logfile is the same as it was before. It means it is totally safe to use InnoDB Hot Backup with InnoDB Plugin 1.0.4.

Group commit is transparent to the user and nothing needs to be done by the user to take advantage of this significant performance improvement.

7.10. Controlling the Master Thread I/O Rate

The master thread in InnoDB is a thread that performs various tasks in the background. Most of these tasks are I/O related like flushing of the dirty pages from the buffer cache or writing the buffered inserts to the appropriate secondary indexes. The master thread attempts to perform these tasks in a way that does not adversely affect the normal working of the server. It tries to estimate the free I/O bandwidth available and tune its activities to take advantage of this free capacity. Historically, InnoDB has used a hard coded value of 100 IOPs (input/output operations per second) as the total I/O capacity of the server.

Beginning with InnoDB Plugin 1.0.4, a new configuration parameter is introduced to indicate the overall I/O capacity available to InnoDB. The new parameter innodb_io_capacity should be set to approximately the number of I/O operations that the system can perform per second. The value will of course depend on your system configuration. When innodb_io_capacity is set, the master threads estimates the I/O bandwidth available for background tasks based on the set value. Setting the value to 100 will result in reverting to the old behavior.

You can set the value of innodb_io_capacity to any number greater than 100, and the default value is 200. You can set the value of this parameter in the MySQL option file (my.cnf or my.ini) or change it dynamically with the SET GLOBAL command, which requires the SUPER privilege.

7.11. Controlling the Flushing Rate of Dirty Pages

InnoDB performs certain tasks in the background, including flushing of dirty pages (those pages that have been changed but are not yet written to the database files) from the buffer cache, a task performed by the “master thread”. Currently, the master thread aggressively flushes buffer pool pages if the percentage of dirty pages in the buffer pool exceeds innodb_max_dirty_pages_pct.

This behavior can cause temporary reductions in throughput when excessive buffer pool flushing takes place, limiting the I/O capacity available for ordinary read and write activity. Beginning with release 1.0.4, InnoDB Plugin uses a new algorithm to estimate the required rate of flushing based on the speed of redo log generation and the current rate of flushing. The intent of this change is to smooth overall performance, eliminating steep dips in throughput, by ensuring that buffer flush activity keeps up with the need to keep the buffer pool “clean”.

Remember that InnoDB uses its log files in a circular fashion. To make a log file (or a portion of it) reusable, InnoDB must flush to disk all dirty buffer pool pages whose redo entries are contained in that portion of the log file. When required, InnoDB performs a so-called “sharp checkpoint” by flushing the appropriate dirty pages to make space available in the log file. If a workload is write intensive, it will generate a lot of redo information (writes to the log file). In this case, it is possible that available space in the log files will be used up, even though innodb_max_dirty_pages_pct is not reached. This will cause a sharp checkpoint, causing a temporary reduction in throughput.

InnoDB Plugin 1.0.4 uses a new heuristic based algorithm to avoid such a scenario. The heuristic is a function of the number of dirty pages in the buffer cache and the rate at which redo is being generated. Based on this heuristic, the master thread will decide how many dirty pages to flush from the buffer cache each second. This self adapting heuristic is able to deal with sudden changes in the workload.

The primary aim of this feature is to smooth out I/O activity, avoiding sudden dips in throughput when flushing activity becomes high. Internal benchmarking has also shown that this algorithm not only maintains throughput over time, but can also improve overall throughput significantly.

Because adaptive flushing is a new feature that can significantly affect the I/O pattern of a workload, the InnoDB Plugin introduces a new configuration parameter that can be used to disable this feature. The default value of the new boolean parameter innodb_adaptive_flushing is TRUE, enabling the new algorithm. You can set the value of this parameter in the MySQL option file (my.cnf or my.ini) or change it dynamically with the SET GLOBAL command, which requires the SUPER privilege.

7.12. Using the PAUSE instruction in InnoDB spin loops

Synchronization inside InnoDB frequently involves the use of spin loops (where, while waiting, InnoDB executes a tight loop of instructions repeatedly to avoid having the InnoDB process and threads be rescheduled by the operating system). If the spin loops are executed too quickly, system resources are wasted, imposing a relatively severe penalty on transaction throughput. Most modern processors implement the PAUSE instruction for use in spin loops, so the processor can be more efficient.

Beginning with 1.0.4, the InnoDB Plugin uses a PAUSE instruction in its spin loops on all platforms where such an instruction is available. This technique increases overall performance with CPU-bound workloads, and has the added benefit of minimizing power consumption during the execution of the spin loops.

Using the PAUSE instruction in InnoDB spin loops is transparent to the user. User does not have to do anything to take advantage of this performance improvement.

7.13. Control over spin lock polling

Many InnoDB mutexes and rw-locks are reserved for a short amount of time. On a multi-core system, it is often more efficient for a thread to actively poll a mutex or rw-lock for a while before sleeping. If the mutex or rw-lock becomes available during this polling period, the thread may continue immediately, in the same time slice. Alas, if a shared object is being polled too frequently by multiple threads, it may result in “cache ping-pong”, the shipping of cache lines between processors. InnoDB tries to avoid this by making threads busy, waiting a random time between subsequent polls. The delay is implemented as a busy loop.

Starting with InnoDB Plugin 1.0.4, it is possible to control the maximum delay between sampling a mutex or rw-lock using the new parameter innodb_spin_wait_delay. In the 100 MHz Pentium era, the unit of delay used to be one microsecond. The duration of the delay loop depends on the C compiler and the target processor. On a system where all processor cores share a fast cache memory, it might be useful to reduce the maximum delay or disable the busy loop altogether by setting innodb_spin_wait_delay=0. On a system that consists of multiple processor chips, the shipping of cache lines can be slower and it may be useful to increase the maximum delay.

The default value of innodb_spin_wait_delay is 6. The spin wait delay is a dynamic, global parameter that can be specified in the MySQL option file (my.cnf or my.ini) or changed at runtime with the command SET GLOBAL innodb_spin_wait_delay=delay, where delay is the desired maximum delay. Changing the setting requires the SUPER privilege.