- Overview of In-Memory OLTP
- In-Memory Optimization Requirements
- Limitations of In-Memory OLTP
- Using In-Memory OLTP
- Using Memory-Optimized Tables
- Logging, Checkpoint, and Recovery for In-Memory OLTP
- Managing Memory for In-Memory OLTP
- Backup and Recovery of Memory-Optimized Databases
- Migrating to In-Memory OLTP
- Dynamic Management Views for In-Memory OLTP
- The Buffer Pool Extension
- Summary
Logging, Checkpoint, and Recovery for In-Memory OLTP
Any data modifications on durable, memory-optimized tables (those created with durability = schema_and_data) are logged in the database transaction log to guarantee recovery of the tables to a known state after a system shutdown or failure. However, logging for memory-optimized tables is done differently than for disk-based tables.
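For reference, durability is specified per table when the table is created. The following sketch, using a hypothetical dbo.OrderDetail table, shows the DURABILITY = SCHEMA_AND_DATA option that makes a memory-optimized table fully logged and recoverable:

-- Hypothetical durable memory-optimized table; changes to it are logged
-- and written to checkpoint files so the data survives a restart
CREATE TABLE dbo.OrderDetail
(
    OrderID   INT NOT NULL,
    ProductID INT NOT NULL,
    Quantity  INT NOT NULL,
    CONSTRAINT PK_OrderDetail PRIMARY KEY NONCLUSTERED
        HASH (OrderID, ProductID) WITH (BUCKET_COUNT = 1048576)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);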
In addition to writing log records for durable memory-optimized tables to disk, In-Memory OLTP also invokes a checkpoint process that writes data for the tables to durable storage. SQL Server writes these pieces of information using log streams and checkpoint streams.
Log streams contain the changes made by committed transactions logged as insertion and deletion of row versions and are stored in the SQL Server transaction log.
Checkpoint streams come in two varieties: data streams and delta streams. Data streams contain all row versions inserted during a timestamp interval. Delta streams are associated with a particular data stream and contain a list of integers indicating which versions in the corresponding data stream have been deleted. Checkpoint streams are stored in SQL Server filestream files, which are in essence sequential files fully managed by SQL Server.
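The filestream files that hold the checkpoint streams live in the database's memory-optimized filegroup. As a reminder (the database, filegroup, and container names here are hypothetical), such a filegroup and its container are added along these lines:

-- The MEMORY_OPTIMIZED_DATA filegroup holds the filestream containers
-- into which the checkpoint data and delta streams are written
ALTER DATABASE MyInMemDB
    ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE MyInMemDB
    ADD FILE (NAME = 'imoltp_container1',
              FILENAME = 'C:\SQLData\imoltp_container1')
    TO FILEGROUP imoltp_fg;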
Transaction Logging
In-Memory OLTP’s transaction logging is optimized for scalability and high performance through the use of reduced logging. Given the same workload, In-Memory OLTP will write significantly fewer log records for a memory-optimized table than for an equivalent disk-based table.
As described in the previous section, In-Memory OLTP does not use write-ahead logging for memory-optimized tables as SQL Server does for disk-based tables. It generates and saves log records only at transaction commit rather than during each row operation. Because no "dirty data" from uncommitted transactions is ever written to disk, there is also no need to write any undo records to the transaction log, since In-Memory OLTP will never need to use the log to roll back uncommitted changes.
In addition, as you may remember from the earlier description of memory-optimized indexes, the indexes on memory-optimized tables are memory resident only, so In-Memory OLTP doesn't write any index modifications to the transaction log either.
A third mechanism for reducing logging overhead for memory-optimized tables is grouping multiple changes into one large log record (the maximum log record size in SQL Server 2014 is 24KB). Combining multiple changes into larger log records means fewer log records are written, which minimizes log-header overhead and helps reduce contention when inserting into the log buffer.
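If you want to see the reduced logging for yourself, you can peek at the transaction log with the undocumented fn_dblog function. This is an exploratory sketch only; the output format of fn_dblog is not documented and can change between versions. Log records generated for memory-optimized tables appear with operation values such as LOP_HK:

-- Examine the most recent log records (undocumented function).
-- Records for memory-optimized tables show operations such as LOP_HK,
-- and you will typically see far fewer, larger records than for an
-- equivalent workload against a disk-based table.
SELECT [Current LSN], Operation, [Log Record Length]
FROM fn_dblog(NULL, NULL)
ORDER BY [Current LSN] DESC;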
Checkpoint
If the transaction log were never truncated, all the changes made to your memory-optimized tables could be reconstructed from the transactions recorded in the log. However, this is not a feasible approach because the recovery time would be unacceptably long. Just as for operations on disk-based tables, one of the main reasons for checkpoint operations is to reduce recovery time. For In-Memory OLTP, the checkpoint process for memory-optimized tables is designed to satisfy two important requirements:
- Continuous Checkpointing—Checkpoint operations occur incrementally and continuously as transactional activity accumulates. A background process continuously scans transaction log records and writes to the data and delta files on disk.
- Streaming I/O—Writing to the data and delta files is done in an append-only manner by appending newly created rows to the end of the current data file and by appending the deleted rows to the corresponding delta file.
Checkpointing is the continuous background process of constructing data files and delta files and writing to them from the transaction log.
A checkpoint data file contains rows from one or more memory-optimized tables inserted by multiple transactions as part of INSERT or UPDATE operations. Multiple data files are created with each file covering a specific timestamp range. Each data file can be up to a maximum of 128MB in size (16MB for systems with 16GB or less of memory).
Once a data file is full, the rows inserted by new transactions are stored in another data file. Over time, the rows from durable memory-optimized tables are stored across one or more data files, with each data file containing rows whose commit timestamps fall within the range covered by that file. For example, a data file covering the transaction commit timestamp range (100, 200) contains all the rows inserted by transactions with commit timestamps in this range.
Data files are append-only while they are open and are written to using sequential filestream I/O. Once data files are closed, they are strictly read-only. At recovery time the valid versions stored in the data files are reloaded into memory and reindexed.
When a row is deleted or updated by a later transaction, the row is not removed or changed in place in the data file. Instead, deleted rows are tracked in a second type of file, the delta file, which eliminates random I/O against the data file. Each data file is paired with a corresponding delta file.
A delta file stores information about which rows contained in a data file have been subsequently deleted. There is a 1:1 correspondence between delta files and data files and the delta file has the same transaction range as its corresponding data file. Like data files, the delta file is written to using sequential filestream I/O. Delta files are append-only for the lifetime of the data file they correspond to. At recovery time, the delta file is used as a filter to avoid reloading deleted versions into memory. Because each data file is paired with exactly one delta file, the smallest unit of work for recovery is a single data/delta file pair. This allows the recovery process to be highly parallelizable.
A data and delta file pair is referred to as a Checkpoint File Pair (CFP). A maximum of 8192 CFPs are supported for each database.
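You can see how many data and delta files currently exist in a database, and how large they are, with a quick query against the sys.dm_db_xtp_checkpoint_files DMV (examined in more detail later in this section):

-- Number and total size of checkpoint data and delta files
-- in the current database
SELECT file_type_desc,
       COUNT(*) AS file_count,
       SUM(file_size_in_bytes) / 1048576 AS total_size_mb
FROM sys.dm_db_xtp_checkpoint_files
GROUP BY file_type_desc;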
Checkpoint Events
The data and delta files are written to continuously by the checkpoint background process. While a file pair is open, it has a state of UNDER CONSTRUCTION. When a checkpoint event occurs, the UNDER CONSTRUCTION files are closed and their state is set to ACTIVE. From this point on, rows from new INSERTs are no longer written to these files, but the files remain ACTIVE because they are still affected by DELETE and UPDATE operations, which mark the corresponding rows in the data file as deleted in the associated delta file.
Checkpoint events for memory-optimized tables are independent of checkpoints for disk-based tables. Checkpoints for disk-based tables are generated based upon the configured recovery interval and involve flushing all "dirty pages" from the buffer pool to disk. A complete checkpoint of memory-optimized tables consists of multiple data and delta files, plus a checkpoint file inventory that contains references to all the data and delta files that make up a complete checkpoint.
The completion of a checkpoint involves flushing the latest content of data and delta files to disk and constructing the checkpoint inventory which is written to the transaction log. Checkpoint events are generated either automatically or manually.
An automatic checkpoint occurs when the overall size of the transaction log has grown by 512MB since the last checkpoint. This includes operations logged for disk-based tables as well. It is not dependent on the amount of work done on memory-optimized tables. It’s conceivable that there may not have been any transactions on memory-optimized tables when an automatic checkpoint event occurs.
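If you want to observe how much work the checkpoint background process is doing for memory-optimized tables, you can look at the sys.dm_db_xtp_checkpoint_stats DMV. The exact columns returned vary by build, so this sketch simply selects everything:

-- Statistics about the In-Memory OLTP checkpoint background activity
-- for the current database
SELECT *
FROM sys.dm_db_xtp_checkpoint_stats;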
Manual checkpoints occur whenever an explicit CHECKPOINT command is issued in the database and include checkpoint operations for both disk-based tables and memory-optimized tables.
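For example, issuing the CHECKPOINT command in the context of a database containing memory-optimized tables (the database name below is hypothetical) triggers a checkpoint event that closes the current UNDER CONSTRUCTION files:

USE MyInMemDB;
GO
-- Forces a checkpoint for both disk-based and memory-optimized tables
CHECKPOINT;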
Merging Checkpoint Files
Checkpoint files can continue to accumulate. As data modifications proceed, a "deleted" row remains in the data file, but the delta file records the fact that it was deleted. Over time, the percentage of meaningful content in older data files falls as a result of DELETEs and UPDATEs.
Since the recovery process reads the content of all data and delta files, recovery times will increase as it has to scan through a large number of files containing few relevant rows. To reduce the number of CFPs, SQL Server will eventually merge adjacent data files, so that rows marked as deleted actually get deleted from the checkpoint data file, and create a new CFP from the merged data files.
A background task runs periodically and examines all ACTIVE CFPs to determine whether any adjacent sets of CFPs qualify to be merged. Files can be merged when the percentage of undeleted rows falls below a threshold. The main qualification is that merging the files will result in a file of undeleted records that is 128MB or less in size (or 16MB for systems with 16GB of memory or less). Merging can also occur if two adjacent files are both less than 50% full (possibly as the result of a manual checkpoint having been run).
In most cases, automatic merging of checkpoint files will be sufficient to keep the number of files manageable. However, in rare situations or for testing purposes, you can manually initiate a checkpoint merge using the sp_xtp_merge_checkpoint_files stored procedure. To identify files that might be eligible for merge, you can examine the information returned by a query against the sys.dm_db_xtp_checkpoint_files DMV:
SELECT file_type_desc, state_desc, lower_bound_tsn, upper_bound_tsn,
       file_size_in_bytes, file_size_used_in_bytes
FROM sys.dm_db_xtp_checkpoint_files
WHERE state_desc = 'ACTIVE'
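Once you have identified a range of adjacent ACTIVE files that would benefit from merging, you can request a merge for that transaction range. The sketch below assumes a hypothetical database name and uses lower and upper transaction bounds taken from the lower_bound_tsn and upper_bound_tsn columns of the query above; check the documentation for the exact parameter list of the procedure:

-- Request a manual merge of the ACTIVE checkpoint file pairs covering
-- the transaction range 100 through 300 (values are illustrative only)
EXEC sys.sp_xtp_merge_checkpoint_files 'MyInMemDB', 100, 300;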
Checkpoint Garbage Collection
At a certain point when the rows in a CFP are no longer needed (e.g., the oldest transaction still required by SQL Server is more recent than the time range covered by the CFP), the CFP will transition into a non-active state. This usually occurs when the log truncation point required for recovery is more recent than the largest transaction ID in the CFP transaction ID range.
Assuming a log backup has occurred, these CFPs are no longer required and can be removed by the checkpoint file garbage collection process, a background process that runs automatically.
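In other words, taking log backups is what eventually allows non-active CFPs to be removed. A routine log backup such as the following (hypothetical database name and backup path) lets the log truncation point advance so that garbage collection can reclaim the files:

-- A log backup allows the log truncation point to advance, after which
-- CFPs no longer needed for recovery can be garbage collected
BACKUP LOG MyInMemDB
    TO DISK = N'D:\Backups\MyInMemDB_log.trn';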
Recovery
The basic mechanism to recover or restore a database with memory-optimized tables is similar to the recovery process of databases with only disk-based tables. However, recovery of memory-optimized tables also includes the step of loading the memory-optimized tables into memory before the database is available for user access.
When SQL Server restarts, each database goes through a recovery process that consists of three phases:
- The analysis phase.
- The redo phase.
- The undo phase.
During the analysis phase, the In-Memory OLTP engine identifies the checkpoint inventory to load and preloads its system table log entries. It will also process some file allocation log records.
During the redo phase, for memory-optimized tables, the data from the data and delta file pairs is loaded into memory, the data is then updated from the active transaction log based on the last durable checkpoint, and the in-memory tables are populated and their indexes rebuilt. During this phase, recovery of disk-based and memory-optimized tables runs concurrently.
The undo phase is not needed for memory-optimized tables since In-Memory OLTP doesn’t record any uncommitted transactions for memory-optimized tables.
When the above operations are completed for both disk-based and memory-optimized tables, the database is available for access.
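After recovery completes, one way to confirm that the memory-optimized tables were reloaded is to check how much memory they now consume. This sketch uses the sys.dm_db_xtp_table_memory_stats DMV; the column names shown are assumed from the SQL Server 2014 documentation, so verify them against your build:

-- Memory consumed by each memory-optimized table after it has been
-- reloaded into memory during recovery
SELECT OBJECT_NAME(object_id) AS table_name,
       memory_used_by_table_kb,
       memory_used_by_indexes_kb
FROM sys.dm_db_xtp_table_memory_stats;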
Since loading memory-optimized tables into memory can increase recovery time, the In-Memory OLTP engine attempts to improve the load time of memory-optimized data by processing the data and delta files in parallel. It first creates a delta map filter: one thread per container reads the delta files and builds the filter. Once the delta map filter is created, data files are read using as many threads as there are logical CPUs. Each thread reading a data file reads the data rows, checks each row against the associated delta map, and inserts the row into the table only if it has not been marked as deleted.