「原创」竟然redolog写入机制都不懂…怎么破？

本文简要讲解了MySQL中 redolog 的写入机制，这有助于深入理解MySQL数据一致性和持久性的实现，也可以学习到如何利用 redolog 调优 MySQL 的 IOPS。

点击上方“后端开发技术”，选择“设为星标” ，优质资源及时送达

redo log 写入机制

在MySQL更新数据的时候，是先更新数据，然后生成redolog，此时redolog是prepare 状态，然后保存binlog，紧接着提交事务，此时redolog被改为 commit 状态，写入完成。

这个步骤在上一篇文章中已经详细介绍过，如果有不懂，请移步阅读。

「原创」从RedoLog和BinLog聊到一条Update语句的执行

2022-06-02

执行流程

对于上述流程，让我们拆开看看。

在未提交事务的时候，日志并不能直接写入磁盘中的 ib-logfile-n 文件（redo log磁盘中的文件名），所以需要一块内存暂时存放，于是页需要对应的buffer —— redolog buffer。

回顾整个redolog 的写入流程，在处于 prepare 阶段的时候，日志可能存储在 redolog buffer中，在事务提交后，日志被保存到 binlog 中。

最后保存commit事务保存到 binlog 中这个阶段其实有两种可能：

1.日志从 buffer 被 write 到文件系统的的 page cache 中，在适当的时候，系统会调用 fsync 将 page cache中的 redolog 数据会持久化到磁盘中（保存到 ib-logfile-n 文件）

2.日志直接从 buffer 持久化到了磁盘中。

tips：日志写到 redo log buffer 是很快的，wirte 到 page cache 也比较快，但是持久化到磁盘的速度会慢很多。

redolog 的 fsync 策略

注意，redolog 去write或者直接执行fsync的时候是将redolog buffer中的全部内容都会刷入。

事务commit的时候执行fsync
在ABC三个事务同时进行时，如果A事务被提交，B事务还未结束状态，c处于prepare已提交，BC都会跟随A事务被持久化到磁盘中。
定时任务每秒都会把 redo log buffer中的日志 write到文件系统中page cache中，然后调用fsync执行持久化
redo log buffer 占用的空间即将达到 innodb_log_buffer_size 一半的时候，后台线程会主动write写盘。

在 2、3、5阶段，都有可能将未提交的事务刷入磁盘。

redo log 的写入策略可以通过参数 innodb_flush_log_at_trx_commit 配置。

设置为 0 ：redo log 留在 redo log buffer 中，依靠定时任务每秒刷新到磁盘（不推荐），与事务提交无关。
设置为 1 （默认）：表示每次事务提交时都将 redo log 持久化到磁盘（最安全）
设置为 2 ：表示每次事务提交时都只是把 redo log 写到 page cache，每秒刷一次到磁盘（速度快，但是会丢1s的数据，甚至更多，1s并不严格）

为了方便大家做深入研究，我把官网对于这个参数的解释贴在了下方。

The default setting of 1 is required for full ACID compliance. Logs are written and flushed to disk at each transaction commit.

With a setting of 0, logs are written and flushed to disk once per second. Transactions for which logs have not been flushed can be lost in a crash.

With a setting of 2, logs are written after each transaction commit and flushed to disk once per second. Transactions for which logs have not been flushed can be lost in a crash.

For settings 0 and 2, once-per-second flushing is not 100% guaranteed. Flushing may occur more frequently due to DDL changes and other internal InnoDB activities that cause logs to be flushed independently of the innodb_flush_log_at_trx_commit setting, and sometimes less frequently due to scheduling issues. If logs are flushed once per second, up to one second of transactions can be lost in a crash. If logs are flushed more or less frequently than once per second, the amount of transactions that can be lost varies accordingly.

Log flushing frequency is controlled by innodb_flush_log_at_timeout, which allows you to set log flushing frequency to N seconds (where N is 1 ... 2700, with a default value of 1). However, any unexpected mysqld process exit can erase up to N seconds of transactions.

DDL changes and other internal InnoDB activities flush the log independently of the innodb_flush_log_at_trx_commit setting.

InnoDB crash recovery works regardless of the innodb_flush_log_at_trx_commit setting. Transactions are either applied entirely or erased entirely.

通过阅读可以发现，官方推荐设置“双1”，以保证数据的的一致性和持久性。

“双1”说的就是 redolog 每次提交事务都会去 write 和 fsync，而 binlog 也会在事务提交的时候做 fsync。

关于binlog的持久化机制，我会在以后详细讲解。

For durability and consistency in a replication setup that uses InnoDB with transactions:

If binary logging is enabled, set sync_binlog=1.

Always set innodb_flush_log_at_trx_commit=1.

其实每秒write和fsync的频率也可以设置，通过innodb_flush_log_at_timeout,默认是1。

补充知识：page cache是什么

Linux基础：Page Cache 定义

中文名称：页高速缓冲存储器，简称页高缓。

单位：页。

大小：动态变化，因为操作系统会将所有未直接分配给应用程序的物理内存都用于页面缓存。

文件系统层级的缓存：page cache 用于缓存文件的页数据，从磁盘中读取到的内容是存储在page cache里的。

说白了 page cache 就是一块物理内存，对应磁盘上的一块存储空间。当linux内核发起一个写请求时（例如进程发起write()请求），直接往cache中写入。内核会将被写入的page标记为dirty，并将其加入dirty list中。内核会周期性地将 dirty list中的page写回到磁盘上，从而使磁盘上的数据和内存中缓存的数据一致。

最后，欢迎大家提问和交流。

如果觉得对你有帮助，欢迎点赞或分享，感谢阅读！