Contents
- RDB (Redis Database)
  - Overview
  - Usage
  - Advantages
  - Disadvantages
- AOF (Append Only File)
  - Overview
  - Usage
  - Advantages
  - Disadvantages
- RDB + AOF
  - Overview
  - Usage
- Master-Slave Architecture
  - Full Synchronization
  - Partial Synchronization
  - Source Analysis of the Replication Backlog (shared with the AOF buffer)
  - Replication Storm
- Sentinel Architecture (to be completed)
- Cluster (to be completed)
Redis Persistence
For an in-memory store, persistence generally comes in two forms, chosen by trading off performance against the required level of data safety: snapshots and operation logs.
RDB (Redis Database)
Overview
RDB persistence takes point-in-time snapshots of the in-memory dataset at specified intervals. By default, Redis saves the snapshot to a binary file named dump.rdb.
Usage
Configure the snapshot policy in the configuration file (if no save rule is configured, RDB is not used):
# at least 1000 keys changed within 60 seconds
# save 60 1000
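For reference, a commonly used set of save rules looks like the following (a sketch of the long-standing defaults; check the redis.conf shipped with your version before relying on it):
save 900 1      # after 900 seconds if at least 1 key changed
save 300 10     # after 300 seconds if at least 10 keys changed
save 60 10000   # after 60 seconds if at least 10000 keys changed
# save ""       # a single empty rule disables RDB snapshots entirely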
You can also take a snapshot manually:
- save: runs synchronously, blocks all other Redis commands, and needs no extra memory
- bgsave: runs asynchronously, does not block other Redis commands, and needs extra memory
The bgsave command relies on the operating system's copy-on-write (COW) mechanism, so Redis can keep serving write commands while the snapshot is being generated. The main process forks a child process that shares the parent's memory pages. As long as both sides only read, nothing is copied; when the main process executes a write command, COW kicks in and the affected data is duplicated: the main process modifies the original memory while the child writes the copy into the snapshot file. (This also illustrates RDB's weakness: if Redis crashes after the snapshot is written, all changes made in the main process since then are lost.)
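A minimal sketch of triggering a background snapshot from the command line and confirming it finished (assuming a local instance on the default port; output values are illustrative):
$ redis-cli BGSAVE
Background saving started
$ redis-cli LASTSAVE        # unix timestamp of the most recent successful save
(integer) 1700000042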
Advantages
- Compact binary file: small on disk, easy to back up and transfer, and fast to restore from
- Good performance; snapshotting does not block the main process
Disadvantages
- Data loss is likely (everything since the last snapshot)
- Forking the child process is slow and CPU-intensive when the dataset is large
AOF (Append Only File)
Overview
AOF persistence appends every write command to the file appendonly.aof; different fsync policies provide different levels of data safety.
The AOF file can grow very large, so Redis provides an AOF rewrite mechanism that periodically rewrites it into a new, smaller file. (For example, 100 INCR commands against a counter would add 100 entries to the AOF; the rewrite can replace them with a single SET.)
If Redis crashes while the AOF file is being written, the file may end up malformed and Redis will refuse to load it on restart. This can be handled as follows:
- Back up the existing AOF file
- Repair it with the redis-check-aof tool that ships with Redis
- Diff the two files and manually re-add any missing commands
Usage
Enable AOF in the configuration file:
# appendonly yes
Three policies control how the operation log is flushed to disk:
appendfsync always   # fsync after every command appended to the AOF; safest, slowest
appendfsync everysec # fsync once per second; on failure, at most one second of data is lost
appendfsync no       # let the operating system decide; fastest, least safe
Configure how often the AOF is automatically rewritten:
# auto-aof-rewrite-min-size 64mb     rewrite only once the AOF file is at least 64MB
# auto-aof-rewrite-percentage 100    rewrite again once the file has grown 100% since the last rewrite
Like RDB, an AOF rewrite also forks a child process to do the work (a manual trigger is sketched after this list):
- The parent forks a child process
- The child rewrites the old AOF file into a new AOF file
- Meanwhile, any new write command the parent executes is applied to memory and, per the configured AOF policy, appended to the end of the old AOF file, so the existing (old) file stays safe and correct even if the machine crashes during the rewrite
- When the child finishes the rewrite, it signals the parent
- On receiving the signal, the parent appends the commands buffered in memory to the end of the new AOF file
- The old AOF file is deleted
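A rewrite can also be triggered manually, regardless of the auto-rewrite thresholds; a minimal sketch (output values are illustrative):
$ redis-cli BGREWRITEAOF
Background append only file rewriting started
$ redis-cli INFO persistence | grep aof_rewrite_in_progress
aof_rewrite_in_progress:1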
Advantages
- Depending on the configured fsync policy, a higher level of data safety
- The file format is a readable protocol, so mistaken operations can be corrected by hand
Disadvantages
- For the same dataset, the AOF file is much larger than the RDB file
- Rebuilding the in-memory data on startup is slower
RDB + AOF
Overview
AOF's main drawbacks are large files on disk and slow recovery; the hybrid RDB+AOF mode addresses both. Writes are still recorded through AOF, keeping data safety high, but when the AOF file is rewritten, the current in-memory dataset is dumped as an RDB snapshot and the incremental AOF command log is appended after it, the two combined into a single file. On restart, Redis first loads the RDB part and then replays the AOF log, so the file is smaller and restart is much faster.
Usage
AOF must be enabled first, then configure:
# aof-use-rdb-preamble yes
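The setting can also be inspected or changed at runtime; a quick sketch:
$ redis-cli CONFIG GET aof-use-rdb-preamble
1) "aof-use-rdb-preamble"
2) "yes"
$ redis-cli CONFIG SET aof-use-rdb-preamble yes
OK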
Redis High Availability
Master-Slave Architecture
A master node is given one or more slave nodes, and the slaves continually try to bring their data into a consistent state with the master.
Before Redis 2.8, replication used the SYNC command, whose drawback is that any disconnection between master and slave forces a full resynchronization, which performs poorly. Since 2.8, the PSYNC command is used instead; it supports two modes, full synchronization and partial synchronization.
Full Synchronization
The first time a slave connects to a master, it sends a PSYNC command to request replication. On receiving PSYNC, the master generates an RDB snapshot with bgsave while buffering the write commands that keep arriving (in the replication buffer and the replication backlog). Once the snapshot is complete, the master sends the RDB file to the slave, the slave loads it into memory, and the master then sends the buffered write commands to the slave.
Partial Synchronization
Partial synchronization relies on three pieces:
- The master's offset and the slave's offset
- The master's replication backlog (note that this buffer is actually the same buffer used for AOF commands)
- The run ID
Both the master and the slave maintain a replication offset recording how far along the replication stream they are; when the two offsets are equal, their data is consistent.
The replication backlog is a fixed-length queue maintained by the master (1MB by default). It records the client write commands the master receives after it starts generating the RDB snapshot. When a slave connects, it sends the offset it has recorded, and the master uses that offset to decide how to synchronize:
- If the offset is still covered by the backlog, do a partial sync
- Otherwise, do a full sync
The run ID uniquely identifies a running Redis instance. On the first synchronization the master sends its run ID to the slave, which stores it; when the slave reconnects, it sends the stored run ID back, and the master compares it with its own (see the sketch after this list):
- If the IDs match, the slave previously replicated from this master, so a partial sync is possible
- If they differ, the slave previously replicated from a different master, so a full sync is required
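For reference, the offsets and replication ID can be checked with INFO replication, and the backlog size is set with repl-backlog-size; a rough sketch (field names as in recent Redis versions, values illustrative):
$ redis-cli INFO replication
role:master
master_replid:5f7c...
master_repl_offset:411
slave0:ip=127.0.0.1,port=6380,state=online,offset=411,lag=0

# in redis.conf
repl-backlog-size 1mb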
Source Analysis of the Replication Backlog (shared with the AOF buffer)
How the master creates the backlog:
/* ---------------------------------- MASTER -------------------------------- */
void createReplicationBacklog(void) {
serverAssert(server.repl_backlog == NULL);
server.repl_backlog = zmalloc(sizeof(replBacklog));
server.repl_backlog->ref_repl_buf_node = NULL;
server.repl_backlog->unindexed_count = 0;
server.repl_backlog->blocks_index = raxNew();
server.repl_backlog->histlen = 0;
/* We don't have any data inside our buffer, but virtually the first
* byte we have is the next byte that will be generated for the
* replication stream. */
server.repl_backlog->offset = server.master_repl_offset+1;
}
server.repl_backlog is the backlog; the field is declared as replBacklog *repl_backlog; and its structure is as follows:
/* Replication backlog is not separate memory, it just is one consumer of
* the global replication buffer. This structure records the reference of
* replication buffers. Since the replication buffer block list may be very long,
* it would cost much time to search replication offset on partial resync, so
* we use one rax tree to index some blocks every REPL_BACKLOG_INDEX_PER_BLOCKS
* to make searching offset from replication buffer blocks list faster. */
typedef struct replBacklog {
listNode *ref_repl_buf_node; /* Referenced node of replication buffer blocks,
* see the definition of replBufBlock. */
size_t unindexed_count; /* The count from last creating index block. */
rax *blocks_index; /* The index of recorded blocks of replication
* buffer for quickly searching replication
* offset on partial resynchronization. */
long long histlen; /* Backlog actual data length */
long long offset; /* Replication "master offset" of first
* byte in the replication backlog buffer.*/
} replBacklog;
/* Node, List, and Iterator are the only data structures used currently. */
typedef struct listNode {
struct listNode *prev;
struct listNode *next;
void *value;
} listNode;
As you can see, the buffer is actually maintained as a linked list of non-contiguous blocks (referenced through listNode *ref_repl_buf_node;), each node holding 16KB of data.
The comments show that the void *value pointer points to a replBufBlock structure:
/* Replication buffer blocks is the list of replBufBlock.
*
* +--------------+ +--------------+ +--------------+
* | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 |
* +--------------+ +--------------+ +--------------+
* | /
* | /
* | /
* Repl Backlog Replia_A Replia_B
*
* Each replica or replication backlog increments only the refcount of the
* 'ref_repl_buf_node' which it points to. So when replica walks to the next
* node, it should first increase the next node's refcount, and when we trim
* the replication buffer nodes, we remove node always from the head node which
* refcount is 0. If the refcount of the head node is not 0, we must stop
* trimming and never iterate the next node. */
/* Similar with 'clientReplyBlock', it is used for shared buffers between
* all replica clients and replication backlog. */
typedef struct replBufBlock {
int refcount; /* Number of replicas or repl backlog using. */
long long id; /* The unique incremental number. */
long long repl_offset; /* Start replication offset of the block. */
size_t size, used;
char buf[];
} replBufBlock;
When the replication backlog or a replica is using a block, it increments the refcount of the replBufBlock it currently points to. When a replica has finished with its current replBufBlock (its data has been sent to the replica), it decrements that block's refcount, moves on to the next replBufBlock, and increments that one's refcount.
Each replBufBlock node holds 16KB of data.
#define CONFIG_REPL_BACKLOG_MIN_SIZE (1024*16) /* 16k */
/* This function is called when the user modifies the replication backlog
* size at runtime. It is up to the function to resize the buffer and setup it
* so that it contains the same data as the previous one (possibly less data,
* but the most recent bytes, or the same data and more free space in case the
* buffer is enlarged). */
void resizeReplicationBacklog(void) {
if (server.repl_backlog_size < CONFIG_REPL_BACKLOG_MIN_SIZE)
server.repl_backlog_size = CONFIG_REPL_BACKLOG_MIN_SIZE;
if (server.repl_backlog)
incrementalTrimReplicationBacklog(REPL_BACKLOG_TRIM_BLOCKS_PER_CALL);
}
The buffer cannot grow without bound. Redis walks the replBufBlock list from the head, freeing blocks whose only remaining reference is the backlog itself, and stops as soon as it reaches a block that is still referenced by a replica.
/* Generally, we only have one replication buffer block to trim when replication
* backlog size exceeds our setting and no replica reference it. But if replica
* clients disconnect, we need to free many replication buffer blocks that are
* referenced. It would cost much time if there are a lots blocks to free, that
* will freeze server, so we trim replication backlog incrementally. */
void incrementalTrimReplicationBacklog(size_t max_blocks) {
serverAssert(server.repl_backlog != NULL);
size_t trimmed_blocks = 0;
while (server.repl_backlog->histlen > server.repl_backlog_size &&
trimmed_blocks < max_blocks)
{
/* We never trim backlog to less than one block. */
if (listLength(server.repl_buffer_blocks) <= 1) break;
/* Replicas increment the refcount of the first replication buffer block
* they refer to, in that case, we don't trim the backlog even if
* backlog_histlen exceeds backlog_size. This implicitly makes backlog
* bigger than our setting, but makes the master accept partial resync as
* much as possible. So that backlog must be the last reference of
* replication buffer blocks. */
listNode *first = listFirst(server.repl_buffer_blocks);
serverAssert(first == server.repl_backlog->ref_repl_buf_node);
replBufBlock *fo = listNodeValue(first);
if (fo->refcount != 1) break;
/* We don't try trim backlog if backlog valid size will be lessen than
* setting backlog size once we release the first repl buffer block. */
if (server.repl_backlog->histlen - (long long)fo->size <=
server.repl_backlog_size) break;
/* Decr refcount and release the first block later. */
fo->refcount--;
trimmed_blocks++;
server.repl_backlog->histlen -= fo->size;
/* Go to use next replication buffer block node. */
listNode *next = listNextNode(first);
server.repl_backlog->ref_repl_buf_node = next;
serverAssert(server.repl_backlog->ref_repl_buf_node != NULL);
/* Incr reference count to keep the new head node. */
((replBufBlock *)listNodeValue(next))->refcount++;
/* Remove the node in recorded blocks. */
uint64_t encoded_offset = htonu64(fo->repl_offset);
raxRemove(server.repl_backlog->blocks_index,
(unsigned char*)&encoded_offset, sizeof(uint64_t), NULL);
/* Delete the first node from global replication buffer. */
serverAssert(fo->refcount == 0 && fo->used == fo->size);
server.repl_buffer_mem -= (fo->size +
sizeof(listNode) + sizeof(replBufBlock));
listDelNode(server.repl_buffer_blocks, first);
}
/* Set the offset of the first byte we have in the backlog. */
server.repl_backlog->offset = server.master_repl_offset -
server.repl_backlog->histlen + 1;
}
This raises a question: if the buffer is kept in a linked list, with offsets recorded in the list nodes, how does the master find the node for a given offset when a slave asks to resume replication from it?
Redis uses the rax *blocks_index; structure (a radix tree, an optimized trie). (I have not fully digested the implementation yet and will dig into it when I find time; roughly speaking, it works like a skip-list lookup: an index entry is recorded every 64 blocks, so locating the block containing a given offset is very fast.)
/* Representation of a radix tree as implemented in this file, that contains
* the strings "foo", "foobar" and "footer" after the insertion of each
* word. When the node represents a key inside the radix tree, we write it
* between [], otherwise it is written between ().
*
* This is the vanilla representation:
*
* (f) ""
*
* (o) "f"
*
* (o) "fo"
*
* [t b] "foo"
* /
* "foot" (e) (a) "foob"
* /
* "foote" (r) (r) "fooba"
* /
* "footer" [] [] "foobar"
*
* However, this implementation implements a very common optimization where
* successive nodes having a single child are "compressed" into the node
* itself as a string of characters, each representing a next-level child,
* and only the link to the node representing the last character node is
* provided inside the representation. So the above representation is turned
* into:
*
* ["foo"] ""
* |
* [t b] "foo"
* /
* "foot" ("er") ("ar") "foob"
* /
* "footer" [] [] "foobar"
*
* However this optimization makes the implementation a bit more complex.
* For instance if a key "first" is added in the above radix tree, a
* "node splitting" operation is needed, since the "foo" prefix is no longer
* composed of nodes having a single child one after the other. This is the
* above tree and the resulting node splitting after this event happens:
*
*
* (f) ""
* /
* (i o) "f"
* /
* "firs" ("rst") (o) "fo"
* /
* "first" [] [t b] "foo"
* /
* "foot" ("er") ("ar") "foob"
* /
* "footer" [] [] "foobar"
*
* Similarly after deletion, if a new chain of nodes having a single child
* is created (the chain must also not include nodes that represent keys),
* it must be compressed back into a single node.
*
*/
#define RAX_NODE_MAX_SIZE ((1<<29)-1)
typedef struct raxNode {
uint32_t iskey:1; /* Does this node contain a key? */
uint32_t isnull:1; /* Associated value is NULL (don't store it). */
uint32_t iscompr:1; /* Node is compressed. */
uint32_t size:29; /* Number of children, or compressed string len. */
/* Data layout is as follows:
*
* If node is not compressed we have 'size' bytes, one for each children
* character, and 'size' raxNode pointers, point to each child node.
* Note how the character is not stored in the children but in the
* edge of the parents:
*
* [header iscompr=0][abc][a-ptr][b-ptr][c-ptr](value-ptr?)
*
* if node is compressed (iscompr bit is 1) the node has 1 children.
* In that case the 'size' bytes of the string stored immediately at
* the start of the data section, represent a sequence of successive
* nodes linked one after the other, for which only the last one in
* the sequence is actually represented as a node, and pointed to by
* the current compressed node.
*
* [header iscompr=1][xyz][z-ptr](value-ptr?)
*
* Both compressed and not compressed nodes can represent a key
* with associated data in the radix tree at any level (not just terminal
* nodes).
*
* If the node has an associated key (iskey=1) and is not NULL
* (isnull=0), then after the raxNode pointers pointing to the
* children, an additional value pointer is present (as you can see
* in the representation above as "value-ptr" field).
*/
unsigned char data[];
} raxNode;
typedef struct rax {
raxNode *head;
uint64_t numele;
uint64_t numnodes;
} rax;
Replication Storm
If there are many slave nodes, a replication storm (many slaves resynchronizing from the master at the same time and overloading it) can be mitigated by having some slaves replicate from other slaves instead of from the master, as sketched below.
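A sketch of such a chained topology (addresses are illustrative): a tier-1 replica follows the master, and the remaining replicas follow the tier-1 replica, so the master only has to feed one direct replica:
# redis.conf on the tier-1 replica
replicaof 10.0.0.1 6379

# redis.conf on each tier-2 replica
replicaof 10.0.0.11 6379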
Sentinel Architecture (to be completed)
Sentinels are special Redis services that do not serve reads or writes; they exist to monitor the Redis data nodes.
A Redis client asks a sentinel for the master's address on its first access and then talks to the master directly; when the master changes, the sentinels detect it, elect a new master, and notify the client. A minimal configuration sketch follows.
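A minimal sentinel.conf sketch (the master name, addresses, and quorum are illustrative):
port 26379
sentinel monitor mymaster 10.0.0.1 6379 2        # 2 sentinels must agree the master is down
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000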
Cluster (to be completed)
Redis Cluster is Redis's distributed database solution: data is shared across nodes through sharding, replication and failover are provided, and the cluster can scale horizontally.
Redis Cluster stores key-value pairs by sharding: the whole database is divided into 16384 slots, and the slot a given key belongs to is computed as follows:
/* -----------------------------------------------------------------------------
* Key space handling
* -------------------------------------------------------------------------- */
/* We have 16384 hash slots. The hash slot of a given key is obtained
* as the least significant 14 bits of the crc16 of the key.
*
* However if the key contains the {...} pattern, only the part between
* { and } is hashed. This may be useful in the future to force certain
* keys to be in the same node (assuming no resharding is in progress). */
unsigned int keyHashSlot(char *key, int keylen) {
int s, e; /* start-end indexes of { and } */
for (s = 0; s < keylen; s++)
if (key[s] == '{') break;
/* No '{' ? Hash the whole key. This is the base case. */
// 0x3FFF ---- 16383
if (s == keylen) return crc16(key,keylen) & 0x3FFF;
/* '{' found? Check if we have the corresponding '}'. */
for (e = s+1; e < keylen; e++)
if (key[e] == '}') break;
/* No '}' or nothing between {} ? Hash the whole key. */
if (e == keylen || e == s+1) return crc16(key,keylen) & 0x3FFF;
/* If we are here there is both a { and a } on its right. Hash
* what is in the middle between { and }. */
return crc16(key+s+1,e-s-1) & 0x3FFF;
}
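The same computation is exposed to clients through the CLUSTER KEYSLOT command; a sketch showing that keys sharing the same {...} hash tag land in the same slot (slot numbers are illustrative):
$ redis-cli CLUSTER KEYSLOT foo
(integer) 12182
$ redis-cli CLUSTER KEYSLOT "{user:1000}.name"       # only "user:1000" is hashed
(integer) 1649
$ redis-cli CLUSTER KEYSLOT "{user:1000}.orders"     # same tag, same slot
(integer) 1649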
Each cluster node records which slots it is responsible for:
typedef struct clusterNode {
mstime_t ctime; /* Node object creation time. */
char name[CLUSTER_NAMELEN]; /* Node name, hex string, sha1-size */
int flags; /* CLUSTER_NODE_... */
uint64_t configEpoch; /* Last configEpoch observed for this node */
// bitmap recording which slots this node handles
unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
uint16_t *slot_info_pairs; /* Slots info represented as (start/end) pair (consecutive index). */
int slot_info_pairs_count; /* Used number of slots in slot_info_pairs */
// number of slots this node handles
int numslots; /* Number of slots handled by this node */
int numslaves; /* Number of slave nodes, if this is a master */
struct clusterNode **slaves; /* pointers to slave nodes */
struct clusterNode *slaveof; /* pointer to the master node. Note that it
may be NULL even if the node is a slave
if we don't have the master node in our
tables. */
mstime_t ping_sent; /* Unix time we sent latest ping */
mstime_t pong_received; /* Unix time we received the pong */
mstime_t data_received; /* Unix time we received any data */
mstime_t fail_time; /* Unix time when FAIL flag was set */
mstime_t voted_time; /* Last time we voted for a slave of this master */
mstime_t repl_offset_time; /* Unix time we received offset for this node */
mstime_t orphaned_time; /* Starting time of orphaned master condition */
long long repl_offset; /* Last known repl offset for this node. */
char ip[NET_IP_STR_LEN]; /* Latest known IP address of this node */
sds hostname; /* The known hostname for this node */
int port; /* Latest known clients port (TLS or plain). */
int pport; /* Latest known clients plaintext port. Only used
if the main clients port is for TLS. */
int cport; /* Latest known cluster port of this node. */
clusterLink *link; /* TCP/IP link established toward this node */
clusterLink *inbound_link; /* TCP/IP link accepted from this node */
list *fail_reports; /* List of nodes signaling this as failing */
} clusterNode;
When a client sends a key-related command to a node, the receiving node computes which slot the key belongs to and checks whether it serves that slot:
- If the slot is served by this node, the command is executed
- If not, the node returns an error (a MOVED redirection naming the node that does serve the slot) and the client re-issues the command to the correct node; see the example below
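From a plain (non-cluster-aware) connection the redirection looks roughly like this (slot and address are illustrative); redis-cli started with -c follows the redirect automatically:
127.0.0.1:6379> GET user:1000
(error) MOVED 1649 127.0.0.1:6381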
Now let's look at the cluster state structure:
typedef struct clusterState {
clusterNode *myself; /* This node */
uint64_t currentEpoch;
int state; /* CLUSTER_OK, CLUSTER_FAIL, ... */
int size; /* Num of master nodes with at least one slot */
dict *nodes; /* Hash table of name -> clusterNode structures */
dict *nodes_black_list; /* Nodes we don't re-add for a few seconds. */
// maps each slot to the node serving it (a clusterNode holds IP, port, and other info)
clusterNode *migrating_slots_to[CLUSTER_SLOTS];
clusterNode *importing_slots_from[CLUSTER_SLOTS];
clusterNode *slots[CLUSTER_SLOTS];
rax *slots_to_channels;
/* The following fields are used to take the slave state on elections. */
mstime_t failover_auth_time; /* Time of previous or next election. */
int failover_auth_count; /* Number of votes received so far. */
int failover_auth_sent; /* True if we already asked for votes. */
int failover_auth_rank; /* This slave rank for current auth request. */
uint64_t failover_auth_epoch; /* Epoch of the current election. */
int cant_failover_reason; /* Why a slave is currently not able to
failover. See the CANT_FAILOVER_* macros. */
/* Manual failover state in common. */
mstime_t mf_end; /* Manual failover time limit (ms unixtime).
It is zero if there is no MF in progress. */
/* Manual failover state of master. */
clusterNode *mf_slave; /* Slave performing the manual failover. */
/* Manual failover state of slave. */
long long mf_master_offset; /* Master offset the slave needs to start MF
or -1 if still not received. */
int mf_can_start; /* If non-zero signal that the manual failover
can start requesting masters vote. */
/* The following fields are used by masters to take state on elections. */
uint64_t lastVoteEpoch; /* Epoch of the last vote granted. */
int todo_before_sleep; /* Things to do in clusterBeforeSleep(). */
/* Stats */
/* Messages received and sent by type. */
long long stats_bus_messages_sent[CLUSTERMSG_TYPE_COUNT];
long long stats_bus_messages_received[CLUSTERMSG_TYPE_COUNT];
long long stats_pfail_nodes; /* Number of nodes in PFAIL status,
excluding nodes without address. */
unsigned long long stat_cluster_links_buffer_limit_exceeded; /* Total number of cluster links freed due to exceeding buffer limit */
} clusterState;