Reids 持久化

内存存储基于性能方面和数据安全等级方面考量，一般会有两种形式持久化，一种是快照，另一种是日志记录。

RDB（Redis Database）

简介

RDB持久化是指在指定的间隔时间对内存中的数据进行快照存储。在默认情况下，Redis将内存数据保存在名为dump.rdb的二进制 文件中。

使用

配置文件中配置生成快照策略（如果没有任何配置，则不使用RDB）：

# 60s内至少有1000个键改动
# save 60 1000

也可以手动执行：

save，同步执行，阻塞Redis其他命令，不需要额外内存
bgsave，异步执行，不会阻塞Redis其他命令，需要额外内存

bgsave命令借助操作系统的写时复制（Copy on write）机制，在生成快照的同时，依然可以正常处理写命令。bgsave子线程由主线程fork，子线程共享主线程的内存数据，当两个线程都是读操作时，互不影响，如果主线程有写命令，则触发COW，数据会被复制一份生成副本，主线程修改主内存数据，子线程把数据副本写入文件。（这个机制也体现了RDB的缺点，当写入快照后如果Redis宕机，主线程中对数据的操作将丢失）

优点

二进制文件，占用小，容易备份，方便传输，数据恢复时速度快
性能好，不影响主线程

缺点

容易丢数据
fork子进程在数据集比较大时很耗时，占用CPU

AOF（Append Only File）

简介

AOF持久化方式会将修改的命令记录到文件appendonly.aof中，通过不同的策略，可以保证数据安全等级。

AOF文件可能会变得很大，Redis提供了重写AOF文件的机制，可以定期重写文件生成新的AOF文件。（例如对一个计数器执行100次INCR命令，AOF文件需要记录100条操作记录，可以通过一条SET指令进行代替）

如果Redis在对AOF文件正在写入时宕机，造成AOF文件格式出现问题，Redis重启时会拒绝载入文件，可以这样解决：

备份现有AOF
使用Redis自带的redis-check-aof程序修复AOF文件
比对两个文件，手动新增命令等

使用

修改配置启用AOF：

# appendonly yes

有三个策略用于控制把操作日志同步刷新到磁盘：

appendfsync always # 每次有新命令追加到AOF时就执行一次，安全性最高，最慢
appendfsync everysec # 每秒同步一次，如果故障，丢失一秒数据
appendfsync no # 操作系统处理，更快，数据安全性最低

配置AOF自动重写频率：

# auto-aof-rewrite-min-size 64mb AOF文件大小至少达到64M才重写
# auto-aof-rewrite-percentage 100 AOF文件自从上次重写后文件大小增长了100%再次触发重写

AOF重写跟RDB一样，也会fork出一个子进程工作：

fork子进程
子进程重写旧AOF文件到新AOF文件
主进程如果执行了新写入命令，除了写入内存，根据配置的AOF策略，把改动日志追加到旧AOF文件末尾，这样做可以保证如果重写AOF文件机器宕机，现有的（旧AOF文件）文件也是安全正确的
子进程重写完成后，给父进程发送信号
父进程接收到信号后，把内存缓存数据追加到新AOF文件末尾
删除旧AOF文件

优点

根据配置的AOF策略使数据安全等级更高
文件协议可读，可以手动订正错误操作

缺点

相同体量的数据集，AOF文件远大于RDB
Redis启动时恢复内存数据速度慢

RDB + AOF

简介

AOF的缺点主要是物理空间文件体积大，恢复速度慢，采用RDB+AOF的方式可以很好得解决这个问题。数据记录采用AOF，保证数据的高安全性，在重写AOF文件时，把当前内存做RDB快照处理，附加增量AOF操作日志，两者合为一个文件。在Redis重启恢复数据时，先加载RDB内容，再redo AOF日志，如此AOF文件体积减小，重启效率大大提升。

使用

必须先开启AOF，再配置：

# aof-use-rdb-preamble yes

Redis高可用

主从架构

为master节点配置一个或多个slave节点，slave节点总是尝试与master节点达到数据一致状态。

在Redis 2.8以前，主从复制使用SYNC指令，该指令的一个缺点是如果主从断联，会进行全量复制，性能不高。2.8以后，使用PSYNC指令，具有两种模式，全量同步和部分同步。

图示:

全量同步

slave第一次连上master时，发送PSYNC指令请求复制数据，master收到PSYNC指令后，通过bgsave生成RDB快照，同时缓存新修改命令（主缓冲区和复制缓冲区）。当持久化完成后，master会把REB文件发送给slave，slave加载RDB数据到内存中，master再将之前缓存的修改命令发送给slave。

部分同步

实现部分同步需要三个部分：

master的offset和slave的offset
master的复制缓冲区（这里需要注意一点，这个缓冲区与AOF命令缓冲其实是同一个缓冲区）
运行ID

执行复制的master和slave都会维护一个偏移量，用于记录当前同步的位置，当master和slave的偏移量一致时，说明数据一致。

复制缓冲区是master维护的固定长度的队列（默认为1MB），该缓冲区会记录master开始生成RDB快照后接收的客户端写命令，当slave连接到master时，会发送自己维护的offset，master根据这个offset决定如何同步：

offset在复制缓冲区中存在，部分同步
不存在，全量同步

运行ID是Redis运行唯一的标识ID，当slave和master第一次同步时，master会把自己的运行ID发送给slave进行保存。当slave重新连接到master时，会携带之前保存的master的运行ID。master会根据slave携带的运行ID进行判断：

如果ID相同，则说明slave之前同步的就是该master的数据，可以进行部分同步
如果ID不相同，说明slave之前同步的不是该master的数据，进行全量同步

主从复制复制缓冲区（与AOF使用同一份缓冲区）源码分析

master 创建缓冲区

/* ---------------------------------- MASTER -------------------------------- */

void createReplicationBacklog(void) {
    serverAssert(server.repl_backlog == NULL);
    server.repl_backlog = zmalloc(sizeof(replBacklog));
    server.repl_backlog->ref_repl_buf_node = NULL;
    server.repl_backlog->unindexed_count = 0;
    server.repl_backlog->blocks_index = raxNew();
    server.repl_backlog->histlen = 0;
    /* We don't have any data inside our buffer, but virtually the first
     * byte we have is the next byte that will be generated for the
     * replication stream. */
    server.repl_backlog->offset = server.master_repl_offset+1;
}

sever.repl_backlog即为缓冲区，replBacklog *repl_backlog;

结构如下：

/* Replication backlog is not separate memory, it just is one consumer of
 * the global replication buffer. This structure records the reference of
 * replication buffers. Since the replication buffer block list may be very long,
 * it would cost much time to search replication offset on partial resync, so
 * we use one rax tree to index some blocks every REPL_BACKLOG_INDEX_PER_BLOCKS
 * to make searching offset from replication buffer blocks list faster. */
typedef struct replBacklog {
    listNode *ref_repl_buf_node; /* Referenced node of replication buffer blocks,
                                  * see the definition of replBufBlock. */
    size_t unindexed_count;      /* The count from last creating index block. */
    rax *blocks_index;           /* The index of recorded blocks of replication
                                  * buffer for quickly searching replication
                                  * offset on partial resynchronization. */
    long long histlen;           /* Backlog actual data length */
    long long offset;            /* Replication "master offset" of first
                                  * byte in the replication backlog buffer.*/
} replBacklog;

/* Node, List, and Iterator are the only data structures used currently. */

typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

可以看到缓冲区实际上是通过一个不连续的分区的链表维护的listNode *ref_repl_buf_node;，分割成每个大小为16K的节点。

查看注释可以找到void *value指针指向的是replBufBlock结构体类型，看下结构：

/* Replication buffer blocks is the list of replBufBlock.
 *
 * +--------------+       +--------------+       +--------------+
 * | refcount = 1 |  ...  | refcount = 0 |  ...  | refcount = 2 |
 * +--------------+       +--------------+       +--------------+
 *      |                                            /       
 *      |                                           /         
 *      |                                          /           
 *  Repl Backlog                               Replia_A      Replia_B
 * 
 * Each replica or replication backlog increments only the refcount of the
 * 'ref_repl_buf_node' which it points to. So when replica walks to the next
 * node, it should first increase the next node's refcount, and when we trim
 * the replication buffer nodes, we remove node always from the head node which
 * refcount is 0. If the refcount of the head node is not 0, we must stop
 * trimming and never iterate the next node. */

/* Similar with 'clientReplyBlock', it is used for shared buffers between
 * all replica clients and replication backlog. */
typedef struct replBufBlock {
    int refcount;           /* Number of replicas or repl backlog using. */
    long long id;           /* The unique incremental number. */
    long long repl_offset;  /* Start replication offset of the block. */
    size_t size, used;
    char buf[];
} replBufBlock;

当 ReplicatonBacklog 或从库对某个 block 使用时，便对正在使用的 replBufBlock 增加引用计数，当从库使用完当前的 replBufBlock（已经将数据发送给从库）时，就会对其 refcount 减 1 而且移动到下一个 replBufBlock，并对其 refcount 加 1。

每个replBufBloc节点的数据为16K。

#define CONFIG_REPL_BACKLOG_MIN_SIZE (1024*16)          /* 16k */

/* This function is called when the user modifies the replication backlog
 * size at runtime. It is up to the function to resize the buffer and setup it
 * so that it contains the same data as the previous one (possibly less data,
 * but the most recent bytes, or the same data and more free space in case the
 * buffer is enlarged). */
void resizeReplicationBacklog(void) {
    if (server.repl_backlog_size < CONFIG_REPL_BACKLOG_MIN_SIZE)
        server.repl_backlog_size = CONFIG_REPL_BACKLOG_MIN_SIZE;
    if (server.repl_backlog)
        incrementalTrimReplicationBacklog(REPL_BACKLOG_TRIM_BLOCKS_PER_CALL);
}

缓冲区是不可能无限增长的，Redis会从头访问replBufBlock链表，如果refcount为0，就释放空间，直到查找到第一个不为0才停止。

/* Generally, we only have one replication buffer block to trim when replication
 * backlog size exceeds our setting and no replica reference it. But if replica
 * clients disconnect, we need to free many replication buffer blocks that are
 * referenced. It would cost much time if there are a lots blocks to free, that
 * will freeze server, so we trim replication backlog incrementally. */
void incrementalTrimReplicationBacklog(size_t max_blocks) {
    serverAssert(server.repl_backlog != NULL);

    size_t trimmed_blocks = 0;
    while (server.repl_backlog->histlen > server.repl_backlog_size &&
           trimmed_blocks < max_blocks)
    {
        /* We never trim backlog to less than one block. */
        if (listLength(server.repl_buffer_blocks) <= 1) break;

        /* Replicas increment the refcount of the first replication buffer block
         * they refer to, in that case, we don't trim the backlog even if
         * backlog_histlen exceeds backlog_size. This implicitly makes backlog
         * bigger than our setting, but makes the master accept partial resync as
         * much as possible. So that backlog must be the last reference of
         * replication buffer blocks. */
        listNode *first = listFirst(server.repl_buffer_blocks);
        serverAssert(first == server.repl_backlog->ref_repl_buf_node);
        replBufBlock *fo = listNodeValue(first);
        if (fo->refcount != 1) break;

        /* We don't try trim backlog if backlog valid size will be lessen than
         * setting backlog size once we release the first repl buffer block. */
        if (server.repl_backlog->histlen - (long long)fo->size <=
            server.repl_backlog_size) break;

        /* Decr refcount and release the first block later. */
        fo->refcount--;
        trimmed_blocks++;
        server.repl_backlog->histlen -= fo->size;

        /* Go to use next replication buffer block node. */
        listNode *next = listNextNode(first);
        server.repl_backlog->ref_repl_buf_node = next;
        serverAssert(server.repl_backlog->ref_repl_buf_node != NULL);
        /* Incr reference count to keep the new head node. */
        ((replBufBlock *)listNodeValue(next))->refcount++;

        /* Remove the node in recorded blocks. */
        uint64_t encoded_offset = htonu64(fo->repl_offset);
        raxRemove(server.repl_backlog->blocks_index,
            (unsigned char*)&encoded_offset, sizeof(uint64_t), NULL);

        /* Delete the first node from global replication buffer. */
        serverAssert(fo->refcount == 0 && fo->used == fo->size);
        server.repl_buffer_mem -= (fo->size +
            sizeof(listNode) + sizeof(replBufBlock));
        listDelNode(server.repl_buffer_blocks, first);
    }

    /* Set the offset of the first byte we have in the backlog. */
    server.repl_backlog->offset = server.master_repl_offset -
                              server.repl_backlog->histlen + 1;
}

这里就有一个问题了，如果通过链表维护缓冲区，而offset记录在链表节点中，slave需要复制指定offset开始后的数据如何检索链表？

Redis采用了rax *blocks_index;结构（升级版字典树）(我没看懂，有时间了再研究下，简单来说类似于跳表查找，提前每64个记录一个索引点，查询效率非常高）：

/* Representation of a radix tree as implemented in this file, that contains
 * the strings "foo", "foobar" and "footer" after the insertion of each
 * word. When the node represents a key inside the radix tree, we write it
 * between [], otherwise it is written between ().
 *
 * This is the vanilla representation:
 *
 *              (f) ""
 *                
 *                (o) "f"
 *                  
 *                  (o) "fo"
 *                    
 *                  [t   b] "foo"
 *                  /     
 *         "foot" (e)     (a) "foob"
 *                /         
 *      "foote" (r)         (r) "fooba"
 *              /             
 *    "footer" []             [] "foobar"
 *
 * However, this implementation implements a very common optimization where
 * successive nodes having a single child are "compressed" into the node
 * itself as a string of characters, each representing a next-level child,
 * and only the link to the node representing the last character node is
 * provided inside the representation. So the above representation is turned
 * into:
 *
 *                  ["foo"] ""
 *                     |
 *                  [t   b] "foo"
 *                  /     
 *        "foot" ("er")    ("ar") "foob"
 *                 /          
 *       "footer" []          [] "foobar"
 *
 * However this optimization makes the implementation a bit more complex.
 * For instance if a key "first" is added in the above radix tree, a
 * "node splitting" operation is needed, since the "foo" prefix is no longer
 * composed of nodes having a single child one after the other. This is the
 * above tree and the resulting node splitting after this event happens:
 *
 *
 *                    (f) ""
 *                    /
 *                 (i o) "f"
 *                 /   
 *    "firs"  ("rst")  (o) "fo"
 *              /        
 *    "first" []       [t   b] "foo"
 *                     /     
 *           "foot" ("er")    ("ar") "foob"
 *                    /          
 *          "footer" []          [] "foobar"
 *
 * Similarly after deletion, if a new chain of nodes having a single child
 * is created (the chain must also not include nodes that represent keys),
 * it must be compressed back into a single node.
 *
 */

#define RAX_NODE_MAX_SIZE ((1<<29)-1)
typedef struct raxNode {
    uint32_t iskey:1;     /* Does this node contain a key? */
    uint32_t isnull:1;    /* Associated value is NULL (don't store it). */
    uint32_t iscompr:1;   /* Node is compressed. */
    uint32_t size:29;     /* Number of children, or compressed string len. */
    /* Data layout is as follows:
     *
     * If node is not compressed we have 'size' bytes, one for each children
     * character, and 'size' raxNode pointers, point to each child node.
     * Note how the character is not stored in the children but in the
     * edge of the parents:
     *
     * [header iscompr=0][abc][a-ptr][b-ptr][c-ptr](value-ptr?)
     *
     * if node is compressed (iscompr bit is 1) the node has 1 children.
     * In that case the 'size' bytes of the string stored immediately at
     * the start of the data section, represent a sequence of successive
     * nodes linked one after the other, for which only the last one in
     * the sequence is actually represented as a node, and pointed to by
     * the current compressed node.
     *
     * [header iscompr=1][xyz][z-ptr](value-ptr?)
     *
     * Both compressed and not compressed nodes can represent a key
     * with associated data in the radix tree at any level (not just terminal
     * nodes).
     *
     * If the node has an associated key (iskey=1) and is not NULL
     * (isnull=0), then after the raxNode pointers pointing to the
     * children, an additional value pointer is present (as you can see
     * in the representation above as "value-ptr" field).
     */
    unsigned char data[];
} raxNode;

typedef struct rax {
    raxNode *head;
    uint64_t numele;
    uint64_t numnodes;
} rax;

主从复制风暴

如果从节点很多，为了缓解主从复制风暴（多个节点同时复制主节点导致主节点压力过大）可以让部分从节点与从节点之间同步数据。

哨兵架构（待完善）

哨兵是特殊的Redis服务，不提供读写服务，主要用来监控Redis实例节点。

Redis client第一次访问哨兵获取Redis主节点地址，后续直接访问Redis主节点，当Redis主节点发生变化，哨兵会感知并重新选举master并通知给client。

图示：

集群（待完善）

Redis集群是Redis提供的分布式数据库方案，通过分片进行数据共享，并提供复制和故障转移功能，并且可以水平扩展。

图示：

Redis集群是通过分片的方式来保存数据库中的键值对，整个数据库被分为16384个槽（slot），如何计算key属于哪个槽：

/* -----------------------------------------------------------------------------
 * Key space handling
 * -------------------------------------------------------------------------- */

/* We have 16384 hash slots. The hash slot of a given key is obtained
 * as the least significant 14 bits of the crc16 of the key.
 *
 * However if the key contains the {...} pattern, only the part between
 * { and } is hashed. This may be useful in the future to force certain
 * keys to be in the same node (assuming no resharding is in progress). */
unsigned int keyHashSlot(char *key, int keylen) {
    int s, e; /* start-end indexes of { and } */

    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;

    /* No '{' ? Hash the whole key. This is the base case. */
    // 0x3FFF ---- 16383
    if (s == keylen) return crc16(key,keylen) & 0x3FFF;

    /* '{' found? Check if we have the corresponding '}'. */
    for (e = s+1; e < keylen; e++)
        if (key[e] == '}') break;

    /* No '}' or nothing between {} ? Hash the whole key. */
    if (e == keylen || e == s+1) return crc16(key,keylen) & 0x3FFF;

    /* If we are here there is both a { and a } on its right. Hash
     * what is in the middle between { and }. */
    return crc16(key+s+1,e-s-1) & 0x3FFF;
}

集群节点记录了哪些槽是本节点处理的：

typedef struct clusterNode {
    mstime_t ctime; /* Node object creation time. */
    char name[CLUSTER_NAMELEN]; /* Node name, hex string, sha1-size */
    int flags;      /* CLUSTER_NODE_... */
    uint64_t configEpoch; /* Last configEpoch observed for this node */
    // 记录哪些槽被本节点处理
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    uint16_t *slot_info_pairs; /* Slots info represented as (start/end) pair (consecutive index). */
    int slot_info_pairs_count; /* Used number of slots in slot_info_pairs */
    // 本节点处理多少槽
    int numslots;   /* Number of slots handled by this node */
    int numslaves;  /* Number of slave nodes, if this is a master */
    struct clusterNode **slaves; /* pointers to slave nodes */
    struct clusterNode *slaveof; /* pointer to the master node. Note that it
                                    may be NULL even if the node is a slave
                                    if we don't have the master node in our
                                    tables. */
    mstime_t ping_sent;      /* Unix time we sent latest ping */
    mstime_t pong_received;  /* Unix time we received the pong */
    mstime_t data_received;  /* Unix time we received any data */
    mstime_t fail_time;      /* Unix time when FAIL flag was set */
    mstime_t voted_time;     /* Last time we voted for a slave of this master */
    mstime_t repl_offset_time;  /* Unix time we received offset for this node */
    mstime_t orphaned_time;     /* Starting time of orphaned master condition */
    long long repl_offset;      /* Last known repl offset for this node. */
    char ip[NET_IP_STR_LEN];    /* Latest known IP address of this node */
    sds hostname;               /* The known hostname for this node */
    int port;                   /* Latest known clients port (TLS or plain). */
    int pport;                  /* Latest known clients plaintext port. Only used
                                   if the main clients port is for TLS. */
    int cport;                  /* Latest known cluster port of this node. */
    clusterLink *link;          /* TCP/IP link established toward this node */
    clusterLink *inbound_link;  /* TCP/IP link accepted from this node */
    list *fail_reports;         /* List of nodes signaling this as failing */
} clusterNode;

当客户端向节点发送键相关命令时，接收命令的节点会计算命令要处理的键属于哪个槽，并检查自身是否处理这个槽：

如果这个槽由自己处理，则执行命令
如果这个槽不是自己处理的，返回错误信息（包含处理槽处于哪个节点），客户端重新请求正确的节点

查看下节点信息的结构：

typedef struct clusterState {
    clusterNode *myself;  /* This node */
    uint64_t currentEpoch;
    int state;            /* CLUSTER_OK, CLUSTER_FAIL, ... */
    int size;             /* Num of master nodes with at least one slot */
    dict *nodes;          /* Hash table of name -> clusterNode structures */
    dict *nodes_black_list; /* Nodes we don't re-add for a few seconds. */
    // 记录槽指向的节点信息，节点信息包含了IP、端口等信息
    clusterNode *migrating_slots_to[CLUSTER_SLOTS];
    clusterNode *importing_slots_from[CLUSTER_SLOTS];
    clusterNode *slots[CLUSTER_SLOTS];
    rax *slots_to_channels;
    /* The following fields are used to take the slave state on elections. */
    mstime_t failover_auth_time; /* Time of previous or next election. */
    int failover_auth_count;    /* Number of votes received so far. */
    int failover_auth_sent;     /* True if we already asked for votes. */
    int failover_auth_rank;     /* This slave rank for current auth request. */
    uint64_t failover_auth_epoch; /* Epoch of the current election. */
    int cant_failover_reason;   /* Why a slave is currently not able to
                                   failover. See the CANT_FAILOVER_* macros. */
    /* Manual failover state in common. */
    mstime_t mf_end;            /* Manual failover time limit (ms unixtime).
                                   It is zero if there is no MF in progress. */
    /* Manual failover state of master. */
    clusterNode *mf_slave;      /* Slave performing the manual failover. */
    /* Manual failover state of slave. */
    long long mf_master_offset; /* Master offset the slave needs to start MF
                                   or -1 if still not received. */
    int mf_can_start;           /* If non-zero signal that the manual failover
                                   can start requesting masters vote. */
    /* The following fields are used by masters to take state on elections. */
    uint64_t lastVoteEpoch;     /* Epoch of the last vote granted. */
    int todo_before_sleep; /* Things to do in clusterBeforeSleep(). */
    /* Stats */
    /* Messages received and sent by type. */
    long long stats_bus_messages_sent[CLUSTERMSG_TYPE_COUNT];
    long long stats_bus_messages_received[CLUSTERMSG_TYPE_COUNT];
    long long stats_pfail_nodes;    /* Number of nodes in PFAIL status,
                                       excluding nodes without address. */
    unsigned long long stat_cluster_links_buffer_limit_exceeded;  /* Total number of cluster links freed due to exceeding buffer limit */
} clusterState;

参考资料

Redis官方文档
Linux写时复制机制原理
Redis专题：一文搞懂主从复制原理
Redis 7.0共享复制缓冲区的设计与实现
Redis(2)——数据结构之RAX

原文始发于微信公众号（erpang coding）：Redis持久化和高可用

文章由极客之音整理，本文链接：https://www.bmabk.com/index.php/post/37383.html

Redis持久化和高可用

目录

Reids 持久化

RDB（Redis Database）

简介

使用

优点

缺点

AOF（Append Only File）

简介

使用

优点

缺点

RDB + AOF

简介

使用

Redis高可用

主从架构

全量同步

部分同步

主从复制复制缓冲区（与AOF使用同一份缓冲区）源码分析

主从复制风暴

哨兵架构（待完善）

集群（待完善）

参考资料

发表回复

Redis持久化和高可用

目录

Reids 持久化

RDB（Redis Database）

简介

使用

优点

缺点

AOF（Append Only File）

简介

使用

优点

缺点

RDB + AOF

简介

使用

Redis高可用

主从架构

全量同步

部分同步

主从复制复制缓冲区（与AOF使用同一份缓冲区）源码分析

主从复制风暴

哨兵架构（待完善）

集群（待完善）

参考资料

相关推荐

发表回复

分享到: