Redis 创始人开源极短消息压缩算法：SMAZ2

注意：此库不再与旧版本Smaz兼容（旧版本库中仍然可用）。它经过重新设计，对不可压缩的输入更具抵抗力（它很少放大输入文本，几乎从来没有）。它也比以前压缩得更好。

动机

LoRa网络的带宽非常有限，每条消息都需要很长的通道时间才能传输（通常是几秒钟！）。当LoRa用于在人类之间发送消息时，一种压缩形式以合理的方式提高了信道利用率。

此压缩方案旨在压缩内存极度受限的设备中的小消息，例如运行MicroPython的ESP 32设备。其基本思想是使用预先计算的二元组和单词表来更有效地编码短消息，总RAM使用量小于2kbytes。

单词表由256个单词组成。短字（len小于4字节）不存在，因为它们更好地编码为二元组。这是256个单词的完整列表：

"that", "this", "with", "from", "your", "have", "more", "will", 
"home", "about", "page", "search", "free", "other", "information", 
"time", "they", "site", "what", "which", "their", "news", "there", 
"only", "when", "contact", "here", "business", "also", "help", 
"view", "online", "first", "been", "would", "were", "services", 
"some", "these", "click", "like", "service", "than", "find", 
"price", "date", "back", "people", "list", "name", "just", 
"over", "state", "year", "into", "email", "health", "world", 
"next", "used", "work", "last", "most", "products", "music", "data", 
"make", "them", "should", "product", "system", "post", "city", 
"policy", "number", "such", "please", "available", "copyright", 
"support", "message", "after", "best", "software", "then", "good", 
"video", "well", "where", "info", "rights", "public", "books", 
"high", "school", "through", "each", "links", "review", "years", 
"order", "very", "privacy", "book", "items", "company", "read", 
"group", "need", "many", "user", "said", "does", "under", "general", 
"research", "university", "january", "mail", "full", "reviews", 
"program", "life", "know", "games", "days", "management", "part", 
"could", "great", "united", "hotel", "real", "item", "international", 
"center", "ebay", "must", "store", "travel", "comments", "made", 
"development", "report", "member", "details", "line", "terms", 
"before", "hotels", "send", "right", "type", "because", "local", 
"those", "using", "results", "office", "education", "national", 
"design", "take", "posted", "internet", "address", "community", 
"within", "states", "area", "want", "phone", "shipping", "reserved", 
"subject", "between", "forum", "family", "long", "based", "code", 
"show", "even", "black", "check", "special", "prices", "website", 
"index", "being", "women", "much", "sign", "file", "link", "open", 
"today", "technology", "south", "case", "project", "same", "pages", 
"version", "section", "found", "sports", "house", "related", 
"security", "both", "county", "american", "photo", "game", "members", 
"power", "while", "care", "network", "down", "computer", "systems", 
"three", "total", "place", "following", "download", "without", 
"access", "think", "north", "resources", "current", "posts", "media", 
"control", "water", "history", "pictures", "size", "personal", 
"since", "including", "guide", "shop", "directory", "board", 
"location", "change", "white", "text", "small", "rating", "rate", 
"government"

如果没有找到匹配的单词，则使用二元语法表。该表由最常见的128个按频率排列的二元组组成，总共256个字节：

intherreheanonesorteattistenntartondalitseediseangoule
comeneriroderaioicliofasetvetasihamaecomceelllcaurla
chhidihofonsotacnarssoprrtsassusnoiltsemctgeloeebet
rnipeiepancpooldaadviunamutwimoshyoaiewowosfiepttmi
opiaweagsuiddoooirspplscaywaigeirylytuulivimabty

当连一个匹配的bigram都找不到时，值为0或在9到127范围内的字节可以用一个字节编码（例如所有ASCII字符，符号，数字…）。字节值可以保持原样。

对于范围为1到8和128到255的字节，将生成转义序列，并发出1到5个逐字字节。值为6，7，8的字符串用作特殊转义，从表中发出一个字。使用值6。

编码

编码的工作原理

值为128到255的字节编码ID为0到127的二元组。
值为0或从9到127的字节就是它本身。
值为6的字节后面跟着一个表示要发出的字ID的字节。
值为7的字节类似于6，但在单词之后会发出一个空格。
值为8的字节类似于6，但在单词之前会发出一个空格。
值为1到5的字节意味着后面跟着1到5个逐字字节。

这意味着这种压缩方案只在使用逐字字节时才会使用比输入字符串更多的空间，也就是说，当字符串包含特殊或Unicode字符时，这些字符的字节值恰好在1到8之间：这在实践中是非常罕见的情况，即使发生这种情况，它也几乎总是由其他单词补偿。

只要消息是具有共同统计属性的拉丁字母自然语言消息，程序就永远不会使用比所需更多的空间，并且通常能够将单词压缩到更少的字节。然而，使用这种方案的程序可能在报头中有一个一位的标志，以表明消息是否被压缩，因此每次结果都大于未压缩的消息时，不能使用压缩来传输消息。

实现了真实的压缩

./smaz2 “The program is designed to work well with English text”

压缩长度：44.44%

./smaz2 “As long as the messages are latin letters natural language messages with common statistical properties, the program will only seldom use more space than needed”

压缩长度：54.72%

./smaz2 “Anche se in maniera meno efficiente, questo algoritmo di compressione è in grado di comprimere testi in altre lingue.”

压缩长度：66.95%

实施方式

在这个仓库中，你会发现一个C和一个Python实现。该实现针对空间（RAM和可执行代码）进行了优化，而不是针对速度，因为大多数用例很少使用它，只有在发送短消息时才会使用它。因此，该算法在每个字符串位置扫描表，这通常是非常昂贵的，但对于这个库来说仍然足够了。

原文始发于微信公众号（开源技术小栈）：Redis 创始人开源极短消息压缩算法：SMAZ2

文章由极客之音整理，本文链接：https://www.bmabk.com/index.php/post/247769.html

Redis 创始人开源极短消息压缩算法：SMAZ2

动机

编码

实现了真实的压缩

实施方式

相关推荐

发表回复

分享到: