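From the official website we learn that:
- Kafka is an event streaming platform that can read and write data, retain data for a configurable period, and support different consumption strategies (real-time consumption and re-consumption).
- Kafka consists of servers and clients, which communicate over TCP.
- Kafka's main concepts are producers, consumers, topics, and partitions. Partitions provide distributed storage, fault tolerance, and better performance.
- Kafka's APIs can be used to produce, consume, and process data.
- Kafka is typically used for buffering/peak shaving, decoupling, and asynchronous communication.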
1. Apache Kafka® is an event streaming platform. What does that mean?
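Kafka is an event streaming platform. It has three capabilities:
- Publish (write) and subscribe to (read) streams of events, including continuously importing/exporting data from other systems.
- Store Kafka data for a configured retention period.
- Consume data in real time as it is produced, or re-consume data that has already been produced.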
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:
- To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- To store streams of events durably and reliably for as long as you want.
- To process streams of events as they occur or retrospectively.
2. How does Kafka work in a nutshell?
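Kafka is a distributed system consisting of servers and clients that communicate over the TCP protocol.
Servers:
A Kafka cluster runs one or more servers, which can span multiple datacenters or cloud regions.
The servers that form the storage layer are called brokers.
Kafka is highly scalable and fault-tolerant: if any server fails, the other servers take over its work, ensuring continuous operation without data loss.
Clients:
Clients let you write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner, even in the case of network problems or machine failures.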
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.
Servers: Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.
Clients: They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures.
3. Main Concepts and Terminology
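a. Events:
An event records the fact that something happened. It is also called a record or a message.
Conceptually, an event has a key, a value, a timestamp, and optional metadata headers.
Here is an example:
Event key: “Alice”
Event value: “Made a payment of $200 to Bob”
Event timestamp: “Jun. 25, 2020 at 2:06 p.m.”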
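To make the shape of an event concrete, here is a minimal sketch in Python. The `Event` class and its field names are assumptions made for illustration; they mirror the four parts listed above and are not a class from any Kafka client library.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    """Toy model of an event (record/message); hypothetical, not a Kafka client class."""
    key: Optional[str]
    value: str
    timestamp: datetime
    headers: Dict[str, str] = field(default_factory=dict)  # optional metadata headers


# The example event from the text above.
payment = Event(
    key="Alice",
    value="Made a payment of $200 to Bob",
    timestamp=datetime(2020, 6, 25, 14, 6),
)
print(payment.key, "->", payment.value)
```

Headers default to an empty mapping because they are the only optional part of the event.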
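b. Producers and consumers:
Producers are the applications that write (publish) events to Kafka, and consumers are the applications that subscribe to (read and process) those events.
In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is key to Kafka's high scalability. For example, producers never wait for consumers.
Kafka provides various guarantees, such as the ability to process events exactly-once.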
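c. Event storage (topics) and Kafka performance:
Events are organized and durably stored in topics. Put simply, a topic is similar to a folder, and the events are the files in that folder. An example topic: payments.
One topic, many producers and consumers:
A topic usually has multiple producers and subscribers: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events.
Re-consumption and retention:
Events can be read as often as needed. Unlike traditional messaging systems, events are not deleted after consumption. Instead, you can configure how long Kafka retains events for each topic.
Performance independent of data size:
Kafka's performance does not depend on how much data is stored, so keeping data for a long time does not hurt performance.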
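The difference between "deleted after consumption" and "deleted after the retention period" can be sketched with a toy append-only log. Everything here (`TopicLog`, `retention_seconds`, `enforce_retention`) is a hypothetical name for illustration; it is not how Kafka stores data on disk, only a model of the behavior described above.

```python
from typing import List, Tuple


class TopicLog:
    """Toy append-only log; illustrative only, not Kafka's storage format."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self.base_offset = 0  # offset of the oldest retained event
        self._events: List[Tuple[float, str]] = []  # (timestamp, value)

    def append(self, timestamp: float, value: str) -> int:
        """Append an event and return its offset."""
        self._events.append((timestamp, value))
        return self.base_offset + len(self._events) - 1

    def read(self, offset: int) -> List[str]:
        """Return every retained event at or after `offset`; reading deletes nothing."""
        start = max(offset - self.base_offset, 0)
        return [value for _, value in self._events[start:]]

    def enforce_retention(self, now: float) -> None:
        """Drop events older than the retention period, whether or not they were read."""
        while self._events and now - self._events[0][0] > self.retention_seconds:
            self._events.pop(0)
            self.base_offset += 1


log = TopicLog(retention_seconds=60)
for t, v in [(0, "a"), (30, "b"), (50, "c")]:
    log.append(t, v)

first_read = log.read(0)       # a consumer reads everything
second_read = log.read(0)      # the same data can be replayed from the start
log.enforce_retention(now=70)  # "a" is now older than 60 seconds
after_retention = log.read(0)  # only events inside the retention window remain
```

Each reader tracks its own offset, so reading never changes the log. Only the retention check removes data.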
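d. Topic partitions
Partitions and concurrency:
Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different brokers. This distributed placement of data matters for scalability because it lets client applications read from and write to many brokers at the same time. (Roughly, a topic supports as much parallelism as it has partitions.)
Writing to a partition chosen by event key, and reading a given partition:
When a new event is written to a topic, it is actually appended to just one of the topic's partitions. Specifically, events with the same event key (for example, a customer or vehicle ID) are written to the same partition.
Kafka guarantees that any consumer of a given topic-partition always reads that partition's events in exactly the order in which they were written.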
When a new event is published to a topic, it is actually appended to one of the topic’s partitions. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition’s events in exactly the same order as they were written.
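The "same key, same partition" rule can be sketched as a hash of the key modulo the number of partitions. This is a simplified model: `choose_partition` is a hypothetical helper, and CRC32 is a stand-in hash chosen for the sketch (the Java client's default partitioner uses a different hash, murmur2).

```python
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_PARTITIONS = 4  # assumed partition count for this sketch


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is modeled as an ordered list that is only ever appended to.
partitions: Dict[int, List[str]] = defaultdict(list)
events = [("alice", "login"), ("bob", "login"), ("alice", "pay"), ("alice", "logout")]
for key, value in events:
    partitions[choose_partition(key)].append(f"{key}:{value}")

# All of alice's events sit in one partition, in the order they were written.
alice_partition = partitions[choose_partition("alice")]
alice_events = [e for e in alice_partition if e.startswith("alice:")]
print(alice_events)
```

Ordering is only guaranteed within a partition, which is why events that must stay in order should share a key.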
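Consider an example:
The topic has four partitions. Two different producers, independent of each other, publish new events to the topic's partitions. Events with the same key are written to the same partition.
If appropriate, both producers can write to the same partition.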
Note that both producers can write to the same partition if appropriate.
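Partition fault tolerance
To make data fault-tolerant and highly available, every topic can be replicated, even across geographic regions or datacenters, so that multiple brokers hold a copy of the data in case something goes wrong.
A common production setting is a replication factor of 3.
Replication is performed at the topic-partition level.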
To make your data fault-tolerant and highly-available, every topic can be replicated, even across geo-regions or datacenters, so that there are always multiple brokers that have a copy of the data just in case things go wrong, you want to do maintenance on the brokers, and so on. A common production setting is a replication factor of 3, i.e., there will always be three copies of your data. This replication is performed at the level of topic-partitions.
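As a rough illustration of replication at the topic-partition level, the sketch below places each partition's replicas on distinct brokers and checks which partitions still have a live copy after some brokers fail. The round-robin placement, the broker names, and both function names are assumptions for this sketch; Kafka's real replica assignment and leader election are more involved.

```python
from typing import Dict, List, Set


def assign_replicas(
    num_partitions: int, brokers: List[str], replication_factor: int
) -> Dict[int, List[str]]:
    """Round-robin placement; a simplification, not Kafka's real assignment logic."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def available_partitions(assignment: Dict[int, List[str]], failed: Set[str]) -> List[int]:
    """Partitions that still have at least one replica on a live broker."""
    return [p for p, replicas in assignment.items() if any(b not in failed for b in replicas)]


brokers = ["broker-1", "broker-2", "broker-3", "broker-4"]  # hypothetical broker names
assignment = assign_replicas(num_partitions=4, brokers=brokers, replication_factor=3)

# With three copies per partition, losing two brokers still leaves a copy of everything.
survivors = available_partitions(assignment, failed={"broker-1", "broker-2"})
print(survivors)
```

With a replication factor of 3, a partition becomes unavailable in this model only if all three brokers holding it fail at once.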
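4. Kafka APIs
Kafka has five core APIs for Java and Scala:
- Admin API: manage and inspect topics, brokers, and other Kafka objects.
- Producer API: publish a stream of events to one or more Kafka topics.
- Consumer API: subscribe to one or more topics and process the stream of events produced to them.
- Kafka Streams API: implement stream processing applications and microservices. It provides higher-level functions for processing event streams, including transformations, stateful operations such as aggregations and joins, windowing, event-time processing, and more. It reads input from one or more topics and writes output to one or more topics.
- Kafka Connect API: build and run reusable data import/export connectors. These connectors read or write streams of events from and to external systems and applications so that they can integrate with Kafka.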
- The Admin API to manage and inspect topics, brokers, and other Kafka objects.
- The Producer API to publish (write) a stream of events to one or more Kafka topics.
- The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
- The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.
- The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka.
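The "stateful operation" idea behind the Kafka Streams API can be sketched without any Kafka dependency: read events from an input topic, keep a running aggregate per key, and emit each update to an output topic. The lists and variable names below are hypothetical stand-ins for topics and state stores; this sketch does not use or imitate the actual Kafka Streams API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Plain lists stand in for an input topic and an output topic.
input_topic: List[Tuple[str, int]] = [("alice", 200), ("bob", 50), ("alice", 25)]
output_topic: List[Tuple[str, int]] = []

# Per-key state: the running total of payment amounts (a stateful aggregation).
totals: Dict[str, int] = defaultdict(int)

for customer, amount in input_topic:
    totals[customer] += amount
    # Each input event yields an updated aggregate on the output stream.
    output_topic.append((customer, totals[customer]))

print(output_topic)
```

The input stream of payments is transformed into an output stream of per-customer totals, which is the input-to-output shape the description above refers to.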
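5. Use cases
- Buffering and peak shaving: helps control and optimize the rate at which data flows through a system, and handles mismatches between the speed of producing messages and the speed of consuming them.
- Decoupling: lets you scale or modify the processing on either side independently, as long as both sides honor the same interface contract.
- Asynchronous communication: lets a user put a message on the queue without processing it immediately, and process it later when needed.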
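Peak shaving can be illustrated with a small tick-based simulation: a burst of messages arrives at once, the consumer handles a fixed number per tick, and a queue absorbs the difference. The `simulate` function and its parameters are hypothetical, and a `deque` stands in for the message queue.

```python
from collections import deque
from typing import List, Tuple


def simulate(arrivals: List[int], capacity_per_tick: int) -> Tuple[List[int], int]:
    """Return the backlog after each tick and the total number of processed messages."""
    queue: deque = deque()
    backlog: List[int] = []
    processed = 0
    for arriving in arrivals:
        queue.extend(range(arriving))  # the producer enqueues without waiting
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()  # the consumer drains at its own pace
            processed += 1
        backlog.append(len(queue))
    return backlog, processed


# A burst of 10 messages arrives in the first tick; the consumer handles 3 per tick.
backlog, processed = simulate(arrivals=[10, 0, 0, 0, 0], capacity_per_tick=3)
print(backlog, processed)
```

The producer finishes its burst in a single tick, while the consumer never handles more than three messages per tick and the backlog drains over the following ticks.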