Deep Dive into Apache Kafka: Your go-to Event Streaming Framework.
Understand what makes Kafka a de-facto standard for event streaming platforms.
Apache Kafka is a “distributed commit log” used to solve highly scalable problems, offering high throughput, low latency, fault tolerance, horizontal scalability, and exceptional event-handling capabilities for building real-time data pipelines and streaming engines.
Architecture
Kafka is all about data streaming. Its bare bones are brokers, topics, producers, and consumers, combined with redundant storage of massive data volumes and a message bus capable of throughput reaching billions of messages per day.
Let's break down the architecture from the inside out:
1. Event:
An event is a record of something that happened. It is a key-value message along with other attributes such as a timestamp, metadata, and headers.
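As a quick sketch of what an event carries, the snippet below produces one message with a key, a value, and headers. It uses the confluent-kafka Python client purely for illustration (the article itself only shows CLI commands); the key, payload, and header values are made up, and the topic is the eventPool topic created later in the article.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# An event is a key-value pair plus extra attributes such as headers;
# a timestamp is attached automatically if one is not supplied.
producer.produce(
    "eventPool",
    key="user-1001",              # hypothetical key
    value='{"action": "login"}',  # hypothetical value (payload)
    headers={"source": "web"},    # optional metadata headers
)
producer.flush()
```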
2. Topic:
Kafka topics are sequential collections of immutable events (objects) that are continually appended as logs and distributed across partitions with a first-in-first-out guarantee.
Topics in Kafka are similar to database tables but do not carry all of the database constraints. You can create as many topics as you want; there is no limit on the number. A topic is identified by its name, and the naming convention is entirely up to the user. You can store events in a topic indefinitely or temporarily depending on the requirement, and thanks to Kafka's durability, topics will act just like database tables.
Command to create a Kafka topic: bin/kafka-topics.sh --create --topic eventPool --bootstrap-server localhost:9092 --partitions 2 --replication-factor 1
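The same topic can also be created programmatically. The sketch below uses the confluent-kafka Python client (an assumption; the article only shows the CLI) and mirrors the command above: two partitions, replication factor 1.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Same settings as the kafka-topics.sh command: 2 partitions, replication factor 1.
topic = NewTopic("eventPool", num_partitions=2, replication_factor=1)

# create_topics() is asynchronous and returns a dict of topic name -> future.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # block until the broker confirms creation
        print(f"Created topic {name}")
    except Exception as exc:
        print(f"Failed to create topic {name}: {exc}")
```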
Partitions:
Partitions play an important role in Kafka's concurrency mechanism. Load scaling in Kafka is achieved by breaking a topic into multiple partitions. Each topic can have one or more partitions; there is no hard limit on the number, only practical ones. The size of a partition is limited to what can fit on a single node, so increase the number of partitions if a topic holds more data than a single node can store. Partitions can also have copies (replicas) to increase durability and availability.
Replication factor:
The replication factor is the number of copies of the data kept across multiple brokers to protect against data loss. Kafka uses replication for failover; you need a replication factor of at least 3 to recover from failures. Kafka guarantees that every replica of a partition resides on a different broker (whether it is the leader or a follower), so the maximum replication factor is the number of brokers in the cluster.
Log segments:
Topic partition data is split into log segments, and there is a file on disk for each segment. Messages are appended to the currently active segment. Segments have size and time constraints; once a threshold is reached, the corresponding file is closed and a new one is opened automatically. By splitting a topic partition's data into segments, Kafka can purge and compact messages from the non-active segments.
Data retention:
Data retention is controlled by the Kafka server through per-topic configuration parameters. Data can be retained according to the following criteria:
- The size limit of the data being held for each topic partition.
- The amount of time that data is to be retained.
When either of these limits is reached, Kafka purges the data from the system.
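As a rough sketch, both criteria map to the per-topic settings retention.ms and retention.bytes, which can be supplied when the topic is created. The example again uses the confluent-kafka Python client (an assumption), and the specific limits are illustrative.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Hypothetical topic: data is purged once it is older than 7 days (retention.ms)
# or once a partition exceeds roughly 1 GiB (retention.bytes), whichever comes first.
topic = NewTopic(
    "eventPool",
    num_partitions=2,
    replication_factor=1,
    config={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # time-based retention
        "retention.bytes": str(1024 * 1024 * 1024),    # size-based retention per partition
    },
)
admin.create_topics([topic])
```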
3. Broker:
Kafka brokers are the Kafka servers responsible for the exchange of information, i.e., receiving records from producers, assigning offsets to them, and committing/writing the records to partition logs.
Command to start the Kafka broker service: bin/kafka-server-start.sh config/server.properties
A broker is like a container that holds topics, including their partitions.
In a sense, connecting with one broker implies you are connected to all brokers forming the cluster. A broker does not contain the whole data set, but each broker in the cluster knows about all other brokers, partitions, and topics, so each node in the cluster knows when a new broker joins, a broker dies, a topic is removed or added, and so on.
Each Kafka Broker has a unique ID (number).
Kafka brokers contain topic log partitions. For safety and backup, you want to start with at least three to five brokers. A Kafka cluster can have thousands of brokers if needed.
Leader:
Every partition has exactly one partition leader which handles all the read/write requests of that partition.
If the replication factor is greater than 1, the additional partition replicas act as partition followers.
Every partition follower reads messages from the partition leader (acting like a kind of consumer) and does not serve consumers of that partition itself.
Sync:
A partition follower is considered in-sync if it keeps reading data from the partition leader without lagging behind and without losing its connection to ZooKeeper (the maximum lag defaults to 10 seconds and the ZooKeeper timeout to 6 seconds; both are configurable). If a partition follower lags behind, it is considered out-of-sync.
Leader election:
When a partition leader shuts down for any reason, one of its in-sync partition followers becomes the new leader.
Brokers are often employed in the following ways:
- Financial transactions and payment processing.
- E-commerce order processing and fulfillment.
- Protecting highly sensitive data at rest and in transit.
4. Consumer:
Kafka consumers are responsible for reading records from one or more topics and partitions.
Command to create a Kafka consumer: bin/kafka-console-consumer.sh --topic eventPool --from-beginning --bootstrap-server localhost:9092
Offset:
Kafka maintains an offset for each message in a partition. The last offset that has been saved is the committed position; after a failure or restart, consumption is recovered based on this offset. A Kafka consumer can either commit offsets automatically at a periodic interval or control the committed position manually.
Group coordinator:
The group coordinator is one of the brokers; it receives heartbeats (or message polls) from all consumers of a consumer group. Every consumer group has a group coordinator. If a consumer stops sending heartbeats, the coordinator will trigger a rebalance. This design also gives you more flexible and extensible assignment policies without restarting the broker.
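To make the offset and consumer-group behaviour concrete, here is a minimal consumer sketch using the confluent-kafka Python client (the client choice and the group name eventPool-readers are assumptions, not from the article). It disables periodic auto-commit and commits the position manually after each message.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "eventPool-readers",  # hypothetical consumer group name
    "auto.offset.reset": "earliest",  # start from the beginning if no committed offset exists
    "enable.auto.commit": False,      # commit offsets manually instead of periodically
})
consumer.subscribe(["eventPool"])

try:
    while True:
        msg = consumer.poll(1.0)  # fetch the next message; regular polling keeps the consumer active in its group
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"offset={msg.offset()} key={msg.key()} value={msg.value()}")
        consumer.commit(message=msg)  # record the committed position used for recovery after a restart
finally:
    consumer.close()
```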
5. Producer:
Kafka producers are responsible for appending streams of records to one or more topics/partitions in the Kafka cluster.
Command to create a Kafka producer: bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
Flushing:
flush() is a method that waits for all messages in the producer queue to be delivered. Data is not flushed immediately when it is written to a log segment; Kafka utilizes the lazy flush feature of the underlying OS to write the data to disk, which results in better performance. flush() will block until the previously sent messages have been delivered (or errored), effectively making the producer synchronous.
Polling:
poll() is a method that polls the producer for events and calls the corresponding callbacks (if registered).
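A small producer sketch tying produce(), poll(), and flush() together, written against the confluent-kafka Python client as an assumption (the delivery callback name is made up for illustration):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Delivery report callback, invoked from poll() or flush().
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

for i in range(5):
    # produce() is asynchronous: the message only lands in an internal queue here.
    producer.produce("eventPool", key=f"key-{i}", value=f"event-{i}", callback=on_delivery)
    # poll() serves delivery callbacks for messages that have completed.
    producer.poll(0)

# flush() blocks until every queued message has been delivered or has errored,
# effectively making the producer synchronous at this point.
producer.flush()
```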
Message ordering:
In Kafka, message ordering is guaranteed per partition only.
All messages are stored and delivered in the order in which they are received, regardless of how busy the consumer side is, but this guarantee holds only within a partition; Kafka does not guarantee the ordering of messages across partitions. A message cannot be sent with a priority level, nor be delivered in priority order.
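Because the default partitioner hashes the message key to pick a partition, sending related events with the same key is a simple way to keep their relative order. A short sketch with the confluent-kafka Python client (the key order-42 is purely illustrative):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# All messages with the same key hash to the same partition under the default
# partitioner, so consumers see them in the order they were produced.
for step in ("created", "paid", "shipped"):
    producer.produce("eventPool", key="order-42", value=step)

producer.flush()
```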
Acknowledgments (ACKs):
Before a write is considered complete, a producer client can choose how many replicas must have received the data in memory. The producer client provides configuration to set up this acknowledgment.
The acknowledgment behavior can be configured on the producer in one of three ways:
- It doesn’t wait for a reply/ack (ACKs=0).
- It waits for a reply saying that the leader broker has received the message (ACKs=1).
- It waits for a reply saying that all the in-sync replica brokers have received the message (ACKs=all).
In the case where there are 3 replicas, if the producer has specified ACKs=all, it will wait for all 3 replicas to receive the data. Thus, once the producer has written the data, if a broker fails, there are 2 remaining brokers that also have copies of the data. When the failed broker comes back online, there is a period of “catch up” until it is in sync with the other brokers. However, with ACKs=0 or ACKs=1, there is a greater risk of losing data if a broker failure occurs before the data has reached all of the replicas.
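The acknowledgment level is just a producer configuration property. A minimal sketch, again assuming the confluent-kafka Python client:

```python
from confluent_kafka import Producer

# "all" waits for every in-sync replica to acknowledge the write (strongest durability,
# highest latency); "1" waits for the leader only; "0" does not wait at all.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
})

producer.produce("eventPool", value="durable event")
producer.flush()
```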
Good to know:
- Kafka uses a custom protocol on top of TCP/IP for communication between applications and the cluster.
- Kafka does not support routing.
- You can create dynamic routing yourself with the help of Kafka Streams, where you dynamically route events to topics, but it’s not a default feature.
- Kafka is designed for holding and distributing large volumes of messages.
- Kafka retains large amounts of data with very little overhead.
- Only the partition leader serves reads/writes.