Evan Harmon - Memex

Apache Kafka

Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation and written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
wikipedia:: Apache Kafka
  • a distributed event store and stream-processing platform.
  • aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
  • uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
  • Jay Kreps named the software after the author Franz Kafka because it is "a system optimized for writing", and he liked Kafka's work. [6] (https://en.wikipedia.org/wiki/Apache_Kafka#cite_note-6)
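
The "message set" idea in the notes above can be illustrated with a toy batcher. This is only a minimal sketch, not Kafka's wire protocol or client API: `MessageBatcher`, `batch_size`, and the `send` callback are hypothetical names, and a plain Python list stands in for the network.

```python
from typing import Callable, List


class MessageBatcher:
    """Toy model: group messages so one simulated round trip carries many."""

    def __init__(self, send: Callable[[List[bytes]], None], batch_size: int = 3) -> None:
        self._send = send              # hypothetical transport callback
        self._batch_size = batch_size  # hypothetical tuning knob
        self._pending: List[bytes] = []

    def add(self, message: bytes) -> None:
        """Buffer a message and flush once the batch is full."""
        self._pending.append(message)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single batch."""
        if self._pending:
            self._send(self._pending)
            self._pending = []         # start a fresh buffer


round_trips: List[List[bytes]] = []    # each entry is one simulated network send
batcher = MessageBatcher(round_trips.append, batch_size=3)
for i in range(7):
    batcher.add(f"event-{i}".encode())
batcher.flush()                        # send the final partial batch
```

Here seven messages go out in three sends instead of seven, which is the kind of saving the batching abstraction is after.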

Architecture

Streaming means near-real-time processing: data is handled as it arrives rather than in delayed batches ("chunks").

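Immutable: once written, a message can't be changed.
Append-only: new messages are added only at the end of a topic (more precisely, at the end of one of its partitions).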
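
A small in-memory model may help picture the immutable, append-only behaviour. It is a sketch under stated assumptions, not how Kafka stores data on disk: `Record` and `AppendOnlyLog` are hypothetical names, and a frozen dataclass is used to imitate immutability.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """An immutable record; assigning to a field raises FrozenInstanceError."""
    offset: int
    key: Optional[str]
    value: str


class AppendOnlyLog:
    """Toy log: records can only be added at the end, never edited or removed."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: Optional[str], value: str) -> int:
        """Add a record at the end and return its position in the log."""
        offset = len(self._records)
        self._records.append(Record(offset, key, value))
        return offset

    def read(self, offset: int) -> Record:
        return self._records[offset]


log = AppendOnlyLog()
first = log.append("user-1", "signed_up")
second = log.append("user-1", "logged_in")

try:
    log.read(first).value = "edited"   # attempt to change an existing record
    mutated = True
except FrozenInstanceError:
    mutated = False                    # the write is rejected
```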

Processes called producers

  • Kafka stores key-value messages that come from arbitrarily many processes called producers.
    Consumer
    Topic
    Partitions

  • Partitions within a topic

  • Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages in a fault-tolerant fashion and has allowed it to replace some of the conventional messaging systems like Java Message Service (JMS), Advanced Message Queuing Protocol (AMQP), etc.
    Message
    Key-value
    Offset
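
The relationship between brokers, partitions, and replicas can be sketched as follows. The round-robin placement and the names `assign_replicas`, `surviving_partitions`, and `replication_factor` are illustrative assumptions; Kafka's actual replica assignment logic is different and more involved.

```python
from typing import Dict, List


def assign_replicas(
    brokers: List[str], num_partitions: int, replication_factor: int
) -> Dict[int, List[str]]:
    """Place each partition's replicas on consecutive brokers (toy round-robin)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


def surviving_partitions(assignment: Dict[int, List[str]], failed_broker: str) -> List[int]:
    """Partitions that still have at least one replica if one broker fails."""
    return [
        partition
        for partition, replicas in assignment.items()
        if any(broker != failed_broker for broker in replicas)
    ]


assignment = assign_replicas(
    ["broker-0", "broker-1", "broker-2"], num_partitions=4, replication_factor=2
)
still_available = surviving_partitions(assignment, "broker-1")
```

With two replicas per partition, every partition in this toy cluster still has a copy after a single broker fails, which is the fault tolerance the notes describe.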

  • Within a partition, messages are strictly ordered by their offsets (the position of a message within a partition), and indexed and stored together with a timestamp.
    Kafka Streams API
    Cluster runs on servers called brokers
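
Offset ordering within a partition can be pictured with a toy reader that tracks its own position. This is a minimal sketch, not the Kafka consumer API: `Partition`, `read_from`, and `max_messages` are hypothetical names, and the fixed timestamps are made up for the example.

```python
import time
from typing import List, NamedTuple, Optional


class Message(NamedTuple):
    offset: int        # position of the message within the partition
    timestamp_ms: int  # stored alongside the message
    key: str
    value: str


class Partition:
    """Toy partition: messages are kept in strict offset order."""

    def __init__(self) -> None:
        self._messages: List[Message] = []

    def append(self, key: str, value: str, timestamp_ms: Optional[int] = None) -> int:
        if timestamp_ms is None:
            timestamp_ms = int(time.time() * 1000)
        offset = len(self._messages)
        self._messages.append(Message(offset, timestamp_ms, key, value))
        return offset

    def read_from(self, offset: int, max_messages: int = 10) -> List[Message]:
        """Return up to max_messages starting at the given offset, in order."""
        return self._messages[offset:offset + max_messages]


partition = Partition()
for value in ["a", "b", "c", "d"]:
    partition.append("k", value, timestamp_ms=1_000)

position = 0                                  # the reader's current offset
batch = partition.read_from(position, max_messages=3)
position = batch[-1].offset + 1               # resume after the last message read
rest = partition.read_from(position)
```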

  • For stream processing, Kafka offers the Streams API that allows writing Java applications that consume data from Kafka and write results back to Kafka.

  • Apache Kafka also works with external stream processing systems such as Apache Apex, Apache Beam, Apache Flink, Apache Spark, Apache Storm and Apache NiFi.
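
The consume, transform, write-back shape of a stream processing application can be imitated in a few lines. The Streams API itself is a Java library; the sketch below is not that API. It uses in-memory lists as stand-ins for topics, and `topics`, `run_topology`, and the topic names are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)

# In-memory stand-ins for an input topic and an output topic.
topics: Dict[str, List[Record]] = {
    "sentences": [("s1", "hello kafka"), ("s2", "hello streams")],
    "word-counts": [],
}


def run_topology(source: str, sink: str) -> None:
    """Read records from source, keep a running word count, write updates to sink."""
    counts: Dict[str, int] = {}
    for _key, sentence in topics[source]:
        for word in sentence.split():
            counts[word] = counts.get(word, 0) + 1
            # Emit the updated count as a new record keyed by the word.
            topics[sink].append((word, str(counts[word])))


run_topology("sentences", "word-counts")
```

Each input record produces one or more output records, so the results land back in a topic where other consumers could read them.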

Aiven Workshop
