Developer guide
Kafka
Integrating microservices via Kafka.
Overview
Kafka is a publish-subscribe messaging system designed for event-driven architectures. It is widely used in large-scale microservices environments to facilitate real-time communication of events across multiple services and other real-time applications. Kafka's ability to handle high-throughput, low-latency data streams reliably makes it ideal for scenarios that require real-time data processing and scalable event-driven workflows.
Purpose
Within Codr, Kafka serves three primary purposes:
- Event Streams: Kafka enables the continuous flow of events between different components within Codr and with external systems. This capability supports real-time data processing and event-driven workflows, ensuring timely updates and interactions between Codr's services and external entities.
- State Streams: Kafka facilitates the management of stateful data within Codr and its integration with external systems. It maintains and updates the current state of entities, ensuring consistency and synchronization across services and external interfaces.
- Dead Letter Streams: Dead letter streams in Kafka within Codr are crucial for error handling. They capture messages that cannot be processed initially and provide mechanisms for retrying or managing errors, thereby enhancing the reliability and resilience of event-driven processes within Codr and external systems.
The publish-subscribe model of Kafka empowers Codr to efficiently produce, consume, and process data in real time, supporting dynamic scalability and responsiveness across its interactions with both internal components and external systems.
Abstract message model
The message interface below is designed to be extended so that all messages published within Codr are consistent. A message consists of a set of headers and a record. Headers identify the message and carry related metadata. The record is the serialized payload of the message.
interface IMessageHeaders {
  key: ObjectId; // The message key; this should always be the entity key held by the record.
  traceId: UUID; // A unique identifier to help trace data flows across systems.
  processedAt: Date; // The timestamp the message was processed, in ISO format.
  topic?: string; // Kafka adds the topic the message is produced to.
  partition?: number; // Kafka adds the partition to the headers.
  offset?: number; // Kafka adds the offset to the headers.
}
interface IMessage<T> {
  headers: IMessageHeaders;
  record: T;
}
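As a minimal sketch of this model (assuming ObjectId comes from the bson package, UUID is a string alias, and the IUser entity is purely illustrative), a concrete message could look like this:

import { randomUUID } from "crypto";
import { ObjectId } from "bson"; // assumed source of ObjectId

// Hypothetical entity used only for illustration.
interface IUser {
  _id: ObjectId;
  email: string;
}

const user: IUser = { _id: new ObjectId(), email: "jane@example.com" };

// A message conforming to IMessage<IUser>.
const message: IMessage<IUser> = {
  headers: {
    key: user._id, // the entity key held by the record
    traceId: randomUUID(), // propagated across systems for tracing
    processedAt: new Date(), // serialized to ISO format when produced
  },
  record: user,
};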
Event topics
When a CRUD operation (create, read, update, delete) is performed on an entity, an event is produced to an event stream. These events are primarily utilized by the audit and notification systems, but they are designed abstractly so that they can be used in any type of workflow pattern.
Event message model
The event record interface below is modeled for the auditability use case first, but should be abstract enough to be used for any workflow use case.
/**
 * This is the content of the message record.
 */
interface IEventRecord<Entity> {
  type: Resource; // Where: the entity type that was modified.
  action: Action; // How: the CRUD action that was taken.
  userId: ObjectId; // Who: which user made the change.
  payload: Partial<Entity>; // What: the data that was modified, a subset of the entity.
  // "When" is expressed by the extended IMessage interface.
}
type IEventMessage<Entity> = IMessage<IEventRecord<Entity>>;
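As a sketch of producing such an event (assuming the kafkajs client, which this guide does not prescribe, plus the illustrative IUser entity and broker address from the sketch above):

import { Kafka } from "kafkajs"; // assumed client library

const kafka = new Kafka({ clientId: "codr-user-svc", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Publish a user event to the user event topic (see the naming conventions below).
async function publishUserEvent(event: IEventMessage<IUser>): Promise<void> {
  await producer.connect(); // in practice, connect once at service startup
  await producer.send({
    topic: "codr.event.user",
    messages: [
      {
        key: event.headers.key.toHexString(), // the entity ObjectId as the Kafka message key
        value: JSON.stringify(event.record), // the serialized event record
        headers: {
          traceId: event.headers.traceId,
          processedAt: event.headers.processedAt.toISOString(),
        },
      },
    ],
  });
}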
Use cases
- Auditing: When an event occurs, a message expressing the state change is produced. These messages are consistent across all systems so that the auditing service can consume all messages without error.
- Automation: One way that Codr uses Kafka for automation is message delivery via event streams. For example, when an organization administrator creates a new user, the notification system can consume the event message, generate a welcome email, and send it to the new user.
State topics
Unlike event topics, state topics contain the entire contents of an entity.
State message model
The state record interface below has only a payload field, which contains the full entity state.
interface IStateRecord<Entity> {
  payload: Entity;
}
type IStateMessage<Entity> = IMessage<IStateRecord<Entity>>;
Use cases
- Data syncing: because state streams contain the entire contents of an entity, these messages can be used to write data directly to the database. A good example is streaming in bulk data, such as a 100k+ sample dataset; see the consumer sketch below.
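A minimal consumer sketch for this use case, assuming the kafkajs and mongodb clients and illustrative topic, group, and collection names:

import { Kafka } from "kafkajs"; // assumed client library
import { MongoClient } from "mongodb"; // assumed datastore, per the DLT naming examples below

// Consume full-state messages and upsert each entity into the database.
async function runStateSync(): Promise<void> {
  const kafka = new Kafka({ clientId: "codr-sync-svc", brokers: ["localhost:9092"] });
  const consumer = kafka.consumer({ groupId: "codr-sync-svc" });
  const mongo = await MongoClient.connect("mongodb://localhost:27017");
  const samples = mongo.db("codr").collection("samples"); // illustrative collection

  await consumer.connect();
  await consumer.subscribe({ topic: "codr.state.sample", fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return; // tombstones on compacted topics carry no value
      const { payload } = JSON.parse(message.value.toString()); // the IStateRecord payload
      // Note: in practice, _id would need to be revived into an ObjectId after JSON.parse.
      // Write the entire entity state; last write wins, matching compaction semantics.
      await samples.replaceOne({ _id: payload._id }, payload, { upsert: true });
    },
  });
}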
Dead letter topics
Dead letter topics, commonly referred to as DLTs, are streams that consist of failed messages. This is an error handling concept used with Kafka that, when applied correctly, prevents data changes from being lost.
Dead letter message model
The dead letter interface below is designed to contain the entire original message plus some extra metadata to facilitate reprocessing.
type StreamType = "state" | "event";

interface IDeadLetterHeaders extends IMessageHeaders {
  stream: StreamType; // the type of stream that threw an exception
  topic: string; // the topic the message errored from
}

interface IDeadLetterMessage<Entity> {
  // the extended headers compared to a normal message.
  headers: IDeadLetterHeaders;
  // the original record produced.
  record: IEventRecord<Entity> | IStateRecord<Entity>;
}
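A sketch of routing a failed message to a DLT (assuming the kafkajs Producer from the earlier sketch; the function name and arguments are illustrative):

import { Producer } from "kafkajs"; // assumed client library

// Wrap the original message with dead-letter headers and forward it to the
// service's DLT (e.g. codr.svc.organization.dlt, per the conventions below).
async function sendToDeadLetter<Entity>(
  producer: Producer,
  dltTopic: string,
  stream: StreamType,
  sourceTopic: string,
  original: IEventMessage<Entity> | IStateMessage<Entity>,
): Promise<void> {
  const deadLetter: IDeadLetterMessage<Entity> = {
    headers: {
      ...original.headers,
      stream, // the type of stream that threw the exception
      topic: sourceTopic, // the topic the message errored from
    },
    record: original.record,
  };

  await producer.send({
    topic: dltTopic,
    messages: [
      {
        key: deadLetter.headers.key.toHexString(),
        value: JSON.stringify(deadLetter.record),
        headers: {
          traceId: deadLetter.headers.traceId,
          stream: deadLetter.headers.stream,
          topic: deadLetter.headers.topic,
        },
      },
    ],
  });
}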
Use cases
- Alerting: to improve the supportability of the system, DLTs can be used in conjunction with observability systems such as Grafana and Prometheus to trigger alerts once an error threshold has been hit.
- Reprocessing: in the event that a bug causes messages to be sent to a DLT, that topic can be used to reprocess the errored entities once a hotfix or regular release has shipped.
Partitioning
The notes below describe how message keys and partitions affect where messages land and how retention behaves.
Message keys
Each message produced to a topic is assigned a "key"; this key should be the ObjectId of the entity being processed. The message key is important because it affects which partition a message goes to and how the retention policy functions.
- A key is assigned to exactly one partition. Once a partition is assigned, every message with that key will always go to that partition, as illustrated in the sketch after this list.
- Depending on the retention policy, the key can play an important role. For topics with a compact policy, Kafka will compact the messages on the topic down to the latest message produced for each key. That is, if a key is produced three times to a topic with a one-week compaction policy and a week then passes, only the newest message for that key is kept on disk.
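A conceptual sketch of key-based partitioning (this is not Kafka's actual partitioner, which uses a murmur2 hash; it only illustrates that a deterministic hash modulo the partition count pins a key to one partition):

// Deterministically map a key to a partition: the same key always yields
// the same partition for a given partition count.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const char of key) {
    hash = (hash * 31 + char.charCodeAt(0)) | 0; // simple string hash
  }
  return Math.abs(hash) % numPartitions;
}

partitionFor("64f1c2ab9d3e4a0012345678", 8); // always the same partition for this key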
Conventions
Topic Naming
Topic naming convention: codr.{state|event}.{entity}, where {state|event} depicts what type of topic it is and {entity} is the name of the resource.
There might be a use case in the future for "workflow" topics, but the current thinking is that events should be consumed by other services to trigger further processing, creating workflows that way. In the event that workflow topics are needed, the naming convention should be: codr.workflow.{entity}.{task}, where {task} is the name of the task that will process the data.
Dead letters should be service-specific. The naming convention for DLTs should be: codr.svc.{service}.dlt, where {service} is the name of a domain (e.g. user for the user domain) or the name of an external system (e.g. mongo for MongoDB). An example of a DLT name could be: codr.svc.organization.dlt for the organization domain dead letter stream.
For entities or tasks that have spaces in their names, use hyphens/dashes (-) rather than spaces. For example, the "User Preference" entity should be represented as user-preference, and in the context of a state topic name, it should look like codr.state.user-preference.
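A small helper sketch that encodes the naming conventions above (the helper names themselves are illustrative):

type TopicKind = "state" | "event";

// Build a topic name such as codr.state.user-preference.
function topicName(kind: TopicKind, entity: string): string {
  const slug = entity.trim().toLowerCase().replace(/\s+/g, "-"); // "User Preference" -> "user-preference"
  return `codr.${kind}.${slug}`;
}

// Build a service-specific dead letter topic name such as codr.svc.organization.dlt.
function deadLetterTopic(service: string): string {
  return `codr.svc.${service}.dlt`;
}

topicName("event", "user"); // "codr.event.user"
topicName("state", "User Preference"); // "codr.state.user-preference"
deadLetterTopic("organization"); // "codr.svc.organization.dlt"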
Topic Sizing
Depending on the predicted throughput of messages on a topic, we may want more or fewer partitions. A good rule of thumb is to have at least one partition per service replica. For instance, if we have 8 annotation service replicas, we should have at minimum 8 partitions.
Because of how partitioning works, the number of partitions should be configured for the projected scalability expected of the system.
At a very minimum, every service in Codr shall have 2 replicas, meaning there should be at least 2 partitions.
Topic Retention
Retention policies are important to set up when creating topics. Here are a few good rules:
- State topics should be set to compact after one week.
- Event and dead letter topics should be set to delete messages after two weeks.
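A sketch of creating topics with these sizing and retention guidelines, assuming the kafkajs admin client; the broker address and the mapping of "compact after one week" onto max.compaction.lag.ms are assumptions:

import { Kafka } from "kafkajs"; // assumed client library

const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;
const TWO_WEEKS_MS = 2 * ONE_WEEK_MS;

// Create a compacted state topic and a delete-policy event topic.
async function createUserTopics(): Promise<void> {
  const admin = new Kafka({ clientId: "codr-admin", brokers: ["localhost:9092"] }).admin();
  await admin.connect();

  await admin.createTopics({
    topics: [
      {
        topic: "codr.state.user",
        numPartitions: 2, // at least one partition per service replica (minimum of 2)
        configEntries: [
          { name: "cleanup.policy", value: "compact" },
          { name: "max.compaction.lag.ms", value: String(ONE_WEEK_MS) }, // compact within one week
        ],
      },
      {
        topic: "codr.event.user",
        numPartitions: 2,
        configEntries: [
          { name: "cleanup.policy", value: "delete" },
          { name: "retention.ms", value: String(TWO_WEEKS_MS) }, // delete after two weeks
        ],
      },
    ],
  });

  await admin.disconnect();
}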
Definitions
- Topic: A named data stream to which records (messages) are published by producers. Consumers subscribe to one or more topics to consume records.
- Message: A unit of data in Kafka. It represents the information published to a topic by a producer and consumed by consumers.
- Producer: A client application that publishes records (messages) to Kafka topics.
- Consumer: A client application that subscribes to topics and processes the feed of published messages.
- Partition: A partition is a unit of parallelism in Kafka. Each topic is divided into one or more partitions, and each partition can be hosted on a different broker. Partitions allow Kafka to scale horizontally and handle large amounts of data.
- Offset: A unique identifier given to each message within a partition. Offsets are used by consumers to keep track of their position in the partition's log.
- Log: An ordered, append-only sequence of records (messages) that are stored on disk. Each topic in Kafka is divided into one or more logs, where each log corresponds to a partition of the topic. The log retains messages for a configurable retention period or until a size limit is reached.
- Consumer Group: A group of consumer instances that jointly consume a set of subscribed topics. Each message within a partition is delivered to only one consumer instance within a consumer group.
- Broker: A Kafka server responsible for storing and managing the topics, handling requests from producers and consumers, and maintaining the replicated log.
- Replication: Kafka provides replication of partitions across multiple brokers for fault tolerance. Each partition has one leader and one or more followers, ensuring that data is not lost even if a broker fails.