- ETL
- Extract, Transform, Load.
- Source Connector
- A Source connector is a connector that extends SourceConnector and is used by Kafka Connect to pull data into a Kafka cluster. In the typical ETL pattern, a SourceConnector is used to extract data from a source system. Source systems can be anything from a relational database to a remote web service.
- Sink Connector
- A Sink connector is a connector that extends SinkConnector and is used by Kafka Connect to push data from a Kafka cluster into a target system. In the typical ETL pattern, a SinkConnector is used to load data into a target system.
- Transformation
- A Transformation is used to make changes to data on the fly before it is written to Kafka or written to the target system. This could be masking a field, or setting the value of a field.
- Distributed Mode
- Distributed mode is a configuration of Kafka Connect that allows a user to configure multiple worker nodes to share the workload of multiple connectors. In the event of a failure of one of the worker nodes, the running connectors will be distributed among the remaining worker nodes. This allows a high degree of fault tolerance.
- Standalone Mode
- Standalone mode is a configuration of Kafka Connect that runs as a single node. This mode is typically used with listening-style connectors like the syslog and Splunk connectors. This is not a fault-tolerant mode. If fault tolerance is required, it must be handled externally to Kafka Connect. For example, with the syslog connector you can place a load balancer in front of multiple standalone nodes.
- Kafka Topic
- A Kafka topic is a named, logical stream of records in a Kafka cluster. A topic is made up of one or more partitions, which can be spread across several Kafka brokers.
- Kafka Broker(s)
- A Kafka broker is a server in a Kafka Cluster which is used to store data.
- Partition(s)
- A partition is a logical breakdown of a topic which allows the data to be hosted across several machines. Kafka does not guarantee a global order across a topic, but it does guarantee order within a partition. By default, the partition for a keyed message is selected based on a hash of the key.
- A method for securing data between two hosts.
- Schema Registry
- The Schema Registry is a service, used through its serializers and deserializers, that integrates with Apache Avro to track the schemas associated with data in a topic. It is also used to ensure the proper schema is used when writing to a topic.
- Avro
- Apache Avro is a binary serialization system. Its compact binary encoding typically reduces the size of the data written to disk compared with text formats such as JSON, which means less I/O and lower storage requirements.
- Key
- The key is the part of the message that is used to determine the partition that data will be written to. If the topic is configured as a compacted topic, then only the last version of a key will be stored.
- Value
- The value is the part of the message that usually contains the bulk of its content. It typically contains the fields that represent the data.
- CDC
- Change Data Capture.
- Compacted Topic
- A compacted topic retains the last value written for each key. This is used heavily in CDC use cases.
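
The Key, Partition(s), and Compacted Topic entries can be illustrated with a small, self-contained sketch. It is a simulation in plain Python, not Kafka code: `choose_partition` and `compact` are hypothetical helper names, and CRC32 is only a stand-in for a stable hash (Kafka's own producer uses a different hash function). It also ignores timing: real log compaction runs in the background, so older values for a key may remain readable for a while.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    CRC32 is an assumption made for this illustration; it is not the
    hash function Kafka itself uses.
    """
    return zlib.crc32(key) % num_partitions


def compact(log):
    """Keep only the most recent value for each key.

    `log` is a list of (key, value) pairs in write order. The result
    keeps the surviving records in the order they were written.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        # A later record for the same key replaces the earlier one.
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


# Three writes, two of them for the same key.
log = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "c@example.com"),
]

compacted = compact(log)
# compacted == [("user-2", "b@example.com"), ("user-1", "c@example.com")]

# The same key always maps to the same partition, which is why
# ordering is preserved per key within a partition.
partition = choose_partition(b"user-1", 3)
```

Because every write for a given key lands in the same partition, the latest record for that key is well defined, which is what lets a compacted topic act as a current snapshot in CDC use cases.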