This Kafka Connect connector watches a directory and reads data as new files are written to the input directory. Each record in an input file is converted based on the user-supplied schema. The connectors in this project cover a range of use cases, including ingesting JSON, CSV, TSV, Avro, or binary files.
Running these connectors with multiple tasks requires a volume shared across all of the Kafka Connect workers. Kafka Connect has no mechanism for synchronizing tasks, so each task selects the files it will process using the following algorithm: hash(<filename>) % totalTasks == taskNumber. If you are not using a shared volume, files may go unprocessed. Using more than one task can also affect the order in which data is written to Kafka.
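The selection rule above can be illustrated with a minimal Python sketch. It is not the connector's actual code: the function names are hypothetical, and CRC32 is only a stand-in for the connector's hash function, which is not specified here. What matters is that the hash is deterministic, so every task reaches the same answer for a given filename without talking to the other tasks.

```python
import zlib


def assigned_task(filename: str, total_tasks: int) -> int:
    """Return the number of the task responsible for this filename."""
    # Assumption: CRC32 stands in for the connector's unspecified hash function.
    return zlib.crc32(filename.encode("utf-8")) % total_tasks


def files_for_task(filenames, task_number: int, total_tasks: int):
    """Return the files a given task would pick up from a directory listing."""
    return [
        name for name in filenames
        if assigned_task(name, total_tasks) == task_number
    ]


if __name__ == "__main__":
    listing = ["orders-1.csv", "orders-2.csv", "orders-3.csv", "orders-4.csv"]
    for task in range(3):
        print(task, files_for_task(listing, task, 3))
```

Each file lands on exactly one task, but only a task that can see the file will ever read it. That is why the input directory has to be on a volume visible to every worker.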
Each of the connectors in this plugin emits the following headers for each record written to Kafka.
- file.path - The absolute path to the file ingested.
- file.name - The name part of the file ingested.
- file.name.without.extension - The name of the file without its extension.
- file.last.modified - The last modified date of the file.
- file.length - The size of the file in bytes.
- file.offset - The offset for this piece of data within the file.
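To make the meaning of each header concrete, here is a rough Python sketch that derives comparable values for a file on disk. The function name describe_file is hypothetical, and the value formats are assumptions: the connector may encode the last modified date differently, and it may treat multi-part extensions such as .tar.gz differently from Python's Path.stem.

```python
import tempfile
from pathlib import Path


def describe_file(path, offset):
    """Build a dict keyed by the header names listed above."""
    p = Path(path).resolve()
    stat = p.stat()
    return {
        "file.path": str(p),                   # absolute path
        "file.name": p.name,                   # e.g. "orders.csv"
        "file.name.without.extension": p.stem, # e.g. "orders"
        # Assumption: seconds since the epoch; the connector's format may differ.
        "file.last.modified": stat.st_mtime,
        "file.length": stat.st_size,           # size in bytes
        "file.offset": offset,                 # supplied by the caller per record
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "orders.csv"
        sample.write_bytes(b"id,qty\n1,2\n")
        for key, value in describe_file(sample, 1).items():
            print(key, "=", value)
```

All headers except file.offset are the same for every record read from a given file. file.offset is the only one that changes from record to record.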
The preferred method of installation is to use the Confluent Hub Client:
confluent-hub install jcustenborder/kafka-connect-spooldir:latest
To install the connector manually instead:
- Compile the source code with mvn clean package.
- Create a subdirectory called kafka-connect-spooldir under the plugin.path on your connect worker.
- Extract the contents of the zip file from target/components/packages/ to the directory you created in the previous step.
- Restart the connect worker.