Deep Dive Into Kafka HDFS Connect

In a previous article, I wrote about Kafka Connect. Today, I'm going to get into the details of a Kafka Connect sink connector called Kafka HDFS Connect, which usually comes pre-installed in the Confluent distribution of Kafka. If it isn't, it can be easily installed from Confluent Hub by running the following command from the command line:

confluent-hub install confluentinc/kafka-connect-hdfs:latest

You can list all the installed connectors with:

confluent list connectors

As I said before, setting up a connector only involves writing a properties file and loading it into Kafka Connect (loading is shown after the example file below). The full list of properties available for Kafka HDFS Connect is in the Confluent documentation.

Below is the properties file I wrote to export JSON data from a Kafka topic to a Kerberos-secured, high-availability Hadoop cluster:

name=hdfs-sinkpageviews
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=pageviewsjson
hdfs.url=hdfs://nameservice1
# number of messages from Kafka to accumulate
# before writing a file to HDFS
flush.size=3
# for HA HDFS. Needs path to hadoop conf directory
hadoop.conf.dir=/confluent-5.2.1/config/hadoop-conf

# for secured hdfs
hdfs.authentication.kerberos=true
# in my case, _HOST resolved to nothing, so the
# principal was just kerberosuser@REALM.COM
connect.hdfs.principal=kerberosuser/_HOST@REALM.COM
connect.hdfs.keytab=/path/to/keytabs/kerberosuser.keytab
hdfs.namenode.principal=hdfs/<HOST URL OF HDFS USER>@REALM.COM

# where to write files
topics.dir=/user/kerberosuser/topics
logs.dir=/user/kerberosuser/logs
format.class=io.confluent.connect.hdfs.json.JsonFormat

# worker config
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
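
Once the file is saved, the connector can be loaded and checked. The commands below are a minimal sketch, assuming the file is saved as hdfs-sink-pageviews.properties, the Connect worker's REST interface is on localhost:8083, and you're on an older Confluent CLI where confluent load is available (newer versions use confluent local load):

# load the connector from the properties file
confluent load hdfs-sinkpageviews -d hdfs-sink-pageviews.properties

# check connector and task status via the Connect REST API
curl http://localhost:8083/connectors/hdfs-sinkpageviews/status

# verify that files are landing under topics.dir/<topic> in HDFS
hdfs dfs -ls /user/kerberosuser/topics/pageviewsjson

If the status shows the task as FAILED, the Connect worker log is the place to look; Kerberos and HDFS HA misconfigurations typically surface there rather than at load time.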
