Previously, I wrote about Kafka Connect. Today, I'm going to get into the details of one particular connector, Kafka HDFS Connect, which usually comes pre-installed in the Confluent distribution of Kafka. If it is not, it can be installed from Confluent Hub by running the following command:
confluent-hub install confluentinc/kafka-connect-hdfs:latest
You can list all the installed connectors with:
confluent list connectors
As I said before, setting up a connector only involves writing a properties file and loading it into Kafka Connect. The properties available for Kafka HDFS Connect are listed in the connector's configuration documentation.
Below is a properties file I wrote that exports JSON data from a Kafka topic to a Kerberos-secured, high-availability Hadoop cluster:
name=hdfs-sinkpageviews
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=pageviewsjson
hdfs.url=hdfs://nameservice1

# this determines after how many messages in Kafka
# to write to a file in HDFS
flush.size=3

# for HA HDFS. Needs path to hadoop conf directory
hadoop.conf.dir=/confluent-5.2.1/config/hadoop-conf

# for secured hdfs
hdfs.authentication.kerberos=true
# in my case, _HOST was nothing. so it was just
# kerberosuser@REALM.COM
connect.hdfs.principal=kerberosuser/_HOST@REALM.COM
connect.hdfs.keytab=/path/to/keytabs/kerberosuser.keytab
hdfs.namenode.principal=hdfs/<HOST URL OF HDFS USER>@REALM.COM

# where to write files
topics.dir=/user/kerberosuser/topics
logs.dir=/user/kerberosuser/logs
format.class=io.confluent.connect.hdfs.json.JsonFormat

# worker config
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
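If your Connect worker runs in distributed mode, connectors are usually created through the Kafka Connect REST API rather than from a properties file, and the API expects a JSON body of the form {"name": ..., "config": {...}}. Here is a minimal sketch in Python of how a properties file like the one above could be converted into that body. The helper names parse_properties and to_connect_payload are my own, not part of any Kafka library, and the sample text is a shortened copy of the file above.

```python
import json


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments.

    This is a deliberately small parser (hypothetical helper): it does not
    handle line continuations or escapes from the full .properties format.
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def to_connect_payload(config):
    """Build the JSON body for creating a connector: name plus config map."""
    config = dict(config)          # copy so the caller's dict is untouched
    name = config.pop("name")      # the connector name moves to the top level
    return {"name": name, "config": config}


# Shortened copy of the properties file above, for illustration only.
example = """\
name=hdfs-sinkpageviews
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=pageviewsjson
hdfs.url=hdfs://nameservice1
# this determines after how many messages in Kafka
# to write to a file in HDFS
flush.size=3
"""

payload = to_connect_payload(parse_properties(example))
print(json.dumps(payload, indent=2))
```

The resulting JSON can then be sent with a POST request to the /connectors endpoint of the worker, with the Content-Type header set to application/json. I'm assuming the worker's REST interface is reachable at its default port, 8083; adjust the host and port for your setup.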