Deep Dive Into Kafka HDFS Connect

In a previous article, I wrote about Kafka Connect. Today, I'm going to get into the details of one particular connector, Kafka HDFS Connect, which usually comes pre-installed in the Confluent distribution of Kafka. If it isn't, you can easily install it from Confluent Hub by running the following command from the command line:

confluent-hub install confluentinc/kafka-connect-hdfs:latest

You can list all the installed connectors with:

confluent list connectors

As I said before, setting up a connector only involves writing a properties file and loading it into Kafka Connect. The properties available for Kafka HDFS Connect are listed in the connector's configuration documentation.

Below is the properties file I wrote to export JSON data from a Kafka topic to a Kerberos-secured, high-availability Hadoop cluster:

# this determines after how many messages in Kafka 
# to write to a file in HDFS
# for HA HDFS. Needs path to hadoop conf directory

# for secured hdfs
# in my case, _HOST was nothing. so it was just
# kerberosuser@REALM.COM
hdfs.namenode.principal=hdfs/<HOST URL OF HDFS USER>@REALM.COM

# where to write files

# worker config
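
The listing above keeps mostly the comments, so here is a hedged sketch of what a file along those lines might contain. It is not the original file: the property names are the ones I believe the HDFS sink connector documents (flush.size, hadoop.conf.dir, and so on), and every value (topic name, paths, principal, keytab) is a made-up placeholder. Rendering it from Python just makes the result easy to check.

```python
# Hypothetical values throughout. Only the property names are meant to
# match the HDFS sink connector's documented configuration; verify them
# against the documentation for your connector version.
HDFS_SINK_PROPS = {
    "name": "hdfs-sink",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "mytopic5",
    "format.class": "io.confluent.connect.hdfs.json.JsonFormat",
    # after how many messages to write a file in HDFS
    "flush.size": "3",
    # for HA HDFS: path to the Hadoop conf directory (placeholder path)
    "hdfs.url": "hdfs://nameservice1",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    # for secured HDFS (placeholder principal and keytab)
    "hdfs.authentication.kerberos.enable": "true",
    "connect.hdfs.principal": "kerberosuser@REALM.COM",
    "connect.hdfs.keytab": "/path/to/kerberosuser.keytab",
    "hdfs.namenode.principal": "hdfs/<HOST URL OF HDFS USER>@REALM.COM",
    # where to write files (placeholder directories)
    "topics.dir": "/user/kerberosuser/topics",
    "logs.dir": "/user/kerberosuser/logs",
    # converter settings for schemaless JSON
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
}


def render_properties(props):
    """Render a dict as key=value lines, one property per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print(render_properties(HDFS_SINK_PROPS), end="")
```

Writing the output to a file such as hdfs-sink.properties (a hypothetical name) gives you something you can adapt and then load as a connector.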

Getting Started With Apache Kafka

Apache Kafka is an open-source framework that allows you to develop real-time applications. In this article, I will jot down some points that may save you some time and frustration if you're just learning Apache Kafka. First of all, to set up a development Kafka environment, you will save yourself a lot of hassle by using the Confluent distribution of Kafka rather than the plain Apache version. You can download the Confluent Platform from the Confluent website. The Confluent distribution of Kafka comes with the Confluent CLI, which streamlines the administrative work of setting up a Kafka server. It also comes with a Control Center that lets you monitor the Kafka server from a browser. The coolest thing is that all the enterprise-level features are available for free on a single-broker Kafka server.

How to quickly generate data to your Kafka Topic

Although you can use the Kafka Connect Datagen connector to generate data, I find the command line the quickest way to get data into your Kafka topics. All you need to do is run the following command and boom, you have data in your Kafka topic.

ksql-datagen quickstart=users format=avro topic=topic3 maxInterval=100

You can also generate data in JSON format by passing json as the value of the format parameter, or generate a different kind of data by passing pageviews as the value of the quickstart parameter.
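
If you find yourself switching between these variants a lot, a tiny helper can build the command for you. This is only a sketch: the function name and the pageviews_json topic are my own inventions, and it only uses the four parameters shown in the command above.

```python
import shlex


def datagen_command(quickstart="users", fmt="avro", topic="topic3", max_interval=100):
    """Build the ksql-datagen argument list for the given settings."""
    return [
        "ksql-datagen",
        f"quickstart={quickstart}",
        f"format={fmt}",
        f"topic={topic}",
        f"maxInterval={max_interval}",
    ]


if __name__ == "__main__":
    # The command from the article
    print(shlex.join(datagen_command()))
    # JSON pageviews into a hypothetical topic
    print(shlex.join(datagen_command("pageviews", "json", "pageviews_json")))
```

The returned list can be passed straight to subprocess.run if you want to launch the generator from a script.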

Understanding Kafka Connect

Kafka Connect is part of Apache Kafka®, providing streaming integration between data stores and Kafka. The figure below shows architectural diagrams for importing data from MySQL to Kafka and from Kafka to HDFS.

A converter sits right where data enters or leaves Kafka. When data is ingested into Kafka, the converter serializes it from its specific format to byte[]. When data is exported out of Kafka, the converter turns the byte[] back into the format the data had before it was imported. Common converters include:

  • io.confluent.connect.avro.AvroConverter
  • org.apache.kafka.connect.json.JsonConverter
  • org.apache.kafka.connect.converters.ByteArrayConverter
  • com.blueapron.connect.protobuf.ProtobufConverter
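
To make the round trip concrete, here is a conceptual sketch in Python. It is not how Kafka Connect implements converters (the real ones are Java classes), and the record fields are made up; it only shows the idea of turning a structured record into bytes on the way in and getting the same record back on the way out.

```python
import json


def to_bytes(record):
    """Serialize a record to UTF-8 JSON bytes (the 'into Kafka' direction)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def from_bytes(raw):
    """Deserialize UTF-8 JSON bytes back to a record (the 'out of Kafka' direction)."""
    return json.loads(raw.decode("utf-8"))


# Hypothetical record, for illustration only
record = {"userid": "User_1", "regionid": "Region_5"}

raw = to_bytes(record)
restored = from_bytes(raw)

print(type(raw).__name__)
print(restored == record)
```

The same converter has to be used on both sides: bytes written with one format generally cannot be read back with a converter for another.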

Converters are set with the value.converter parameter in the properties file. When consuming JSON data from Kafka that doesn't have a schema attached to it, you should tell Kafka Connect not to look for a schema by setting value.converter.schemas.enable=false in the properties file.
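
As far as I know, when schemas are enabled the JsonConverter expects each message to be a JSON object with exactly two fields, schema and payload. The sketch below checks a message for that envelope so you can decide which setting fits your data. The helper names and the sample messages are hypothetical.

```python
import json


def has_schema_envelope(raw: bytes) -> bool:
    """Return True if the message is a JSON object with only 'schema' and 'payload'."""
    doc = json.loads(raw.decode("utf-8"))
    return isinstance(doc, dict) and set(doc) == {"schema", "payload"}


def suggested_schemas_enable(raw: bytes) -> str:
    """Suggest a value for value.converter.schemas.enable based on one sample message."""
    return "true" if has_schema_envelope(raw) else "false"


# Plain JSON, no schema attached (hypothetical sample)
plain = b'{"userid": "User_1"}'

# The same record wrapped in a schema/payload envelope (hypothetical sample)
wrapped = json.dumps({
    "schema": {
        "type": "struct",
        "optional": False,
        "fields": [{"field": "userid", "type": "string", "optional": False}],
    },
    "payload": {"userid": "User_1"},
}).encode("utf-8")

print(suggested_schemas_enable(plain))
print(suggested_schemas_enable(wrapped))
```

One sample message is only a hint, of course. If a topic mixes both shapes, no single setting will work for all of it.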

To see what parameters are available for Kafka Connect, check the Kafka Connect configuration documentation.

Commonly used Confluent CLI commands

# start all servers for Kafka
confluent start

# check the log of a particular service such as Kafka Connect
confluent log connect

# consume from a topic from the command line
confluent consume mytopic5 --from-beginning

# check installed connectors
confluent list connectors

# unload a connector such as hdfs-sink
confluent unload hdfs-sink

# load a connector
confluent load <name of connector> -d <path to properties file>

# check status of loaded connectors
confluent status connectors
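
Kafka Connect also has a REST interface, and to my knowledge creating a connector there means sending a JSON body with a name and a config object instead of a properties file. Here is a small sketch that converts one form into the other. The parser only handles simple key=value lines and # comments, and the sample settings are placeholders.

```python
import json


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def to_connect_payload(props, name=None):
    """Build a {"name": ..., "config": {...}} body from connector properties."""
    config = dict(props)
    connector_name = name or config.pop("name")
    config.pop("name", None)
    return {"name": connector_name, "config": config}


# Hypothetical connector settings, for illustration only
SAMPLE = """
# example settings
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=mytopic5
flush.size=3
"""

payload = to_connect_payload(parse_properties(SAMPLE))
print(json.dumps(payload, indent=2))
```

The printed JSON is the kind of body you would POST to the /connectors endpoint of a Connect worker. Check the REST API documentation for your version before relying on it.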

Application Development

So far, I have talked about the administrative aspects of Kafka. But to develop Kafka applications, you need a good understanding of producers, consumers, and Kafka Streams. I recommend two books, Kafka: The Definitive Guide and Kafka Streams in Action, which will get you going on your application development. Additionally, this webpage from Confluent is a good resource to go through.