Deep Dive Into HDFS Kafka Connect
By Diwanshu Shekhar
In a previous article, I wrote about Kafka Connect. Today, I’m going to get into the details of a particular connector, Kafka Connect HDFS (the HDFS sink connector), which usually comes pre-installed in the Confluent distribution of Kafka. If not, it can easily be installed from Confluent Hub by running the following command from the command line:
confluent-hub install confluentinc/kafka-connect-hdfs:latest
You can list all the installed connectors with:
confluent list connectors
As I said before, setting up a connector only involves writing a properties file and loading it into Kafka Connect (I’ll show the load command right after the properties file below). The properties available for the HDFS connector are documented here.
Below is a properties file I wrote that exports JSON data from a Kafka topic to a Kerberos-secured, high-availability Hadoop cluster:
name=hdfs-sinkpageviews
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=pageviewsjson
hdfs.url=hdfs://nameservice1
# number of records to accumulate from the topic
# before writing a file to HDFS
flush.size=3
# for HA HDFS; needs the path to the Hadoop conf directory
# (the one containing core-site.xml and hdfs-site.xml)
hadoop.conf.dir=/confluent-5.2.1/config/hadoop-conf
# for secured hdfs
hdfs.authentication.kerberos=true
# in my case, _HOST resolved to nothing, so the principal
# was effectively kerberosuser@REALM.COM
connect.hdfs.principal=kerberosuser/_HOST@REALM.COM
connect.hdfs.keytab=/path/to/keytabs/kerberosuser.keytab
hdfs.namenode.principal=hdfs/<NAMENODE HOST>@REALM.COM
# where to write files
topics.dir=/user/kerberosuser/topics
logs.dir=/user/kerberosuser/logs
format.class=io.confluent.connect.hdfs.json.JsonFormat
# worker config
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
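
Once the properties file is saved (as, say, hdfs-sinkpageviews.properties, a file name I’m using here just for illustration), loading it is a one-liner. With the same Confluent 5.x development CLI used above, the commands look roughly like this (the exact subcommands differ in newer versions of the CLI):

# load the connector from the properties file
confluent load hdfs-sinkpageviews -d hdfs-sinkpageviews.properties
# check that the connector and its task are running
confluent status hdfs-sinkpageviews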
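
To confirm the Kerberos setup and see the files the connector writes, I authenticate with the same keytab and list the topic directory in HDFS. The path below follows from topics.dir and the topic name; with the default partitioner the connector writes files under partition=<n> subdirectories, though the exact layout depends on the partitioner you configure:

# authenticate as the connector's principal using its keytab
kinit -kt /path/to/keytabs/kerberosuser.keytab kerberosuser@REALM.COM
# list everything written for the pageviewsjson topic
hdfs dfs -ls -R /user/kerberosuser/topics/pageviewsjson

With flush.size=3, a new file should show up roughly every three records per topic partition.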