Posts
Getting Started With Kafka
Apache Kafka is an open-source distributed event streaming platform for building real-time applications. In this article, I will jot down some points that may save you time and frustration if you’re just learning about Apache Kafka. First of all, to set up a development Kafka environment, it will save you a lot of hassle to use the Confluent distribution of Kafka rather than the plain Apache version. Download the Confluent Platform from https://docs.
Comparing SQL, Pandas and Spark
Most of us are familiar with writing database queries in SQL. But there are other ways to query your data, whether it lives in a database or in a file. Two of them are the Python package Pandas and Apache Spark, both of which are very popular in the Data Science field these days. If your data fits in memory on a single computer, I’d suggest using Pandas.
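To make the comparison concrete, here is a minimal sketch of the same aggregation written once in SQL and once in Pandas. The `orders` table, its columns, and the sample rows are hypothetical and only serve as an illustration; an in-memory SQLite database stands in for a real one. Spark is left out because it needs a separate installation.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data: (customer, category, amount)
rows = [
    ("alice", "books", 12.0),
    ("bob", "books", 8.0),
    ("alice", "games", 30.0),
]

# An in-memory SQLite database stands in for a real database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# SQL: total amount spent per customer.
sql_totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
)

# Pandas: load the table into a DataFrame, then express the same
# aggregation with groupby.
df = pd.read_sql_query("SELECT * FROM orders", conn)
pandas_totals = df.groupby("customer")["amount"].sum().to_dict()

conn.close()

print(sql_totals)
print(pandas_totals)
```

Both approaches should produce the same per-customer totals; the Pandas version pulls the whole table into memory first, which is why it suits data that fits on one machine.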
Deep Dive Into HDFS Kafka Connect
In a previous article, I wrote about Kafka Connect. Today, I’m going to get into the details of one Kafka Connect connector, Kafka HDFS Connect, which usually comes pre-installed in the Confluent distribution of Kafka. If it is not, it can easily be installed from the Confluent Hub by running the following command from the command line:
confluent-hub install confluentinc/kafka-connect-hdfs:latest
You can check all the connectors that are installed by: