The titles of the last post, “Flafka: Big Data Solution for Data Silos,” and now “The Flip of Flafka” are beginning to sound like an intentional tongue twister, but that is not the case. These tools are so tightly aligned and integrated that it only makes sense that their names would run together. As discussed in the last post, Flafka is the unofficial name given to running Kafka as a Flume sink. You will recall that Kafka has 1) Producers, 2) Consumers, 3) Connectors, and 4) Stream Processors, and that Flume runs an Agent consisting of 1) Sources, 2) Channels, and 3) Sinks. Both of these concepts are depicted in Figure 1 below:
In this post, we will focus on running Flume as a consumer of Kafka, using the Hadoop Distributed File System (HDFS) as the Flume sink (just as depicted in Figure 1 above). Last time we discussed the many advantages of this framework, chief among them its flexibility and its power to ingest multiple disparate data sources while serving multiple consumers at the same time. You can look at this framework as a series of building blocks that fit nicely with one another. You can mix and match as needed to build a data pipeline that matches your requirements; there is no need to bend business processes and requirements to fit an out-of-the-box, one-size-fits-all vendor solution. And the best part is that this is not much more complicated than my grandsons’ Lego toys; well, maybe a little, but not much.
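To make the building-block idea concrete, here is a minimal sketch of what a Flume agent wired this way might look like. The agent name, topic, addresses, and HDFS path are placeholders assumed for illustration (and the Kafka source property names differ between Flume versions), not values from an actual deployment.

```properties
# Hypothetical agent "agent1": Kafka source -> memory channel -> HDFS sink
agent1.sources  = kafka-source
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# Kafka source: Flume consumes the topic that the producer writes to
# (property names shown follow the Flume 1.6-style KafkaSource)
agent1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-source.zookeeperConnect = localhost:2181
agent1.sources.kafka-source.topic = tweets
agent1.sources.kafka-source.channels = mem-channel

# Simple in-memory channel between the source and the sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# HDFS sink: write the events out where other applications can consume them
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-channel
agent1.sinks.hdfs-sink.hdfs.path = /user/flume/tweets
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
```

Swapping the HDFS sink for, say, a Solr or HBase sink is just a matter of changing the sink block, which is exactly the mix-and-match quality described above.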
Instead of just talking about the concepts, I am trying a new approach and demonstrating them. Due to time constraints, the demonstration will not be terribly sophisticated, but keep in mind that this data flow process was assembled by one person in a few hours. There will be no Morphline data transformations during the Flume ingestion, and no Spark analysis of the streaming data from Kafka. However, I will demonstrate how a Python script of approximately 50 lines of code can act as a Kafka producer, streaming a live Twitter feed while filtering on keywords and parsing out key attributes from the Twitter API. Flume, acting as a Kafka consumer, will then pick this data up through its Kafka source and pass it into HDFS, where it will be available for consumption by other applications. Again, not extremely sophisticated, but hopefully it will demonstrate clearly enough the potential of this framework of technologies.
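To give a feel for how small that producer can be, here is a minimal sketch of a Kafka producer that filters a live Twitter stream on keywords and publishes a few parsed attributes. It is not the exact script used in the demo; it assumes the kafka-python client and the tweepy 3.x streaming API, and the topic name, keywords, and credentials are placeholders.

```python
import json

import tweepy
from kafka import KafkaProducer

# Hypothetical connection details -- adjust for your environment.
KAFKA_TOPIC = 'tweets'
BOOTSTRAP_SERVERS = 'localhost:9092'
KEYWORDS = ['hadoop', 'kafka', 'flume']

# JSON-serialize each record before it is sent to the Kafka topic.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))


class TweetListener(tweepy.StreamListener):
    """Parse key attributes from each tweet and publish them to Kafka."""

    def on_status(self, status):
        record = {
            'id': status.id_str,
            'created_at': str(status.created_at),
            'user': status.user.screen_name,
            'text': status.text,
        }
        producer.send(KAFKA_TOPIC, record)

    def on_error(self, status_code):
        # Returning False disconnects the stream (e.g. when rate limited).
        return False


# Placeholder Twitter API credentials.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

# Open the streaming connection and filter the live feed on the keywords.
stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=KEYWORDS)
```

Once tweets land on the topic, the Flume agent sketched above can pick them up and write them into HDFS without any change to the producer.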
This was my first shot at making my own technical video; hopefully it added to the discussion rather than being a distraction.