Building a Real-Time Data Pipeline
In today’s fast-paced IT world, data is generated by almost every entity: machines, smart apps, legacy systems, and websites of every kind. It has become essential for organizations to extract meaning from this mammoth volume of data, one of the most critical factors in any business’s progress. For this data to support business decisions, organizations must be able to analyze it. Over the years, organizations, with the help of expert vendors, have tried to bridge the gap from raw data to meaningful information using different data warehousing techniques, providing various solutions that empower business decision making. Dedicated MIS departments and data scientists are tasked with digging into the data and analyzing it with different techniques. With bar charts, pie charts, line graphs, area plots, and the like, business managers are much happier looking at their data than at plain text and numbers!
As the data warehousing tools on the market have matured, allowing organizations to analyze data ever more sophisticatedly, some organizations want to stand one step ahead. Traditional warehousing typically shows the real picture of your data only after a lag of a few hours, or perhaps a day. Increasingly, people want ways to analyze their data as it arrives. Knowing exactly what is happening in real time makes organizations even more powerful: they can make better decisions quickly, gaining an advantage over their competitors. They can act promptly, identify risks, and track trends far faster and more effectively.
There is currently a lot of research into making traditional warehousing systems provide real-time analysis by avoiding staging areas, batch ETL, and so on. I will not discuss real-time data warehousing in this article; instead, I will sketch how the same real-time analytics can be delivered using other available technologies.
After some research in the big data field and hands-on work with several of the available tools, I will explain how we can establish a Real-Time Data Pipeline (RTDP) using the following tool combinations.
One approach is to connect Apache Kafka, Hive, and Spark end-to-end, from data sources to analytical dashboards. Below is a brief introduction to each of the three components:
Kafka is a message broker that sits between systems. It is based on a publish/subscribe model and streams data like a messaging system. It organizes data into topics: for example, we can create a topic named weather_data and continuously feed it from log files or perhaps an RDBMS. Topics are further divided into partitions.
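To make the topic/partition idea concrete, here is a minimal Python sketch of how a producer maps message keys to partitions. Kafka’s real default partitioner hashes keys with murmur2; CRC32 below is a simplified stand-in, and the weather_data topic and station keys are just the illustrative names from above.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Simplified stand-in for Kafka's default partitioner (which uses
    murmur2, not CRC32). The point it illustrates: the same key always
    hashes to the same partition, so per-key ordering is preserved.
    """
    return zlib.crc32(key) % num_partitions

# Hypothetical weather_data topic with 3 partitions: readings from the
# same station always land on the same partition.
num_partitions = 3
for station in [b"station-1", b"station-2", b"station-1"]:
    print(station.decode(), "->", choose_partition(station, num_partitions))
```

Because partitioning is deterministic per key, consumers reading a single partition see all of one station’s readings in order.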
Hive provides a SQL-like interface on top of Hadoop HDFS. Data stored in HDFS can be queried using Hive. It is not especially efficient for ad hoc and analytical queries, but some tuning can be done to get the best performance out of it.
Spark is a general-purpose cluster computing engine. By connecting to Hive and querying the data with considerably better performance, Spark can serve as the final analytical layer, doing all the ETL in memory and producing results for dashboards and reports.
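As a sketch of that analytical layer (not runnable standalone: it assumes a Spark deployment with Hive support configured, and the weather_readings table name is hypothetical), a PySpark job might query Hive and aggregate in memory like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Requires Spark with access to the Hive metastore; table name is a
# hypothetical example.
spark = (SparkSession.builder
         .appName("rtdp-analytics")
         .enableHiveSupport()
         .getOrCreate())

# Query Hive-managed data through Spark's SQL engine...
readings = spark.sql("SELECT station, temperature FROM weather_readings")

# ...and do the aggregation (the in-memory "ETL") in Spark.
avg_temp = (readings.groupBy("station")
            .agg(F.avg("temperature").alias("avg_temp")))
avg_temp.show()
```

The resulting DataFrame can then be written wherever the dashboard layer reads from.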
Another viable, if more tedious, option is to build a pipeline from the Apache ecosystem: mainly Oozie, along with Sqoop for RDBMS sources, Flume for log files, and so on. Oozie helps define the workflow jobs responsible for data transformations as well as loading into the target data store.
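For the Sqoop leg of such a pipeline, a typical import command looks like the following sketch; the host, database, table, and HDFS paths are all placeholders.

```shell
# Pull an RDBMS table into HDFS (placeholder connection details);
# -P prompts for the password, --num-mappers controls parallelism.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```

An Oozie workflow would then schedule this import alongside the transformation and load steps.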
Kafka Connect is another tool worth looking at. It lets you develop your own connectors, both sources and sinks, for data transfer; either the source or the destination for Kafka Connect is always Kafka itself. It is built on the idea of using Kafka as middleware between different source and destination data stores.
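As a minimal concrete example, Kafka Connect ships with a file source connector; a standalone connector properties file along these lines streams the lines of a local file into a topic (the file path and topic name are placeholders):

```properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app/app.log
topic=weather_data
```

A matching sink connector on the other side would drain the topic into the destination store.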
Very recently, I used Structured Streaming, an API introduced in the recent releases of Apache Spark. It looks like a promising API, focused on continuous applications. Providing the power to build real-time data pipelines with built-in functions for filtering, mapping, sorting, and so on, it is a very good starting point for building your own real-time applications. It can handle late-arriving data and provides fault tolerance and data prefix integrity, just like a batch job. I will cover it in more detail in another article soon.
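A minimal Structured Streaming sketch tying back to the Kafka example might look like this. It is not runnable standalone: it assumes a Kafka broker at localhost:9092, the spark-sql-kafka connector package on Spark’s classpath, and the hypothetical weather_data topic from earlier.

```python
from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka package is available and a broker is
# reachable at localhost:9092.
spark = SparkSession.builder.appName("rtdp-streaming").getOrCreate()

# Treat the weather_data topic as an unbounded, continuously growing table.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "weather_data")
          .load())

# Kafka delivers values as bytes; cast to string and print each
# micro-batch to the console for demonstration.
query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```

In a real pipeline the console sink would be replaced by a table, file, or dashboard-facing store.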
Having highlighted a few options, let’s look at some of the benefits these technologies offer.
1) Economy: these tools can be deployed and run on commodity hardware, giving organizations the flexibility to invest far less in infrastructure than other solutions require.
2) The ability to build a total data warehouse: one that encompasses all forms of data, whether structured, semi-structured, or unstructured. As the variety of data grows, these technologies can help organizations build a centralized data source capable of handling the new forms of data.
3) Most importantly, the ease with which real-time analytics becomes possible!
Of course, along with the advantages there are a few challenges. The shortage of expert resources in this field is one of them, but in my opinion it is manageable: plenty of material is available online, including webinars, research papers, and training courses, both paid and free.
There are many ways to build a real-time data pipeline. Some opt for NoSQL or perhaps HDFS as the output; others go for a traditional RDBMS. The actual implementation will depend heavily on the production workload and on each organization’s analytics requirements.
I have tried to cover, at a high level, options that can be used to build RTDP solutions. Anyone with different experience is welcome to share their thoughts here.