Big Data Analytics
Around 15 to 20 years ago, organizations were preoccupied with automating their business processes. Gradually, after considerable effort, they managed to make that a reality. Now, with roughly two decades of mature automation behind them, they hold an enormous volume of historical data collected over the years through various applications. The new worry is how to analyze this data for better decision making. I have come across a lot of discussions lately about BIG DATA.
In this article, I will try to answer the following questions:
- What actually is big data?
- How are organizations trying to utilize the full potential of big data?
- What technologies are available?
Big data has no conventional definition, but the term usually refers to data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time (https://en.wikipedia.org/wiki/Big_data).
We observe this big data growing at a fast pace every day, thanks to social media, online user activity, sensor data, and so on, and the challenges grow just as naturally with it.
The velocity, variety, and volume (the 3Vs) of big data create a unique kind of problem. Digging into the data, visualizing it, and producing meaning out of it has become a major challenge across organizations (many of which are still unaware of it!).
Thankfully, a few technologies have made this problem easier to handle, in the sense that technologists, working alongside data experts, can now dig into the data and extract real meaning from it. Some have even gone as far as predicting future events from it!
Below is a brief overview of two such technologies that are helping to solve the big data problem.
Hadoop is a platform for distributed, fault-tolerant storage and processing of large data sets spread across clusters of computers. These clusters are built from commodity hardware (an added benefit).
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
- Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications
- Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing (a minimal sketch of the model follows this list)
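To make the MapReduce model concrete, here is a minimal word-count sketch in Hadoop Streaming style; the file names mapper.py and reducer.py are my own hypothetical choices. The mapper emits (word, 1) pairs, and the reducer, whose input Hadoop delivers sorted by key, sums the counts per word.

#!/usr/bin/env python
# mapper.py - read lines from stdin, emit one "word<TAB>1" pair per word
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python
# reducer.py - all counts for a given word arrive together (sorted input),
# so we sum them up and emit the total when the word changes
import sys
current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(current_word + "\t" + str(total))

These two scripts would then be submitted with the Hadoop Streaming jar against input and output directories on HDFS.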
(Figure: a snapshot of Hadoop's internals.)
Apache Spark is a fast, general-purpose engine for large-scale data processing. It runs programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk (http://spark.apache.org/).
(Figure: comparison of logistic regression running times on Hadoop and Spark; http://spark.apache.org/)
Alongside its core, Spark comes bundled with a variety of technologies. You can think of it as a general-purpose processor that can run database queries and streaming jobs, plus a good deal of machine learning.
Its ease of use comes from the fact that it can be programmed in Java, Scala, Python, or even R, so you need not worry that the language of your choice is unavailable, provided you know any one of them.
On top of this, it can run on a variety of clusters, for example Hadoop, Mesos, standalone, or in the cloud. Likewise, it can access diverse data sources including HDFS, Cassandra, HBase, and S3.
(Figure: the different modules of Spark.)
Spark SQL helps in querying data through a SQL-like interface. It is powerful enough that even a JSON file can be queried with it, and data sources can be any RDBMS, a NoSQL store, Hive, etc.
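As a minimal sketch of this (assuming an existing SparkContext named sc and a hypothetical file people.json), a JSON file can be registered as a table and queried with plain SQL:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                # SQL layer on top of the SparkContext
df = sqlContext.read.json("people.json")   # schema is inferred from the JSON
df.registerTempTable("people")             # expose the DataFrame as a SQL table
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()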
Spark Streaming provides real-time event processing through its API. It can consume streams from text files, sockets, Twitter, Kafka, etc.
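Here is a minimal sketch (again assuming an existing SparkContext sc; the host and port are hypothetical) that counts words arriving on a socket in one-second micro-batches:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                    # 1-second batch interval
lines = ssc.socketTextStream("localhost", 9999)  # hypothetical stream source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's counts
ssc.start()
ssc.awaitTermination()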
With the rise of predictive analytics and data mining in the current technology sphere, Spark comes packaged with a really good machine learning library (MLlib). With the ability to quickly build algorithms such as neural networks or logistic regression, and to tune the resulting models in relatively little time, Spark has given data scientists a way to extract more meaning in much less time.
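A minimal sketch of training a logistic regression model with MLlib (assuming an existing SparkContext sc; the tiny inline dataset is made up purely for illustration, real data would come from HDFS or elsewhere):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Each LabeledPoint pairs a class label with a feature vector
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])
model = LogisticRegressionWithLBFGS.train(data)
print(model.predict([1.0, 0.0]))   # predicted class for a new point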
GraphX is a new component in Spark for graphs and graph-parallel computation.
Below is a small snippet for word count using Spark's Python API. Just a few lines and all the work is done. Similarly, many more analytical problems (ETL, etc.) can be solved with a small amount of knowledge and an understanding of the libraries.
# Load the input from HDFS (path elided)
text_file = sc.textFile("hdfs://...")
counts = (text_file
          .flatMap(lambda line: line.split())   # split each line into words
          .map(lambda word: (word, 1))          # pair each word with a 1
          .reduceByKey(lambda a, b: a + b))     # sum the counts per word
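To materialize the result, the counts RDD can be written back to HDFS (output path elided, like the input above) or a small sample pulled into the driver:

counts.saveAsTextFile("hdfs://...")   # writes one part-file per partition
print(counts.take(10))                # or inspect a small sample locally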
Ample choices are available in terms of technologies that can be used to tackle the same problem; everyone has to choose according to their expertise and the kind of problem at hand. As the technology grows, with new features and enhancements arriving quite frequently thanks to the open-source community, organizations now have the option to build a data pipeline and move further towards advanced analytics, both descriptive and predictive in nature. In my opinion, establishing a centralized data store using Hadoop HDFS could be one good option. Once data is combined in a single place, it is easier for organizations to manage and analyze it there rather than dealing with disparate sources individually. Spark's very fast cluster computing can then be used for analytics and further data processing on top of it.
Any comments and arguments are highly appreciated.