Big Data Analytics
Around 15-20 years ago, organizations were mainly concerned with automating their business processes. With considerable effort, they gradually made it a reality. Now that those automations have matured over close to two decades, they hold a huge amount of historical data collected through various applications. The new worry is how to analyze this data for better decision making. I have come across a lot of discussions lately about big data.
In this article, I will try to answer the following questions:
- What actually is big data?
- How are organizations trying to use the full potential of big data?
- What technologies are available?
There is no single agreed definition, but big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time (https://en.wikipedia.org/wiki/Big_data).
We see big data growing at a fast pace every day, thanks to social media, online user activity, sensor data and so on, and the challenges that come with it grow just as naturally.
The velocity, variety and volume (the 3Vs) of big data create a unique kind of problem. Digging into the data, visualizing it and producing meaning out of it has become a major challenge across organizations (most of which are unaware of it!).
Thankfully, a few technologies have made this problem a bit easier to handle, easier in the sense that technologists, along with data experts, can dig into the data and get real meaning out of it. Some have gone as far as predicting future events from it!
Below is a brief description of two such technologies that are helping to solve the big data problem.
Hadoop
Hadoop is a platform for distributed, fault-tolerant storage and processing of large data sets across clusters of computers. An added benefit is that these clusters are built from commodity hardware.
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
- Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications
- Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing
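To make the MapReduce programming model a little more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API: the function `map_reduce`, the sample `sales` records and the helper names are all hypothetical, chosen only to illustrate the three phases (map, shuffle, reduce) that the framework would run across a cluster.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Imitate the map, shuffle and reduce phases in a single process."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: "region,amount" sales records.
sales = ["north,120", "south,80", "north,30", "east,50", "south,20"]


def sales_mapper(record):
    region, amount = record.split(",")
    yield region, int(amount)


def sum_reducer(key, values):
    return sum(values)


totals = map_reduce(sales, sales_mapper, sum_reducer)
print(totals)  # {'north': 150, 'south': 100, 'east': 50}
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.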
[Figure: Hadoop's internals]
Spark
Apache Spark is a fast and general engine for large-scale data processing. According to the project site, it runs programs up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk (http://spark.apache.org/).
[Figure: Comparison of logistic regression on Hadoop and Spark (http://spark.apache.org/)]
Alongside its core, Spark comes with a bundle of libraries. You can think of it as a general-purpose engine that handles database-style queries, streaming jobs and a good deal of machine learning.
It is easy to use in part because it can be programmed in Java, Scala, Python or R, so if you know any one of them you do not have to learn a new language.
On top of this, it can run on a variety of clusters, for example Hadoop, Mesos, standalone, or in the cloud. Likewise, it can access diverse data sources including HDFS, Cassandra, HBase and S3.
[Figure: The different modules of Spark]
Spark SQL
Helps in querying data through a SQL-like interface. It is powerful enough that even a JSON file can be queried with it. Data sources can be an RDBMS, a NoSQL store, Hive and others.
Spark Streaming
Provides real-time event processing through its API. It can stream from text files, sockets, Twitter, Kafka and other sources.
MLlib
With predictive analytics and data mining on the rise, Spark comes packaged with a capable machine learning library. It lets you build quickly on algorithms such as neural networks and logistic regression and tune a model in relatively little time, so data scientists can extract more meaning in much less time.
GraphX
GraphX is a new component in Spark for graphs and graph-parallel computation.
Below is a small snippet for word count using the Spark Python API. Just a few lines and all the work is done. Similarly, many more analytical problems (ETL etc.) can be solved with a little knowledge and understanding of the libraries.
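from pyspark import SparkContext

sc = SparkContext(appName="WordCount")  # entry point to the cluster
text_file = sc.textFile("hdfs://…")
# split lines into words, pair each word with 1, then sum the 1s per word
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))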
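To see what each step of that chain contributes, the same three transformations can be traced on a small in-memory list with plain Python. This is only a local stand-in, not Spark: the sample `lines` and the helper `reduce_by_key` are hypothetical, and nothing here is distributed.

```python
# Hypothetical input standing in for the lines of a text file.
lines = ["big data is big", "spark is fast"]

# flatMap: one line in, many words out, flattened into a single list.
words = [word for line in lines for word in line.split()]

# map: pair every word with an initial count of 1.
pairs = [(word, 1) for word in words]


def reduce_by_key(func, key_value_pairs):
    """Merge the values of equal keys with func, like reduceByKey does."""
    merged = {}
    for key, value in key_value_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# reduceByKey: the same lambda as in the snippet adds up the 1s per word.
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)  # {'big': 2, 'data': 1, 'is': 2, 'spark': 1, 'fast': 1}
```

Spark applies the same logic, but splits the lines across the cluster and merges the partial counts from each machine.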
There are ample choices of technology for tackling the same problem, and everyone has to choose according to their expertise and the kind of problem at hand. The technology keeps growing, with new features and enhancements arriving frequently thanks to the open-source community, so organizations now have the option to build a data pipeline and move on to advanced analytics, both descriptive and predictive. In my opinion, establishing a centralized data store on Hadoop HDFS could be one good option. Once data is combined in a single place, it is easier for organizations to manage and analyze it from there than to deal with disparate sources individually. Spark, with its fast cluster computing, can then be used for analytics and further data processing.
Any comments and arguments are highly appreciated.