Hadoop Integration with Apache Ignite | Using Hive with IgniteMR.
Introduction:
In the world of big data, one of the biggest concerns is whether a machine or server can handle the massive amount of data to be processed, and how long that processing will take. Data engineers normally use Apache Hadoop and Apache Hive to store data and run queries on it, but complex query operations like aggregations and joins can take a long time: query performance degrades badly and the machine or server starts lagging. To overcome this problem, we use different query engines and big data tools to optimize Hadoop and its components.
As a big data enthusiast, I ran into the same problems while working with Hadoop and Hive. While searching for possible solutions, I found an incredible tool called Apache Ignite that optimizes Hadoop and Hive for better performance.
So in this post, I will demonstrate Apache Ignite and how to integrate it with Hadoop and Hive for better performance.
What is Apache Ignite?
Apache Ignite is a general-purpose, in-memory data fabric. Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability. Ignite supports any SQL-based RDBMS, NoSQL, Amazon S3, and Hadoop HDFS as optional data sources. It powers both existing and new applications in a distributed, massively parallel architecture on affordable, industry-standard hardware.
Basically, there are two main components of Ignite that are widely used with big data tools:
- Memory Centric Platform
- In-Memory Hadoop Accelerator
Since I am working on Hadoop, I will discuss the In-Memory Hadoop Accelerator's integration with Hadoop and Hive.
In-Memory Hadoop Accelerator:
In-Memory Hadoop Accelerator enhances existing Hadoop technology to enable fast data processing using the tools and technology your organization is already using today. The accelerator is based on the dual-mode, high-performance in-memory Ignite File System (IGFS), which is 100% compatible with Hadoop HDFS, and on an in-memory optimized MapReduce implementation. The Ignite Hadoop Accelerator provides a set of components allowing for in-memory Hadoop job execution and file system operations.
Ignite MapReduce:
Ignite has a separate map-reduce framework known as Ignite MR, which enhances MapReduce job performance. The Hadoop Accelerator ships with an alternate, high-performance implementation of the job tracker that replaces standard Hadoop MapReduce. Use it to boost the execution performance of your Hadoop MapReduce jobs.
IGFS In-Memory File System:
Ignite has its own in-memory file system known as IGFS. The Hadoop Accelerator ships with an implementation of the Hadoop File System that stores file system data in memory using the distributed Ignite File System (IGFS). Use it to minimize disk I/O and improve the performance of any file system operation. One of the greatest benefits of IGFS is that it does away with the Hadoop NameNode in the Hadoop deployment: it seamlessly uses Ignite's in-memory database under the hood to provide completely automatic scaling and failover without additional storage.
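Once a node with the accelerator is running, IGFS can be addressed like any other Hadoop file system through an igfs:// URI. A minimal sketch, assuming a local Ignite node with IGFS listening on the default endpoint port 10500 and a local file named data.txt (both are assumptions, not part of this guide's earlier steps):

```shell
# List the IGFS root; assumes an Ignite node with IGFS is running locally.
hadoop fs -ls igfs://igfs@localhost:10500/

# Copy a local file into IGFS and read it back.
hadoop fs -put ./data.txt igfs://igfs@localhost:10500/data.txt
hadoop fs -cat igfs://igfs@localhost:10500/data.txt
```

These commands require the core-site.xml mappings described later in this post, since Hadoop must know which class implements the igfs:// scheme.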
Secondary File System:
Secondary File System is an integral part of Ignite Hadoop Accelerator. Secondary file system can be injected into existing IGFS allowing for read-through and write-through behavior over any other Hadoop file system implementation (e.g. HDFS). We can use it if we want our IGFS to become an in-memory caching layer over disk-based HDFS or any other Hadoop-compliant file system.
Ignite Hadoop Accelerator Configuration with HDFS as Secondary File System:
Now for the most important part of this blog: I will demonstrate how to configure Apache Ignite over Hadoop while using HDFS as the secondary file system. Configuring Ignite over Hadoop is not very complex, so just follow these simple steps to configure it on your machine:
- Download the Ignite Hadoop Accelerator version 2.6 binary from the official Apache Ignite website. After downloading the zip file, unzip it to your desired location.
- Set the IGNITE_HOME variable to the directory where you unpacked the Apache Ignite Hadoop Accelerator. Note that you have to set IGNITE_HOME in your ~/.bashrc file so it persists across sessions.
- Ensure that the HADOOP_HOME environment variable is set and valid. This is required for Ignite to find the necessary Hadoop classes.
- Since we are caching data from HDFS, we need to set HDFS as the secondary file system. Open $IGNITE_HOME/config/default-config.xml, uncomment the secondaryFileSystem property, and set the correct HDFS URI:
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
  ...
  <property name="secondaryFileSystem">
    <bean class="org.apache.ignite.hadoop.fs.IgniteHadoopIgfsSecondaryFileSystem">
      <property name="fileSystemFactory">
        <bean class="org.apache.ignite.hadoop.fs.CachingHadoopFileSystemFactory">
          <property name="uri" value="hdfs://your_hdfs_host:9000/"/>
        </bean>
      </property>
    </bean>
  </property>
</bean>
5. Copy or symlink Ignite JARs to the Hadoop classpath. This is required to let Hadoop load Ignite classes at runtime.
cd $HADOOP_HOME/share/hadoop/common/lib
ln -s $IGNITE_HOME/libs/ignite-core-[version].jar
ln -s $IGNITE_HOME/libs/ignite-shmem-1.0.0.jar
ln -s $IGNITE_HOME/libs/ignite-hadoop/ignite-hadoop-[version].jar
6. Hadoop determines which file system and job tracker to use based on the configuration files core-site.xml and mapred-site.xml respectively. The recommended way to set up this configuration is to create a separate directory, copy the existing core-site.xml and mapred-site.xml files there, and then apply the necessary configuration changes. For example:
mkdir ~/ignite_conf
cd ~/ignite_conf
cp $HADOOP_HOME/etc/hadoop/core-site.xml .
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml .
7. Add the Ignite file system class name mappings to the core-site.xml in the directory you just created:
<configuration>
…
<property>
<name>fs.igfs.impl</name>
<value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.igfs.impl</name>
<value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
</property>
…
</configuration>
8. To use the Ignite Hadoop Accelerator for map-reduce jobs, point mapred-site.xml to the proper job tracker:
<configuration>
…
<property>
<name>mapreduce.framework.name</name>
<value>ignite</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>[your_host]:11211</value>
</property>
…
</configuration>
At this point the installation is finished and you can start running jobs or working with IGFS. Go to $IGNITE_HOME/bin and run "./ignite.sh".
Now test a map-reduce WordCount job on Ignite.
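A quick way to test the setup is to point the stock Hadoop WordCount example at the configuration directory created in step 6. This is only a sketch: the input file name, the localhost IGFS endpoint, and the examples JAR path are assumptions that may differ on your machine.

```shell
# Put a small input file into IGFS, using the Ignite-enabled config directory.
hadoop --config ~/ignite_conf fs -mkdir -p igfs://igfs@localhost:10500/input
hadoop --config ~/ignite_conf fs -put ./words.txt igfs://igfs@localhost:10500/input/

# Run WordCount; with mapreduce.framework.name=ignite, the job executes on Ignite MR.
hadoop --config ~/ignite_conf jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount igfs://igfs@localhost:10500/input igfs://igfs@localhost:10500/output

# Inspect the result.
hadoop --config ~/ignite_conf fs -cat igfs://igfs@localhost:10500/output/part-r-00000
</imports></imports>
```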
Ignite Clustering:
Ignite nodes can automatically discover each other. This helps to scale the cluster when needed, without having to restart the whole cluster. Developers can also leverage Ignite’s hybrid cloud support that allows establishing a connection between private and public clouds such as Amazon Web Services, providing them with the best of both worlds.
Apache Ignite's node discovery mechanism comes with two implementations intended for different usage scenarios:
- TCP/IP Discovery, designed and optimized for deployments of tens up to a few hundred nodes.
- ZooKeeper Discovery, which allows scaling Ignite clusters to hundreds and thousands of nodes while preserving linear scalability and performance.
Since I configured my Ignite cluster through TCP/IP Discovery, I will guide you through the TCP/IP Discovery configuration.
Ignite provides TcpDiscoverySpi as the default implementation of DiscoverySpi, which uses TCP/IP for node discovery. The discovery SPI can be configured for multicast-based or static-IP-based node discovery.
Multicast IP Finder:
I used the Multicast IP finder to discover other nodes in the grid. Here is an example of how to configure this finder in your default-config.xml file:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  ...
  <property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
      <property name="ipFinder">
        <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">
          <!-- Must be a valid multicast address, identical on every node. -->
          <property name="multicastGroup" value="228.10.10.157"/>
        </bean>
      </property>
    </bean>
  </property>
</bean>
I made a cluster of two nodes, so I configured the Multicast IP finder on both nodes. Note that the multicastGroup value is a shared multicast group address, not a peer's IP: both the master node's and the slave node's config files must specify the same group so that the nodes can discover each other.
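If multicast is blocked on your network, the same TcpDiscoverySpi can instead be configured with static IP discovery, which was mentioned above as the other option. A hedged sketch for default-config.xml, where the host names and the port range are placeholders you would replace with your own nodes' addresses:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
      <property name="ipFinder">
        <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
          <property name="addresses">
            <list>
              <!-- Replace with the addresses of your own nodes. -->
              <value>master_host:47500..47509</value>
              <value>slave_host:47500..47509</value>
            </list>
          </property>
        </bean>
      </property>
    </bean>
  </property>
</bean>
```

With static discovery, every node lists the addresses it should try to join, so the same list can simply be reused on both nodes of a two-node cluster.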
Cluster Groups with Node Attributes:
The unique characteristic of Ignite is that all grid nodes are equal. There are no master or server nodes, and there are no worker or client nodes either. All nodes are equal from Ignite’s point of view — however, users can configure nodes to be masters and workers, or clients and data nodes.
All cluster nodes, on startup, automatically register all the environment and system properties as node attributes. However, users can choose to assign their own node attributes through Ignite configuration:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  ...
  <property name="userAttributes">
    <map>
      <entry key="ROLE" value="worker"/>
    </map>
  </property>
  ...
</bean>
Clients and Servers:
Note that Ignite has an optional notion of client and server nodes. Server nodes participate in caching, compute execution, stream processing, etc., while native client nodes provide the ability to connect to the servers remotely. Ignite native clients allow using the whole set of Ignite APIs, including near caching, transactions, compute, streaming, services, etc., from the client side.
By default, all Ignite nodes are started as server nodes; client mode needs to be explicitly enabled.
You can configure a node to be either a client or a server in the default-config.xml file:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  ...
  <!-- Enable client mode. -->
  <property name="clientMode" value="true"/>
  ...
</bean>
Finally, you should now have a configured Ignite cluster. Run Ignite and check that it reports two server nodes, or one server and one client node, depending on how you configured each node.
Hive with Ignite MR:
If you have followed this guide correctly, the Ignite Hadoop Accelerator is now configured properly. Since we also want to optimize query performance in Hive, we will configure Hive with Apache Ignite so that Hive's map-reduce jobs run on Ignite's framework. Note that, in my personal experience, Ignite Hadoop Accelerator version 2.6 is compatible with Apache Hive version 2.1; when I tried integrating the Hadoop Accelerator with Hive version 3.1, I ran into a lot of errors while running Hive on Ignite.
- First of all, open Hadoop's default mapred-site.xml file. The default map-reduce framework there is "yarn"; replace "yarn" with "ignite" so that the Ignite MapReduce engine is used for MR jobs.
- Create a bash script named "hive-ig.sh" so that when the script runs, Hive uses Ignite's framework for its MapReduce jobs. To create the script:

nano hive-ig.sh (for Linux users)

# Specify Hive home directory:
export HIVE_HOME=<Hive installation directory>
# Specify configuration files location:
export HIVE_CONF_DIR=<Path to our configuration folder>
# If you did not set the hadoop executable in PATH, specify Hadoop home explicitly:
export HADOOP_HOME=<Hadoop installation folder>
# Avoid a problem with a different 'jline' library in Hadoop:
export HADOOP_USER_CLASSPATH_FIRST=true

${HIVE_HOME}/bin/hive "${@}"
3. Save the bash script wherever you want and run the Hive metastore service. Once the metastore service is running, do not run hive from Hive's bin directory; instead, run the "hive-ig" script so that Hive runs on IgniteMR.
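As a quick smoke test, a query can be run through the wrapper script; the table and query below are placeholders, not part of this setup, so substitute any table that exists in your metastore:

```shell
# Start the metastore service in the background, then run a query via the wrapper.
hive --service metastore &
./hive-ig.sh -e "SELECT category, COUNT(*) FROM sales GROUP BY category;"
```

Because mapred-site.xml now names "ignite" as the framework, the aggregation above is executed by Ignite MR instead of YARN, which is exactly the class of query (aggregations, joins) that benefited least from stock Hadoop MapReduce.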
At this point, your Hadoop and Hive are both set up on Ignite. However, there are still a lot of Ignite features that could be added, like Ignite Persistence, and Ignite can also be integrated with Apache Spark, but that is a topic for another blog. I hope this blog helps you configure Ignite with Hadoop and Hive easily. That is it for now. Thank you 🙂