Installing Hadoop on Ubuntu 16.04

Hadoop is an Apache Software Foundation project for cluster data management.

It is a Java-based framework that manages large data sets across a cluster of machines.

Configuring a full Hadoop cluster is involved, but you can also install Hadoop on a single machine to perform some basic operations.

Hadoop may seem like a single piece of software, but it is made up of several components. Here they are:

Hadoop Common:

A collection of utilities and libraries that support the other Hadoop modules.

HDFS:

The Hadoop Distributed File System, which is responsible for storing the data on disk.

YARN:

YARN stands for Yet Another Resource Negotiator; it is the open-source distributed processing framework that manages resources across the cluster.

MapReduce:

MapReduce is a model for processing and generating big data sets in the cluster using parallel, distributed algorithms. For example, a word-count job maps each line of input to (word, 1) pairs and then reduces those pairs by summing the counts for each word.

MapReduce is the original processing model; Hadoop 2.0 can also run other processing models on top of YARN.

Requirements

  • An Ubuntu 16.04 server configured according to the initial server setup guide.

Follow that guide to configure the server. After that, you can move on to installing Hadoop and its dependencies.

Hadoop requires Java to run, so we will install Java first and then Hadoop.

After that, we will configure Hadoop and run it.

Let us see how to install Hadoop on Ubuntu step by step in this tutorial.

Install Java

First, you have to update the package index.

$ sudo apt-get update

After that, install Java on your Ubuntu server. On Ubuntu 16.04, the default-jdk metapackage installs OpenJDK 8:

$ sudo apt-get install default-jdk

Now, check the Java version.

$ java -version

You will get the following output.

openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

Installing Hadoop

We have installed Java; next we have to install Hadoop. Go to the Apache Hadoop releases page to find the latest version of Hadoop.


You have to find the latest stable version. Once you find it, right-click the link to the binary release and copy it.

Here we are going to install Hadoop 2.7.3 on Ubuntu.


Use the command below to download the file.

$ wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz 

Note: you will be redirected to an available mirror, so the URL you copy may not match the one given above.
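If the mirror link you copied has gone stale, older releases are also kept on the Apache archive server; assuming the standard archive layout, the same tarball can be fetched with:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz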

We have to check whether the file was altered during download. For that, we will do a SHA-256 check.

Go back to the releases page and follow the Apache link into the distribution directory. Find the .mds file for the version you downloaded, copy its link, and fetch it with wget as shown below.

$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz.mds 

After that, run the verification using the command below.

$ shasum -a 256 hadoop-2.7.3.tar.gz

You will get the following output.

d489df3808244b906eb38f4d081ba49e50c4603db03efd5e594a1e98b09259c2 hadoop-2.7.3.tar.gz

Now, check the SHA-256 value.

$ cat hadoop-2.7.3.tar.gz.mds

Both outputs should match:

~/hadoop-2.7.3.tar.gz.mds
...
hadoop-2.7.3.tar.gz: SHA256 = D489DF38 08244B90 6EB38F4D 081BA49E 50C4603D B03EFD5E 594A1E98 B09259C2
...

You can ignore the differences in case and spacing. This check verifies that the file was not corrupted or altered during download.
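If you prefer not to compare the two values by eye, a minimal sketch of normalizing the shasum output to the .mds format (uppercase hex; the .mds value still has to be read with its spaces ignored):

$ shasum -a 256 hadoop-2.7.3.tar.gz | awk '{print toupper($1)}'
D489DF3808244B906EB38F4D081BA49E50C4603DB03EFD5E594A1E98B09259C2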

Once you have verified that the file is intact, use the tar command to extract it.

$ tar -xzvf hadoop-2.7.3.tar.gz

Here:

-x extracts the archive

-z uncompresses the gzip file

-v gives verbose output

-f specifies that we are extracting from a file

Now, we will move the extracted directory to /usr/local.

$ sudo mv hadoop-2.7.3 /usr/local/hadoop
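As a quick sanity check, list the new directory; you should see Hadoop's bin, etc, sbin, and share directories, among others:

$ ls /usr/local/hadoop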

Now, the next step is to set up Hadoop's environment.

Configuring Hadoop to Use Java

We have to tell Hadoop which Java installation to use, either as a fixed path in Hadoop's configuration file or dynamically, via an environment lookup.

On Ubuntu, /usr/bin/java is a symlink to /etc/alternatives/java, which in turn points to the actual Java binary.

We use readlink with the -f flag to follow every symlink in every part of the path, recursively, and sed to trim bin/java from the output, which gives us the correct value for JAVA_HOME.
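To see the symlink chain for yourself, you can resolve it one step at a time (the exact target can vary by machine, so treat this output as illustrative):

$ readlink /usr/bin/java
/etc/alternatives/java

$ readlink /etc/alternatives/java
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java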

To get the default Java path in one step:

$ readlink -f /usr/bin/java | sed "s:bin/java::"

Output

/usr/lib/jvm/java-8-openjdk-amd64/jre/

We will set this path as Hadoop's Java home.

Alternatively, you can use the readlink command inside the configuration file so that the path is resolved dynamically and keeps working if the default Java version changes.

First, open hadoop-env.sh:

$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

There are two options available. Here they are.

Setting Up a Static Value

/usr/local/hadoop/etc/hadoop/hadoop-env.sh

. . .
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
. . .

Using Readlink Directly

/usr/local/hadoop/etc/hadoop/hadoop-env.sh

. . .
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
. . .
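Either way, you can check that Hadoop picks up the Java path you configured by asking it for its version; this sources hadoop-env.sh, so it will complain if JAVA_HOME is wrong:

$ /usr/local/hadoop/bin/hadoop version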

Running Hadoop

We have set the Java path, and now we can run Hadoop.

Multi-node cluster setup, whether for Hadoop 2.6 or 2.7, is more complex and will be discussed in an upcoming article. Here we run Hadoop in standalone mode:

$ /usr/local/hadoop/bin/hadoop

You will get the following output.

Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon

If you see this help output, it means Hadoop is set up to run in standalone mode.

We will now test whether it is configured properly.

To do that, we will run one of the example MapReduce programs that ship with Hadoop.

The first step is to create a directory called input in your home directory, then copy Hadoop's configuration files into it to use as test data:

$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

Use the command below to run hadoop-mapreduce-examples, a Java archive with several example programs inside.

We are going to invoke its grep program, which counts the matches of a literal word or regular expression.

We will search for occurrences of the word principal, optionally followed by a period; the regular expression principal[.]* matches the word both inside a sentence and at the end of one.

The expression is case-sensitive, so we would not find the word if it were capitalized at the beginning of a sentence.
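As a quick illustration of what this pattern matches, here it is with ordinary grep rather than Hadoop:

$ echo "a principal reason. the principal. The Principal spoke." | grep -o 'principal[.]*'
principal
principal.

Note that the capitalized Principal is not matched. Now run the MapReduce job itself: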

$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep ~/input ~/grep_example 'principal[.]*'

Once the process has completed, you will get output like the following.

Output
. . .
        File System Counters
                FILE: Number of bytes read=1247674
                FILE: Number of bytes written=2324248
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Map output bytes=37
                Map output materialized bytes=47
                Input split bytes=114
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=47
                Reduce input records=2
                Reduce output records=2
                Spilled Records=4
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=61
                Total committed heap usage (bytes)=263520256
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=151
        File Output Format Counters
                Bytes Written=37

If instead you get output like the following, it means that the output folder already exists:

Output
. . .
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
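Hadoop will not overwrite an existing output directory. If you hit this error, delete the directory and run the job again:

$ rm -rf ~/grep_example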

The results are stored in the output directory, and you can check them using cat.

$ cat ~/grep_example/*

Output
6       principal
1       principal.

The output indicates that the grep job found six occurrences of the word principal without a trailing period and one occurrence with one.
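You can cross-check this result with ordinary grep over the same input files; the counts should line up:

$ grep -oh 'principal[.]*' ~/input/*.xml | sort | uniq -c
      6 principal
      1 principal.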

This example verifies that the installation was done properly and that Hadoop is working in standalone mode.

Non-privileged users can run Hadoop in this mode, which is useful for exploring and debugging.

