Hadoop

$ hadoop job

[vagrant@localhost ~]$ hadoop job
WARNING: Use of this script to execute job is deprecated.
WARNING: Attempting to execute replacement "mapred job" instead.

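Usage: job <command> <args>
        [-submit <job-file>]
        [-status <job-id>]
        [-counter <job-id> <group-name> <counter-name>]
        [-kill <job-id>]
        [-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW DEFAULT. In addition to this, integers also can be used.
        [-events <job-id> <from-event-#> <#-of-events>]
        [-history [all] <jobHistoryFile|jobId> [-outfile <file>] [-format <human|json>]]
        [-list [all]]
        [-list-active-trackers]
        [-list-blacklisted-trackers]
        [-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are MAP REDUCE. Valid values for <task-state> are pending, running, completed, failed, killed
        [-kill-task <task-attempt-id>]
        [-fail-task <task-attempt-id>]
        [-logs <job-id> <task-attempt-id>]
        [-config <job-id> <file>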

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

Downloading Apache Hadoop and installing it on CentOS

Download Hadoop from the Apache Software Foundation site.

apache
http://www.apache.org/dyn/closer.cgi/hadoop/common/

This time I am using 3.0.0:
http://ftp.tsukuba.wide.ad.jp/software/apache/hadoop/common/hadoop-3.0.0/

[vagrant@localhost ~]$ wget http://ftp.tsukuba.wide.ad.jp/software/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
--2018-01-23 16:31:38--  http://ftp.tsukuba.wide.ad.jp/software/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
Resolving ftp.tsukuba.wide.ad.jp... 203.178.132.80, 2001:200:0:7c06::9393
Connecting to ftp.tsukuba.wide.ad.jp|203.178.132.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 306392917 (292M) [application/x-gzip]
Saving to: `hadoop-3.0.0.tar.gz'

100%[======================================>] 306,392,917 1.66M/s   in 2m 56s

2018-01-23 16:34:34 (1.66 MB/s) - `hadoop-3.0.0.tar.gz' saved [306392917/306392917]

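Next, extract the archive.

tar zxvf hadoop-3.0.0.tar.gz

After extracting, move the directory to a location of your choice.

sudo mv hadoop-3.0.0 /usr/local

Finally, set JAVA_HOME to the Java directory looked up earlier and HADOOP_INSTALL to the Hadoop directory that was just moved, then add both bin directories to PATH.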

[vagrant@localhost ~]$ export JAVA_HOME=/etc/alternatives/java_sdk_1.8.0
[vagrant@localhost ~]$ export HADOOP_INSTALL=/usr/local/hadoop-3.0.0
[vagrant@localhost ~]$ export PATH=$HADOOP_INSTALL/bin:$JAVA_HOME/bin:$PATH

Verify the installation

[vagrant@localhost ~]$ hadoop version
Hadoop 3.0.0
Source code repository https://git-wip-us.apache.org/repos/asf/hadoop.git -r c25427ceca461ee979d30edd7a4b0f50718e6533
Compiled by andrew on 2017-12-08T19:16Z
Compiled with protoc 2.5.0
From source with checksum 397832cb5529187dc8cd74ad54ff22
This command was run using /usr/local/hadoop-3.0.0/share/hadoop/common/hadoop-common-3.0.0.jar

It works! Tonight I can raise a glass to Hadoop.

Hadoop on CentOS: where Java is installed

Check where the java command is being run from.

[vagrant@localhost ~]$ which java
/usr/bin/java

Check the install directory.

[vagrant@localhost ~]$ ls -la /usr/bin/java
lrwxrwxrwx. 1 root root 22 Nov 21 16:18 2016 /usr/bin/java -> /etc/alternatives/java
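
The ls output shows only the first hop of the link: /usr/bin/java points to /etc/alternatives/java, which is usually a symlink itself. As a rough sketch, the whole chain can be followed with Python's standard library. The helper name resolve_command is made up for this note.

```python
import os
import shutil


def resolve_command(name):
    """Return the real file behind a command on PATH, or None if it is not found."""
    path = shutil.which(name)  # the same PATH lookup that `which` performs
    if path is None:
        return None
    # realpath follows every symlink in the chain, not just the first hop.
    return os.path.realpath(path)


if __name__ == "__main__":
    print(resolve_command("java"))
```

The directory to use for JAVA_HOME is normally a parent of the resolved path, but which parent depends on how the JDK package lays out its files, so check it by hand.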

CPU information

[vagrant@localhost ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 78
model name      : Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
stepping        : 3
cpu MHz         : 2399.996
cache size      : 3072 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc up rep_good xtopology nonstop_tsc unfair_spinlock pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch rdseed
bogomips        : 4799.99
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Kernel information

[vagrant@localhost ~]$ uname -a
Linux localhost.localdomain 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

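Hadoop on CentOS: checking the Java version

Hadoop needs Java to run. Older Hadoop releases ran on Java 1.6 or later; Hadoop 3.x requires at least Java 8.
The Hadoop community recommends using a JDK.
Log in to the Vagrant box and check the version.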

[vagrant@localhost ~]$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)

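/mapreduce/Mapper.java

import java.io.IOException;

// Excerpt from Hadoop's Mapper; Context is an inner class of Mapper and is omitted here.
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
	// The default map() is the identity function: it emits each input pair unchanged.
	@SuppressWarnings("unchecked")
	protected void map(KEYIN key, VALUEIN value,
		Context context) throws IOException, InterruptedException {
		context.write((KEYOUT) key, (VALUEOUT) value);
	}
}

By the time map() is called, the key and the value already arrive as two separate arguments.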

HashPartitioner.java

package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
	public int getPartition(K key, V value, int numReduceTasks) {
		// Clear the sign bit so the hash is non-negative, then map it onto a reducer index.
		return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}
}
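
To see what the mask and the modulo do, here is a small Python sketch of the same arithmetic. It assumes the key is a plain string hashed the way Java's String.hashCode() does it. Real Hadoop keys are often Text objects with a different hashCode(), so the partition numbers here are only illustrative. The function names java_string_hash and get_partition are made up for this note.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() for strings of BMP characters."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    # Reinterpret the 32-bit value as a signed int.
    return h - (1 << 32) if h >= (1 << 31) else h


def get_partition(key_hash, num_reduce_tasks):
    """Same arithmetic as HashPartitioner.getPartition."""
    # Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    return (key_hash & 0x7FFFFFFF) % num_reduce_tasks


if __name__ == "__main__":
    for key in ["tokyo", "osaka", "nagoya"]:
        print(key, get_partition(java_string_hash(key), 3))
```

Equal keys always produce the same hash, so they always land on the same reducer. That is the property the reduce phase relies on.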

Counting unique users

user_id_list = []

ARGF.each_line do |line|
  line.chomp!

  # The user ID is the first comma-separated field.
  user_id = line.split(',')[0]

  # Record each user ID only once.
  unless user_id_list.include?(user_id)
    user_id_list << user_id
  end
end

puts "Unique users (UU): #{user_id_list.size}"
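
The script above keeps every ID in an array and scans it for each line, which gets slow as the number of users grows. A minimal sketch of the same count in Python, using a set for the membership check, might look like this. The function name and the sample rows are made up, and skipping blank lines is my own assumption.

```python
def count_unique_users(lines):
    """Count distinct user IDs, taking the first comma-separated field of each line."""
    user_ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no user ID and are ignored
        user_ids.add(line.split(",")[0])
    return len(user_ids)


if __name__ == "__main__":
    # Hypothetical sample rows: user ID first, then any other columns.
    sample = ["u1,2018-01-01", "u2,2018-01-01", "u1,2018-01-02"]
    print("Unique users: {}".format(count_unique_users(sample)))
```

This still holds every ID in memory on one machine, which is exactly the limit that pushes this kind of job toward MapReduce.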

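Reducer Code

import sys

def reducer():
	salesTotal = 0
	oldKey = None

	for line in sys.stdin:
		data = line.strip().split("\t")

		# Skip malformed lines that are not exactly "key<TAB>value".
		if len(data) != 2:
			continue

		thisKey, thisSale = data

		# Input arrives sorted by key, so a change of key means the previous key is complete.
		if oldKey and oldKey != thisKey:
			print("{0}\t{1}".format(oldKey, salesTotal))

			salesTotal = 0

		oldKey = thisKey
		salesTotal += float(thisSale)

	# Emit the total for the last key, which never sees a key change.
	if oldKey is not None:
		print("{0}\t{1}".format(oldKey, salesTotal))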
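
The reducer only works because its input is already sorted by key, so all lines for one key sit next to each other. As a hedged sketch, the same idea can be written with itertools.groupby, which also merges only adjacent equal keys. The function name reduce_sales is made up, and it returns a list instead of printing so it is easy to try out.

```python
from itertools import groupby


def reduce_sales(lines):
    """Sum values per key for tab-separated lines that are already sorted by key."""
    pairs = []
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # skip malformed lines, as the reducer above intends
        pairs.append((data[0], float(data[1])))
    # groupby merges only adjacent equal keys, which is why sorted input matters.
    return [(key, sum(value for _, value in group))
            for key, group in groupby(pairs, key=lambda pair: pair[0])]


if __name__ == "__main__":
    for key, total in reduce_sales(["a\t1.0", "a\t2.5", "b\t4"]):
        print("{0}\t{1}".format(key, total))
```

If the input is not sorted, the same key shows up in several groups. The hand-written reducer fails in the same way.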

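Defensive Mapper

import sys

def mapper():
	for line in sys.stdin:
		data = line.strip().split("\t")
		# Defensive check: skip lines that do not have exactly six tab-separated fields.
		if len(data) != 6:
			continue
		date, time, store, item, cost, payment = data
		print("{0}\t{1}".format(store, cost))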
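
Between the mapper and the reducer, Hadoop sorts the mapper output by key. A rough local sketch of map followed by that sort, assuming the six-field tab-separated layout used above, could look like this. The names map_line and map_and_sort and the sample rows are made up for illustration.

```python
def map_line(line):
    """Return (store, cost) for a well-formed six-field line, else None."""
    data = line.strip().split("\t")
    if len(data) != 6:
        return None  # malformed line: drop it instead of crashing
    date, time, store, item, cost, payment = data
    return store, cost


def map_and_sort(lines):
    """Apply map_line to every line, then sort by key as the shuffle phase would."""
    pairs = [pair for pair in (map_line(line) for line in lines) if pair is not None]
    return sorted(pairs, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Hypothetical rows: date, time, store, item, cost, payment.
    sample = [
        "2012-01-01\t09:00\tMiami\tBooks\t12.50\tVisa",
        "this line is broken",
        "2012-01-01\t09:05\tAustin\tToys\t3.00\tCash",
    ]
    for store, cost in map_and_sort(sample):
        print("{0}\t{1}".format(store, cost))
```

The sorted store and cost pairs are the kind of input a reducer expects on stdin.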

Big data tools are needed when the data is too big to store on one machine. The challenges with big data are that it is created fast and that it arrives from different sources in various formats.

Hadoop
Store in HDFS
Process with MapReduce

Hadoop ecosystem
Pig, Hive … select * from
MapReduce, Impala, HBase
HDFS <- Sqoop, Flume
Hue, Oozie, Mahout
Cloudera's CDH is a distribution of Hadoop.
Hadoop picks three nodes, roughly at random, to hold copies of each block.
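
As a toy illustration of the "three nodes at random" idea, the sketch below assigns each block to three distinct nodes picked at random. It is a simplification: the default placement policy in real HDFS also takes racks into account. The function name, block IDs and node names are all made up.

```python
import random


def place_replicas(block_ids, datanodes, replication=3, seed=None):
    """Assign each block to `replication` distinct nodes chosen at random."""
    rng = random.Random(seed)  # a seed makes the toy example repeatable
    # sample() never returns the same node twice for one block.
    return {block: rng.sample(datanodes, replication) for block in block_ids}


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    for block, replicas in place_replicas(["blk_1", "blk_2"], nodes, seed=0).items():
        print(block, replicas)
```

With three copies on different nodes, losing one machine does not lose the block.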