$ hadoop job

[vagrant@localhost ~]$ hadoop job
WARNING: Use of this script to execute job is deprecated.
WARNING: Attempting to execute replacement "mapred job" instead.

Usage: job <command> <args>
        [-submit <job-file>]
        [-status <job-id>]
        [-counter <job-id> <group-name> <counter-name>]
        [-kill <job-id>]
        [-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW DEFAULT. In addition to this, integers also can be used.
        [-events <job-id> <from-event-#> <#-of-events>]
        [-history [all] <jobHistoryFile|jobId> [-outfile <file>] [-format <human|json>]]
        [-list [all]]
        [-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are MAP REDUCE. Valid values for <task-state> are pending, running, completed, failed, killed
        [-kill-task <task-attempt-id>]
        [-fail-task <task-attempt-id>]
        [-logs <job-id> <task-attempt-id>]

Generic options supported are:
-conf <configuration file>            specify an application configuration file
-D <property=value>                   define a value for a given property
-fs <file:///|hdfs://namenode:port>   specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>      specify a ResourceManager
-files <file1,...>                    specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>                   specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>              specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

Downloading Apache Hadoop and installing it on CentOS

Download Hadoop from the Apache Software Foundation site.



[vagrant@localhost ~]$ wget http://ftp.tsukuba.wide.ad.jp/software/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
--2018-01-23 16:31:38--  http://ftp.tsukuba.wide.ad.jp/software/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
Resolving ftp.tsukuba.wide.ad.jp..., 2001:200:0:7c06::9393
Connecting to ftp.tsukuba.wide.ad.jp||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 306392917 (292M) [application/x-gzip]
Saving to: `hadoop-3.0.0.tar.gz'

100%[======================================>] 306,392,917 1.66M/s   in 2m 56s

2018-01-23 16:34:34 (1.66 MB/s) - `hadoop-3.0.0.tar.gz' saved [306392917/306392917]


[vagrant@localhost ~]$ tar zxvf hadoop-3.0.0.tar.gz


[vagrant@localhost ~]$ sudo mv hadoop-3.0.0 /usr/local


[vagrant@localhost ~]$ export JAVA_HOME=/etc/alternatives/java_sdk_1.8.0
[vagrant@localhost ~]$ export HADOOP_INSTALL=/usr/local/hadoop-3.0.0
[vagrant@localhost ~]$ export PATH=$HADOOP_INSTALL/bin:$JAVA_HOME/bin:$PATH


[vagrant@localhost ~]$ hadoop version
Hadoop 3.0.0
Source code repository https://git-wip-us.apache.org/repos/asf/hadoop.git -r c25427ceca461ee979d30edd7a4b0f50718e6533
Compiled by andrew on 2017-12-08T19:16Z
Compiled with protoc 2.5.0
From source with checksum 397832cb5529187dc8cd74ad54ff22
This command was run using /usr/local/hadoop-3.0.0/share/hadoop/common/hadoop-common-3.0.0.jar


Hadoop on CentOS: where Java is installed


[vagrant@localhost ~]$ which java


[vagrant@localhost ~]$ ls -la /usr/bin/java
lrwxrwxrwx. 1 root root 22 Nov 21 16:18 2016 /usr/bin/java -> /etc/alternatives/java


[vagrant@localhost ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 78
model name      : Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
stepping        : 3
cpu MHz         : 2399.996
cache size      : 3072 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc up rep_good xtopology nonstop_tsc unfair_spinlock pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch rdseed
bogomips        : 4799.99
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:


[vagrant@localhost ~]$ uname -a
Linux localhost.localdomain 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Hadoop on CentOS – checking the Java version


[vagrant@localhost ~]$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)


import java.io.IOException;

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
	protected void map(KEYIN key, VALUEIN value,
			Context context) throws IOException, InterruptedException {
		// The default implementation is the identity function.
		context.write((KEYOUT) key, (VALUEOUT) value);
	}
}
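Hadoop's default Mapper is an identity pass-through: every input pair is forwarded unchanged. A minimal Python sketch of that behavior (`emit` is a hypothetical stand-in for `context.write`):

```python
def identity_map(key, value, emit):
    # Forward the input pair unchanged, like the default
    # Mapper.map() shown above.
    emit(key, value)

pairs = []
identity_map("k1", "v1", lambda k, v: pairs.append((k, v)))
```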



package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
	public int getPartition(K key, V value, int numReduceTasks) {
		// Mask off the sign bit so the partition index is non-negative.
		return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}
}
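The same partitioning idea sketched in Python (Python's `hash()` stands in for Java's `hashCode()`; masking with the max positive 32-bit value keeps the index non-negative):

```python
def get_partition(key, num_reduce_tasks):
    # Mask off the sign bit so negative hashes still map into
    # the valid range [0, num_reduce_tasks).
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# Every record with the same key lands on the same reducer.
p = get_partition("storeA", 4)
```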


user_id_list = []

ARGF.each_line do |line|
  user_id = line.split(',')[0]

  unless user_id_list.include?(user_id)
    user_id_list << user_id
  end
end

puts "UU count: #{user_id_list.size}"
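The same unique-user (UU) count sketched in Python, using a set so the membership test is O(1) instead of scanning a list (the sample log lines are hypothetical):

```python
lines = ["u1,/top", "u2,/item", "u1,/cart"]  # hypothetical CSV log lines

user_ids = set()
for line in lines:
    user_ids.add(line.split(",")[0])  # first CSV field is the user id

uu_count = len(user_ids)
```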


import sys

def reducer():
	salesTotal = 0
	oldKey = None

	for line in sys.stdin:
		data = line.strip().split("\t")

		# Skip malformed lines.
		if len(data) != 2:
			continue

		thisKey, thisSale = data

		# Key changed: emit the total for the previous key.
		if oldKey and oldKey != thisKey:
			print "{0}\t{1}".format(oldKey, salesTotal)
			salesTotal = 0

		oldKey = thisKey
		salesTotal += float(thisSale)

	# Flush the total for the last key.
	if oldKey is not None:
		print "{0}\t{1}".format(oldKey, salesTotal)
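This reducer works only because Hadoop Streaming sorts the mapper output by key before handing it over. A local simulation of that sort-then-aggregate flow (Python 3 sketch; the sample records are hypothetical):

```python
from itertools import groupby

# Hypothetical (store, cost) records, as the mapper would emit them.
records = [("storeB", 2.0), ("storeA", 10.0), ("storeA", 5.5)]
records.sort(key=lambda r: r[0])  # the shuffle/sort phase Hadoop performs

# Group consecutive records by key and total the costs, mirroring
# what the streaming reducer above does line by line.
totals = {store: sum(cost for _, cost in group)
          for store, group in groupby(records, key=lambda r: r[0])}
```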

Defensive Mapper

import sys

def mapper():
	for line in sys.stdin:
		data = line.strip().split("\t")
		# Defensive check: skip malformed records instead of
		# crashing on tuple unpacking.
		if len(data) != 6:
			continue
		date, time, store, item, cost, payment = data
		print "{0}\t{1}".format(store, cost)
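A quick check of the defensive length test: a malformed line yields None instead of raising ValueError during tuple unpacking (Python 3 sketch; the sample lines are hypothetical):

```python
def parse_sale(line):
    # Mirror the defensive mapper: accept only records with exactly
    # six tab-separated fields.
    data = line.strip().split("\t")
    if len(data) != 6:
        return None
    date, time, store, item, cost, payment = data
    return store, float(cost)

good = parse_sale("2018-01-23\t16:31\tTokyo\tBook\t9.99\tVisa")
bad = parse_sale("truncated line")
```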


The reason to use Hadoop is that big data is too big to store and process on a single machine. The challenges with big data are that it is created fast and comes from different sources in various formats.

Store the data in HDFS
Process it with MapReduce

Hadoop ecosystem
Pig and Hive provide SQL-like queries (select * from ...) that run as MapReduce jobs.
MapReduce, Impala, and HBase form the processing layer.
HDFS is the storage layer; Sqoop and Flume import data into it.
Hue, Oozie, and Mahout are tools built on top of the stack.
Cloudera ships a packaged distribution of Hadoop (CDH).
HDFS replicates each block to three nodes picked at random.
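The last point refers to HDFS replication: by default each block is written to three DataNodes. A simplified sketch (hypothetical node names; the real placement policy is rack-aware, not purely random):

```python
import random

def place_replicas(datanodes, replication=3):
    # Pick `replication` distinct DataNodes for one block.
    # HDFS's actual policy also considers rack locality.
    return random.sample(datanodes, replication)

replicas = place_replicas(["dn1", "dn2", "dn3", "dn4", "dn5"])
```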