Run Word-Count Example of Hadoop (Java Version)

This blog post explains how to run the word-count example with Hadoop and Java on a cloud server, including setting up the Hadoop configuration, writing and adjusting the Java code for Hadoop, and the commands for running Hadoop. Before starting, make sure that Linux is installed on your cloud service and that both Java and Hadoop are installed on that Linux system.

(1) Word Count Example in Local (Standalone) Mode

Standalone mode runs Hadoop as a single Java process on one host, in non-distributed mode; this is the default. It uses the local file system instead of a distributed file system and does not require any Hadoop daemons to be running.

Since standalone mode is Hadoop's default, once Hadoop is installed we just need to put the data into an input folder and run the command directly.

First, we create the input folder on the local filesystem, then download a text file to use for word-count from the internet and put it into the input folder.

# On Ubuntu Linux, the ‘wget’ command can be used to download data over HTTP
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
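
The text also needs to end up inside the input folder mentioned above. A minimal sketch, assuming the paths used in the command further below (wget saves the file under the last component of the URL, here 20417.txt.utf-8):

# create the local input folder and move the downloaded file into it
mkdir -p /home/ubuntu/wordcount/input
mv 20417.txt.utf-8 /home/ubuntu/wordcount/input/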

After the data is ready, we can run the following command (there is only one command!) to run the canonical MapReduce example.

/usr/local/hadoop-3.3.1/bin/hadoop jar /usr/local/hadoop-3.3.1/share/hadoop/mapreduce/hadoop*examples*.jar wordcount /home/ubuntu/wordcount/input /home/ubuntu/wordcount/output
  • The format of this command: /installation directory of Hadoop/bin/hadoop jar /installation directory of Hadoop/share/hadoop/mapreduce/hadoop*examples*.jar wordcount /path of the stored data /path for the output

Note: the folder names “input” and “output” are arbitrary; you can choose any names you like for the input and output directories, but the input folder must contain the downloaded file. The output folder is created by Hadoop, so do not create it manually before running the command above.
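
After the job finishes, the result can be checked directly on the local filesystem. A minimal check, assuming the paths from the command above (with a single reducer the counts end up in a file named part-r-00000, as described at the end of this post):

# list the output directory created by Hadoop
ls /home/ubuntu/wordcount/output
# print the word counts; each line contains a word and its count, separated by a tab
cat /home/ubuntu/wordcount/output/part-r-00000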

(2) Word Count Example in Pseudo-Distributed Mode

Pseudo-distributed mode runs on one host and uses multiple Java processes to imitate the various nodes of a fully distributed cluster; it is not meant for production use, but it provides the main functions of the fully distributed mode.

First, we need to edit the configuration files to switch Hadoop into pseudo-distributed mode. The two files that need to be edited are 'core-site.xml' and 'hdfs-site.xml', located in '/installation directory of Hadoop/etc/hadoop'.

# core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
# hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  • By editing core-site.xml we configure the global parameters of the cluster, including the HDFS URL and Hadoop's temporary directory. The HDFS parameters are stored in hdfs-site.xml, such as the storage locations of the NameNode and DataNode, the number of file replicas, file read permissions, etc.
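
For example, the Hadoop temporary directory mentioned above can also be set in core-site.xml through the standard hadoop.tmp.dir property; a minimal sketch, assuming /home/ubuntu/hadoop-tmp is a writable directory on your host (the path is just an example):

# optional extra property inside the <configuration> block of core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/ubuntu/hadoop-tmp</value>
</property>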

After the configuration files are ready, let's check whether we can ssh to localhost without a passphrase.

$ ssh localhost

If we cannot ssh to localhost without a passphrase, execute the following commands to set up passphraseless ssh:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

The next step is to run the following commands from the Hadoop installation directory to format HDFS, start the daemons, and create the HDFS directories needed to run MapReduce jobs in pseudo-distributed mode.

# Format the filesystem
$ bin/hdfs namenode -format
# Start NameNode daemon and DataNode daemon
$ sbin/start-dfs.sh
# Make the HDFS directories required to execute MapReduce jobs
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

We can check whether the daemons started successfully with the command 'jps'.

$ jps
# if NameNode, DataNode and SecondaryNameNode are all shown, it means that the daemons have started
# 6818 DataNode
# 6651 NameNode
# 7067 SecondaryNameNode
# 91869 Jps

  • The DataNode stores data blocks; the NameNode accepts read and write requests from clients and directs them to the DataNodes; the SecondaryNameNode helps the NameNode merge the edits log to reduce the NameNode's startup time.

(3) Create and Modify the Java code for Word Count Example

Now we write the Java code that processes the txt file downloaded in step (1). The goal is to count words by their first letter, so the output of the job should be a file containing lines like: a number_of_words_starting_with_a_or_A

First of all, we create and edit the Java file in any path of the local filesystem we want.

touch <Java file name>.java
vim <Java file name>.java

The Java code is adapted from the official Apache Hadoop WordCount sample and then adjusted to meet the requirement above. The two key functions, the mapper and the reducer, are shown after adjustment below, and a sketch of the driver class that wires them into a job follows them; the full code can be found there.

public static class FirstLetterCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // convert the input line into a string and lowercase all of its letters
        String line = value.toString().toLowerCase();
        // split the line into individual words on blank spaces
        String[] strings = line.split(" ");
        for (int i = 0; i < strings.length; i++) {
            // filter out empty strings
            if (strings[i] != null && !strings[i].isEmpty()) {
                // extract the first letter of this word
                char firstLetter = strings[i].charAt(0);
                // keep only words beginning with a-z and emit their first letter to the reducer
                if (firstLetter >= 'a' && firstLetter <= 'z') {
                    context.write(new Text(String.valueOf(firstLetter)), new IntWritable(1));
                }
            }
        }
    }
}

public static class FirstLetterCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // sum the counts received for this first letter in a loop
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
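
The mapper and reducer above still need a driver class with a main method that configures and submits the job; this part is taken almost unchanged from the official WordCount example. The sketch below shows how the pieces fit together; the class name FirstLetterCount is an assumption (it must match the name of your Java file), and the full adjusted code is the one linked above.

// Driver sketch adapted from the official WordCount example; FirstLetterCount is an assumed class name.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstLetterCount {

    // FirstLetterCountMapper and FirstLetterCountReducer shown above are nested here

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "first letter count");
        job.setJarByClass(FirstLetterCount.class);
        job.setMapperClass(FirstLetterCountMapper.class);
        // the reducer can double as a combiner because it only sums counts
        job.setCombinerClass(FirstLetterCountReducer.class);
        job.setReducerClass(FirstLetterCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path and args[1] the output path, as passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}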

After editing the file, we need to compile the word-count example and package it into a jar file.

cd /directory of the folder that contains Java file
javac -cp `/installation directory of Hadoop/bin/hadoop classpath` <Java file name>.java
jar -cvf <Java file name>.jar *.class
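
Optionally, we can verify that the compiled classes made it into the archive; 'jar -tf' lists the contents of a jar file:

# list the contents of the jar to check that the classes are included
jar -tf <Java file name>.jar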

Now there is a “.jar” file in the same folder as the Java file. Next, load the file whose words we are counting into HDFS.

# the last argument 'input' is the destination directory in HDFS
/installation directory of Hadoop/bin/hdfs dfs -put /directory of the data we are going to process input

  • There should be a file in hdfs now, and we can check it by this command: /installation directory of Hadoop/bin/hdfs dfs -ls input

Then, we run the following command to run the Java code in Hadoop.

/installation directory of Hadoop/bin/hadoop jar <Java file name>.jar <Java file name> input output

If the Hadoop run completes normally, verify that the output looks as expected. First check the contents of the output directory in HDFS, then check the contents of the output file using the ‘-cat’ argument to ‘hdfs dfs’.

/installation directory of Hadoop/bin/hdfs dfs -ls output
/installation directory of Hadoop/bin/hdfs dfs -cat output/part-r-00000

  • The output files are by default named part-x-yyyyy, where x is either 'm' or 'r', depending on whether the job was a map-only job or included a reduce phase, and yyyyy is the mapper or reducer task number (zero-based).
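
For this first-letter count job, each output line is a letter key and its count separated by a tab; the exact counts depend on the downloaded text, so the sketch below only shows the expected format:

# expected format of 'hdfs dfs -cat output/part-r-00000' (tab-separated)
a	<number of words starting with a or A>
b	<number of words starting with b or B>
...
z	<number of words starting with z or Z>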

Unless otherwise stated, all articles in this blog are licensed under CC BY-SA 4.0. Please indicate the source when reposting!