Run Word-Count Example of Hadoop (Java Version)
This post introduces how to run the word-count example with Hadoop and Java on a cloud server, including setting up the Hadoop configuration, writing and adjusting the Java code for Hadoop, and the commands for running Hadoop. Before starting, make sure that Linux is installed on your cloud service and that both Java and Hadoop are installed on that Linux system.
(1) Word Count Example in Local (Standalone) Mode
Standalone mode means Hadoop runs as a single, independent Java process on one host, in non-distributed mode. It uses the local file system instead of a distributed file system and does not require any Hadoop daemons to be loaded.
Standalone mode is Hadoop's default mode, so once Hadoop is installed we just need to put the data into an input folder and run the command directly.
Firstly, we create the input folder on the local filesystem, then download a text file to use for word count from the internet and put it into the input folder.
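For example, assuming the input folder lives under the home directory and the sample data is any plain-text file (the URL below is only a placeholder for whatever file you choose to download):

```bash
# Create a local input folder and download a plain-text file into it.
mkdir ~/input
wget -O ~/input/data.txt https://example.com/sample.txt   # placeholder URL
```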
After the data is ready, a single command is enough to run the canonical MapReduce example.
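A sketch of that command, assuming HADOOP_HOME points to the Hadoop installation directory and using 3.3.6 as an example version (adjust the jar name to match your installation):

```bash
# Run the bundled word-count example on the local filesystem.
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar \
  wordcount ~/input ~/output
```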
- The general form of this command is: /installation directory of Hadoop/bin/hadoop jar /installation directory of Hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-&lt;version&gt;.jar wordcount /path of the stored data /path for the output
Note: the folder names “input” and “output” are arbitrary; you can choose any names you like for the input and output directories, as long as the input folder contains the downloaded file. The output folder is created by Hadoop, so do not create it manually before running the command above.
(2) Word Count Example in Pseudo-Distributed Mode
Pseudo-distributed mode still runs on a single host, but uses multiple Java processes to imitate the different nodes of a fully distributed cluster. It is not meant for production use, but it provides the main functions of fully distributed mode.
Firstly, we need to edit the configuration files to switch Hadoop into pseudo-distributed mode. The two files that need to be edited are 'core-site.xml' and 'hdfs-site.xml', both located in '/installation directory of Hadoop/etc/hadoop'.
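A minimal 'core-site.xml', following the official single-node setup guide (the temporary directory path is an assumption and can be changed):

```xml
<configuration>
  <!-- URL of the default filesystem: HDFS running on localhost. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Base directory for Hadoop's temporary files; this path is an assumption. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
```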
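A minimal 'hdfs-site.xml' for a single host, where one file replica is enough:

```xml
<configuration>
  <!-- Keep a single copy of each block, since there is only one DataNode. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```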
- By editing core-site.xml we configure the global parameters of the cluster, such as the HDFS URL and Hadoop's temporary directory. The HDFS parameters are stored in hdfs-site.xml, such as the storage locations of the NameNode and DataNodes, the number of file replicas, file read permissions, etc.
After the configuration files are ready, let's check whether we can ssh to localhost without a passphrase.
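The check is simply an attempt to log in to the local host; if it prompts for a passphrase, set up key-based login as described next.

```bash
# If this logs in without prompting for a passphrase, ssh is already set up.
ssh localhost
exit
```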
If we cannot ssh to localhost without a passphrase, execute the following commands to set up passphraseless ssh.
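The commands below follow the official Hadoop single-node setup guide:

```bash
# Generate a key pair without a passphrase and authorize it for localhost.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```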
The next step is to run the following commands in the Hadoop installation directory to format the filesystem and start the HDFS daemons, so that jobs can run locally in pseudo-distributed mode.
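A sketch of these commands, run from the Hadoop installation directory (formatting is only needed the first time, since it erases any existing HDFS data):

```bash
# Format the HDFS filesystem (first run only).
bin/hdfs namenode -format
# Start the NameNode, DataNode, and SecondaryNameNode daemons.
sbin/start-dfs.sh
```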
We can check whether the daemons started successfully with the command 'jps'.
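Running 'jps' should list the three HDFS daemons (the process IDs will differ from machine to machine):

```bash
jps
# Expected processes:
#   NameNode
#   DataNode
#   SecondaryNameNode
#   Jps
```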
- The DataNode stores data blocks; the NameNode accepts read and write requests from clients and directs them to the DataNodes; the SecondaryNameNode helps the NameNode merge the edits log to reduce the NameNode's startup time.
(3) Create and Modify the Java code for Word Count Example
Now we write the Java code to count the words in the txt file downloaded in step (1), grouped by their first letter regardless of case. The output of the job should therefore be a file containing lines like: a number_of_words_starting_with_a_or_A
First of all, we create and edit the Java file anywhere on the local filesystem.
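For example, assuming the source file is named 'WordCount.java' (the same name as the official sample) and is edited with any text editor:

```bash
# Create and edit the source file in the current directory.
nano WordCount.java
```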
The Java code is copied from the official Apache Hadoop sample and then adjusted to meet the requirement above. The two key functions, the mapper and the reducer, are shown below after the adjustment; the rest of the code follows the official sample.
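A sketch of the adjusted mapper and reducer, assuming the imports, enclosing WordCount class, and main method are kept from the official sample; the only change is that the mapper emits the lowercased first letter of each word instead of the word itself:

```java
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text letter = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      // Emit the lowercased first character, so that words starting
      // with 'a' and 'A' are counted under the same key.
      letter.set(String.valueOf(Character.toLowerCase(token.charAt(0))));
      context.write(letter, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the ones emitted for this first letter.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```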
After editing the file, we need to compile the word-count example and package it into a jar file.
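The compile step below follows the official MapReduce tutorial and assumes JAVA_HOME and HADOOP_CLASSPATH are already set as described there; the jar name 'wc.jar' is just a choice:

```bash
# Compile against the Hadoop classpath and package the classes into a jar.
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
```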
Now there is a jar file in the current directory. Before running the job, the input data needs to be uploaded into HDFS.
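A sketch of the upload, assuming the text file downloaded in step (1) sits in ~/input and the HDFS input directory is simply called 'input':

```bash
# Create the input directory in HDFS and upload the local text file.
bin/hdfs dfs -mkdir -p input
bin/hdfs dfs -put ~/input/*.txt input
```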
- There should be a file in HDFS now, and we can check it with this command: /installation directory of Hadoop/bin/hdfs dfs -ls input
Then, we use the following command to run the Java code on Hadoop.
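A sketch of the run command, assuming the jar and main class are named as in the compile step and the HDFS directories are 'input' and 'output':

```bash
# Run the custom word-count job on the data stored in HDFS.
bin/hadoop jar wc.jar WordCount input output
```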
If the Hadoop run completes normally, verify that the output looks as expected. First check the content of the output directory in HDFS, then check the content of the output file using the ‘-cat’ argument of ‘hdfs dfs’.
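Both checks can be done with 'hdfs dfs'; the result file name 'part-r-00000' is the default for a job with a single reducer.

```bash
# List the output directory, then print the reducer's result file.
bin/hdfs dfs -ls output
bin/hdfs dfs -cat output/part-r-00000
```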
- The output files are by default named part-x-yyyyy, where x is either 'm' or 'r' depending on whether the output came from a map-only job or from a reducer, and yyyyy is the mapper or reducer task number (zero-based).