Run Word-Count Example of Hadoop (Python Version)
Although Hadoop MapReduce is written in Java, the mapper and reducer do not have to be. The Hadoop framework provides the Streaming API, which lets us use any command-line executable that reads from standard input and writes to standard output as the mapper or reducer. This tutorial (Link), although a bit old, provides an excellent introduction to using Python with Hadoop Streaming.
(1) Case Description
In this post, I use a word-count example to show how to run a MapReduce task with Python. The data in this example is Twitter data in JSON format, and the goal is to count the occurrences of each word in a given list. To do that, I will adapt the Python code from the tutorial above.
(2) Adjust the Code For Python-Hadoop-Example
The tutorial above runs the MapReduce task with two scripts, mapper.py and reducer.py, but both are written for Python 2.7, which is out of date. I will therefore port them to Python 3 and adapt them to the requirements of this case.
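A mapper for this case might look like the sketch below. It is not the original code of this post but a minimal illustration under a few assumptions: each input line holds one JSON-encoded tweet, the tweet text lives in a `text` field, and `TARGET_WORDS` is a hypothetical list of the words to count.

```python
#!/usr/bin/env python3
"""mapper.py: emit "word<TAB>1" for every tracked word found in a tweet."""
import json
import re
import sys

# Hypothetical word list (assumption); replace it with the words you want to count.
TARGET_WORDS = {"hadoop", "python", "java"}

# Assumption: a token is a run of word characters or apostrophes.
TOKEN_RE = re.compile(r"[\w']+")


def map_line(line):
    """Yield (word, 1) for each tracked word in one JSON-encoded tweet."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return  # skip blank or malformed lines
    if not isinstance(tweet, dict):
        return
    text = tweet.get("text")  # assumed field name for the tweet body
    if not isinstance(text, str):
        return
    for token in TOKEN_RE.findall(text.lower()):
        if token in TARGET_WORDS:
            yield token, 1


def main():
    # Hadoop Streaming feeds input records on stdin and reads key/value pairs from stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")


if __name__ == "__main__":
    main()
```

Matching is case-insensitive here because the text is lowercased before tokenizing; drop the `.lower()` call if case matters for your word list.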
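A reducer for this case might look like the following sketch. It assumes its input consists of tab-separated `word<TAB>count` lines that have already been sorted by word, which is what Hadoop Streaming delivers to a reducer. The function names are illustrative, not taken from the tutorial.

```python
#!/usr/bin/env python3
"""reducer.py: sum the counts for each word from sorted "word<TAB>count" lines."""
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, silently skipping malformed lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if sep and count.strip().isdigit():
            yield word, int(count)


def reduce_counts(lines):
    """Yield (word, total); relies on equal words arriving on adjacent lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def main():
    for word, total in reduce_counts(sys.stdin):
        print(f"{word}\t{total}")


if __name__ == "__main__":
    main()
```

Using `itertools.groupby` keeps only one word's counts in memory at a time, but it gives correct totals only when the input is sorted by key.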
(3) Run the Code on Hadoop
Before running the task on Hadoop, it is a good idea to test mapper.py and reducer.py locally on a small data set to check that they run successfully. The command for the local test is given below.
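One way to run such a local test is to emulate the map, sort, and reduce stages in a small Python driver. The sketch below is an assumption about how this could be done, not the original command from this post; the script name `local_test.py` and the helper `run_pipeline` are hypothetical.

```python
#!/usr/bin/env python3
"""local_test.py: emulate mapper -> sort -> reducer on a small file, without Hadoop."""
import subprocess
import sys


def run_pipeline(sample_path, mapper_cmd, reducer_cmd):
    """Return the reducer's output (as text) for the given sample file."""
    with open(sample_path, "rb") as sample:
        mapped = subprocess.run(
            mapper_cmd, stdin=sample, capture_output=True, check=True
        ).stdout
    # Sorting stands in for Hadoop's shuffle phase: it puts equal keys on adjacent lines.
    lines = sorted(mapped.splitlines())
    shuffled = b"".join(line + b"\n" for line in lines)
    reduced = subprocess.run(
        reducer_cmd, input=shuffled, capture_output=True, check=True
    ).stdout
    return reduced.decode("utf-8")


def main(argv):
    # argv[1] is the path to a small sample of the data set.
    mapper_cmd = [sys.executable, "mapper.py"]
    reducer_cmd = [sys.executable, "reducer.py"]
    print(run_pipeline(argv[1], mapper_cmd, reducer_cmd), end="")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

With mapper.py and reducer.py in the current directory, this would be invoked as `python3 local_test.py sample_tweets.json`, where `sample_tweets.json` is a placeholder for your own small sample file.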
If the local test produces the expected result, we can run the task on Hadoop. First, we put the data set into HDFS, and then we run the two Python files on Hadoop.
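The two steps can be sketched as a small Python script that builds the upload and Hadoop Streaming commands. Every path here is a placeholder and an assumption, including the location of the streaming jar, which differs between installations. The script only prints the commands so they can be checked before anything is run.

```python
#!/usr/bin/env python3
"""run_job.py: build the HDFS upload and Hadoop Streaming commands."""
import shlex

# All values below are placeholders (assumptions); adjust them to your cluster.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"
LOCAL_DATA = "tweets.json"
HDFS_INPUT = "/user/hadoop/wordcount/input"
HDFS_OUTPUT = "/user/hadoop/wordcount/output"  # must not exist before the job runs


def upload_commands(local_path, hdfs_dir):
    """Commands that create the HDFS input directory and copy the data into it."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
    ]


def streaming_command(jar, hdfs_input, hdfs_output):
    """Command that runs mapper.py and reducer.py as a Hadoop Streaming job."""
    return [
        "hadoop", "jar", jar,
        # Generic option: ships both scripts to the worker nodes.
        # It has to come before the streaming-specific options.
        "-files", "mapper.py,reducer.py",
        "-mapper", "python3 mapper.py",
        "-reducer", "python3 reducer.py",
        "-input", hdfs_input,
        "-output", hdfs_output,
    ]


if __name__ == "__main__":
    commands = upload_commands(LOCAL_DATA, HDFS_INPUT)
    commands.append(streaming_command(STREAMING_JAR, HDFS_INPUT, HDFS_OUTPUT))
    for cmd in commands:
        print(shlex.join(cmd))
```

To execute a command instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)`. When the job finishes, the word counts are written to part files under the output directory.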
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.