Madoop: Michigan Hadoop

Michigan Hadoop (madoop) is a lightweight MapReduce framework for education. Madoop implements the Hadoop Streaming interface. It is written in Python and runs on a single machine.

For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our Hadoop Streaming tutorial.

Quick start

Install Madoop.

$ pip install madoop

Create an example MapReduce program with input files.

$ madoop --example
$ tree example
example
├── input
│   ├── input01.txt
│   └── input02.txt
├── map.py
└── reduce.py

Run the example word count MapReduce program.

$ madoop \
  -input example/input \
  -output example/output \
  -mapper example/map.py \
  -reducer example/reduce.py

Concatenate and print the output.

$ cat example/output/part-*
Goodbye 1
Bye 1
Hadoop 2
World 2
Hello 2
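
For readers who have not seen a Hadoop Streaming program before, here is a minimal sketch of what a word count mapper and reducer can look like. It is not the code that `madoop --example` generates, which may differ. The function names `map_lines` and `reduce_lines` are hypothetical, and the tab separator between key and value is an assumption based on the usual Hadoop Streaming convention.

```python
"""Word count in the Hadoop Streaming style (illustrative sketch only)."""
import itertools


def map_lines(lines):
    """Emit one 'word<TAB>1' line for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts for each word.

    The input must already be sorted so that lines with the same key
    are adjacent, which is what the sort step between map and reduce
    provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Small in-memory demonstration: map, sort, then reduce.
demo_input = ["Hello World", "Hello Hadoop"]
for output_line in reduce_lines(sorted(map_lines(demo_input))):
    print(output_line)
```

In a real streaming script, the mapper would read its lines from `sys.stdin` and print each result to standard output, and the reducer would do the same in a separate file.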

Comparison with Apache Hadoop and CLI

Madoop implements a subset of the Hadoop Streaming interface. You can simulate the Hadoop Streaming interface at the command line with cat and sort.

Here's how to run our example MapReduce program on Apache Hadoop.

$ hadoop \
    jar path/to/hadoop-streaming-X.Y.Z.jar \
    -input example/input \
    -output output \
    -mapper example/map.py \
    -reducer example/reduce.py
$ cat output/part-*

Here's how to run our example MapReduce program at the command line using cat and sort.

$ cat input/* | ./map.py | sort | ./reduce.py

| Madoop | Hadoop | cat/sort |
| --- | --- | --- |
| Implement some Hadoop options | All Hadoop options | No Hadoop options |
| Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer |
| Single machine | Many machines | Single machine |
| jar hadoop-streaming-X.Y.Z.jar argument ignored | jar hadoop-streaming-X.Y.Z.jar argument required | No arguments |
| Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted |
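
The difference between "one mapper, one reducer" and "multiple mappers and reducers" can be sketched in plain Python. The code below is a rough illustration, not Madoop's implementation. All function names are hypothetical. Partitioning keys with a CRC32 hash is an assumption chosen because it is deterministic, and Madoop or Hadoop may partition differently.

```python
"""Simulate map -> sort -> reduce in-process (illustrative sketch only)."""
import itertools
import zlib


def key_of(line):
    """Return the key of a 'key<TAB>value' line."""
    return line.split("\t", 1)[0]


def word_mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def count_reducer(sorted_lines):
    """Sum the values of adjacent lines that share a key."""
    for key, group in itertools.groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{key}\t{total}"


def run_single(lines, mapper, reducer):
    """Rough equivalent of `cat | map | sort | reduce`."""
    return list(reducer(sorted(mapper(lines))))


def run_partitioned(lines, mapper, reducer, num_reducers=2):
    """Route each key to one of several reducers, sorting each partition.

    Every line with a given key lands in the same partition, so each
    reducer still sees complete, sorted groups.
    """
    partitions = [[] for _ in range(num_reducers)]
    for line in mapper(lines):
        index = zlib.crc32(key_of(line).encode("utf-8")) % num_reducers
        partitions[index].append(line)
    return [list(reducer(sorted(part))) for part in partitions]


data = ["Hello World", "Hello Hadoop", "Goodbye World"]
print(run_single(data, word_mapper, count_reducer))
print(run_partitioned(data, word_mapper, count_reducer, num_reducers=3))
```

Each partition plays the role of one `part-*` output file. Combining the partitions gives the same counts as the single-reducer run, though the order of lines across partitions is not defined.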

Contributing

Contributions from the community are welcome! Check out the guide for contributing.

Acknowledgments

Michigan Hadoop is written by Andrew DeOrio.