HIPI - Hadoop Image Processing Framework

Getting Started

This page provides a quick start guide to setting up HIPI on your system and writing your first MapReduce/HIPI program.

1. Setup Java >=7

HIPI is written in Java and has been tested with Java 7 and 8. Check your version of Java with the following command:
$> java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
        
Visit Oracle's Java page to install the latest version.

2. Setup Hadoop

HIPI works with a standard installation of the Apache Hadoop Distributed File System (HDFS) and MapReduce. HIPI has been tested with Hadoop version 2.7.1.

If you haven't already done so, download and install Hadoop by following the instructions on the official Apache Hadoop website. For first-time users, two useful resources are the Quickstart Guide and the Single Cluster Setup.

Verify that the main Hadoop script is reachable from your path:
$> which hadoop
/usr/local/bin/hadoop
      	
The tools and example programs that come with HIPI assume that this is the case.

3. Setup Gradle

The HIPI distribution uses the Gradle build automation system to manage compilation and package assembly. HIPI has been tested with Gradle version 2.5.

Install Gradle on your system and verify that it is reachable as well:
$> which gradle
/usr/local/bin/gradle
      	

4. Install HIPI

There are two ways to install HIPI on your system:
  1. Clone the latest HIPI distribution from GitHub and build from source. (Recommended)
  2. Download a precompiled JAR from the downloads page.

Clone the HIPI GitHub Repository

The best way to get the latest version of HIPI is by cloning the official GitHub repository and building it, along with all of the tools and example programs, yourself. This only takes a few minutes and verifies that your system is properly set up and ready for you to begin developing your own HIPI applications:
$> git clone git@github.com:uvagfx/hipi.git
        

Build the HIPI Library and Example Programs

From the HIPI root directory, simply run gradle to build the HIPI library along with all of the tools and example programs:
$> cd hipi
$> gradle
:core:compileJava
:core:processResources
:core:classes
:core:jar
:tools:downloader:compileJava
:tools:downloader:processResources
:tools:downloader:classes
:tools:downloader:jar
:tools:dumpHib:compileJava
:tools:dumpHib:processResources
:tools:dumpHib:classes
:tools:dumpHib:jar
...
:install

Finished building the HIPI library along with all tools and examples.

BUILD SUCCESSFUL

Total time: 2.058 secs	
        
If the build fails, first carefully review the steps above. If you are convinced that you are doing everything correctly and that you've found an issue with the HIPI distribution or documentation, please post a question to the HIPI Users Group or file a bug report.

After the build finishes, you may want to inspect the settings.gradle file in the root directory and the build.gradle files in each directory to familiarize yourself with the various build targets. If you're new to Gradle, we recommend reviewing this tutorial. For example, to build only the hibImport tool from scratch:
$> gradle clean tools:hibImport:jar
:core:clean
...
:core:compileJava
:core:processResources UP-TO-DATE
:core:classes
:core:jar
:tools:hibImport:compileJava
:tools:hibImport:processResources UP-TO-DATE
:tools:hibImport:classes
:tools:hibImport:jar

BUILD SUCCESSFUL

Total time: 1.197 secs
        
HIPI is now installed on your system. To learn about future updates to the HIPI distribution, you should join the HIPI Users Group and watch the HIPI GitHub repository. You can always obtain the latest version of HIPI on the release branch with the following git pull command:
$> git pull origin release
From github.com:uvagfx/hipi
 * branch            release    -> FETCH_HEAD
Already up-to-date.
       	
Also, you can experiment with the development branch, which contains the latest features that have not yet been integrated into the main release branch. Note that the development branch is generally less stable than the release branch.

5. Setup Eclipse (Optional)

If you would like to integrate HIPI into an Eclipse project (useful for debugging), take a look at our Eclipse Setup Guide.

Next, we will walk you through the process of writing your first HIPI program. Be sure to also check out the tools and example programs to learn more about HIPI.

Your First HIPI Program

This section will walk you through the process of creating a very simple HIPI program that computes the average pixel color over a set of images. First, we need a set of images to work with. Recall that the primary input type to a HIPI program is a HipiImageBundle (HIB), which stores a collection of images on the Hadoop Distributed File System (HDFS). Use the hibImport tool to create a HIB from a collection of images in the directory ~/SampleImages on your local file system by executing the following command from the HIPI root directory:
$> tools/hibImport.sh ~/SampleImages sampleimages.hib
Input image directory: /Users/jason/SampleImages
Output HIB: sampleimages.hib
Overwrite HIB if it exists: false
HIPI: Using default blockSize of [134217728].
HIPI: Using default replication factor of [1].
 ** added: 1.jpg
 ** added: 2.jpg
 ** added: 3.jpg
Created: sampleimages.hib and sampleimages.hib.dat
        
If this command fails, double check that you successfully built the HIPI library and the tools by following the directions above.

Note that hibImport actually creates two files in the current working directory on the HDFS: sampleimages.hib and sampleimages.hib.dat. You can verify this with the command hadoop fs -ls. (You can learn how the hibImport tool works here after finishing this tutorial.)

You can use the handy hibInfo tool that comes with HIPI to inspect the contents of this newly created HIB file:
$> tools/hibInfo.sh sampleimages.hib --show-meta
Input HIB: sampleimages.hib
Display meta data: true
Display EXIF data: false
IMAGE INDEX: 0
   640 x 480
   format: 1
   meta: {source=/Users/hipiuser/SampleImages/1.jpg}
IMAGE INDEX: 1
   3210 x 2500
   format: 1
   meta: {source=/Users/hipiuser/SampleImages/2.jpg}
IMAGE INDEX: 2
   3810 x 2540
   format: 1
   meta: {source=/Users/hipiuser/SampleImages/3.jpg}
Found [3] images.
        
Note that your specific output may vary from what is shown above since you will be working with different images and different paths. Run hibInfo.sh without any arguments to see a description of its usage.

Next, following the conventions of Gradle, create a source directory hierarchy for your program by executing the following command in the root directory:
$> mkdir -p examples/helloWorld/src/main/java/org/hipi/examples
        
Next, let's add a Gradle build task for our new program by creating the file examples/helloWorld/build.gradle with the following contents:
jar {
  manifest {
    attributes("Main-Class": "org.hipi.examples.HelloWorld")
  }
}
       	
We also need to update the settings.gradle file in the root directory to tell Gradle about this new build target:
include ':core', ':tools:hibImport', ... ':examples:covar', ':examples:helloWorld'
        
Next, create a new Java source file at examples/helloWorld/src/main/java/org/hipi/examples/HelloWorld.java that contains the following code:
package org.hipi.examples;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HelloWorld extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    System.out.println("Hello HIPI!");
    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new HelloWorld(), args);
    System.exit(0);
  }

}
	      
The entry point of every Java program is the public static void main() method. As in most MapReduce applications, the main() method in our program uses Hadoop's ToolRunner class to call the run() method of this driver class.
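ToolRunner.run() returns whatever integer value run() returns. The version of main() above always exits with status 0; if you would like the shell to see a non-zero exit status when run() reports a failure, a small optional variation (not required by HIPI) is to pass the return value through to System.exit():

  public static void main(String[] args) throws Exception {
    // Propagate the value returned by run() so the calling shell can detect failure
    int exitCode = ToolRunner.run(new HelloWorld(), args);
    System.exit(exitCode);
  }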

Build this very simple program by running the command gradle jar in the examples/helloWorld directory:
$> cd examples/helloWorld
$> gradle jar
:core:compileJava UP-TO-DATE
:core:processResources UP-TO-DATE
:core:classes UP-TO-DATE
:core:jar UP-TO-DATE
:examples:helloWorld:compileJava UP-TO-DATE
:examples:helloWorld:processResources UP-TO-DATE
:examples:helloWorld:classes UP-TO-DATE
:examples:helloWorld:jar

BUILD SUCCESSFUL

Total time: 1.191 secs
        
If the build succeeds, it produces the JAR file examples/helloWorld/build/libs/helloWorld.jar. Run this program using the following command from within the examples/helloWorld directory:
  $> hadoop jar build/libs/helloWorld.jar
  Hello HIPI!
        
Congratulations! You just created a very simple MapReduce program. Now let's make our program do some image processing with HIPI.

MapReduce

Hadoop's MapReduce parallel programming framework is a powerful tool for large-scale distributed computing. If this is your first experience with MapReduce, we recommend reading the official Apache MapReduce tutorial, which gives a nice introduction to this programming model. Another great read is the seminal paper written by Jeffrey Dean and Sanjay Ghemawat at Google titled MapReduce: Simplified Data Processing on Large Clusters.

First, let's extend the run() method in HelloWorld.java to initialize and execute a MapReduce job and create stubs for our Mapper and Reducer classes:
package org.hipi.examples;

import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;
import org.hipi.imagebundle.mapreduce.HibInputFormat;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class HelloWorld extends Configured implements Tool {
  
  public static class HelloWorldMapper extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {
    public void map(HipiImageHeader key, FloatImage value, Context context) 
      throws IOException, InterruptedException {
    }
  }
  
  public static class HelloWorldReducer extends Reducer<IntWritable, FloatImage, IntWritable, Text> {
    public void reduce(IntWritable key, Iterable<FloatImage> values, Context context) 
      throws IOException, InterruptedException {
    }
  }
  
  public int run(String[] args) throws Exception {
    // Check input arguments
    if (args.length != 2) {
      System.out.println("Usage: helloWorld <input HIB> <output directory>");
      System.exit(0);
    }
    
    // Initialize and configure MapReduce job
    Job job = Job.getInstance();
    // Set input format class which parses the input HIB and spawns map tasks
    job.setInputFormatClass(HibInputFormat.class);
    // Set the driver, mapper, and reducer classes which express the computation
    job.setJarByClass(HelloWorld.class);
    job.setMapperClass(HelloWorldMapper.class);
    job.setReducerClass(HelloWorldReducer.class);
    // Set the types for the key/value pairs passed to/from map and reduce layers
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(FloatImage.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    
    // Set the input and output paths on the HDFS
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Execute the MapReduce job and block until it completes
    boolean success = job.waitForCompletion(true);
    
    // Return success or failure
    return success ? 0 : 1;
  }
  
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new HelloWorld(), args);
    System.exit(0);
  }
  
}
	
Most of this code imports necessary Hadoop and HIPI libraries and configures and launches the MapReduce Job. This type of code will become somewhat boilerplate across the MapReduce/HIPI programs you develop, but it's still important to understand what is going on.

The first lines of code in the run() method validate the arguments passed to the program, create the Hadoop Job object, and call setter methods on that object to specify the classes that implement the map and reduce tasks along with the types of the key/value pairs passed between these processing stages. The remaining lines of code set up the paths to the input file and the output directory and launch the job. The code descriptions on the tools and examples page give much more detail about these parts of a HIPI program, which we will skip for now. Instead we will focus on the higher-level algorithm that goes in the map() and reduce() methods.
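One optional setting is worth mentioning here. As we will see below, every record emitted by our mapper uses the same key, so all of the per-image results are gathered into a single reduce() call. If you also want to guarantee that the job writes exactly one output file, Hadoop's Job API lets you fix the number of reduce tasks; the following line (not part of the listing above) could be added alongside the other setters in run():

    // Optional: run a single reduce task, which produces a single part-r-00000 output file
    job.setNumReduceTasks(1);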

Before proceeding further, test that your code still compiles and runs by repeating the steps above, but don't expect it to do anything yet.

Computing The Average Pixel Color

Now let's add some actual HIPI image processing code to our program. For this example, we will compute the average RGB value of the pixels in the images in our input HIB. Our mapper will compute the average pixel color over a single image, and the reducer will add these per-image averages together and divide by their count to obtain the overall average pixel color. Because the map tasks are executed in parallel, a Hadoop cluster with more than one compute node can perform this entire operation faster than a single machine. This is the key idea behind parallel computing with MapReduce.

Here is what our map() method looks like:
  public static class HelloWorldMapper extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {
    
    public void map(HipiImageHeader key, FloatImage value, Context context) 
        throws IOException, InterruptedException {

      // Verify that image was properly decoded, is of sufficient size, and has three color channels (RGB)
      if (value != null && value.getWidth() > 1 && value.getHeight() > 1 && value.getNumBands() == 3) {

        // Get dimensions of image
        int w = value.getWidth();
        int h = value.getHeight();

        // Get pointer to image data
        float[] valData = value.getData();

        // Initialize 3 element array to hold RGB pixel average
        float[] avgData = {0,0,0};

        // Traverse image pixel data in raster-scan order and update running average
        for (int j = 0; j < h; j++) {
          for (int i = 0; i < w; i++) {
            avgData[0] += valData[(j*w+i)*3+0]; // R
            avgData[1] += valData[(j*w+i)*3+1]; // G
            avgData[2] += valData[(j*w+i)*3+2]; // B
          }
        }

        // Create a FloatImage to store the average value
        FloatImage avg = new FloatImage(1, 1, 3, avgData);

        // Divide by number of pixels in image
        avg.scale(1.0f/(float)(w*h));

        // Emit record to reducer
        context.write(new IntWritable(1), avg);

      } // If (value != null...
      
    } // map()

  } // HelloWorldMapper
	
The first two arguments of the map() method are a key/value pair (often called a "record" in MapReduce terminology) constructed by the HibInputFormat and HibRecordReader classes. In this case, these two arguments are a HipiImageHeader (the "key") and a FloatImage (the "value"), respectively. In HIPI, the first argument of the map() method must always be a HipiImageHeader, but the second argument can be any type that extends the abstract base class HipiImage. This gives you, the developer, control over how images are decoded into memory.
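As a purely illustrative example of what this separation buys you, the header can be inspected before any work is done on the decoded pixel data. The sketch below filters records by the dimensions reported in the header; the getWidth() and getHeight() accessors on HipiImageHeader are assumed here, so check the HIPI API documentation for the exact method names:

package org.hipi.examples;

import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class HeaderFilterMapper extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {

  public void map(HipiImageHeader key, FloatImage value, Context context)
      throws IOException, InterruptedException {

    // Skip records whose header reports a degenerate image size
    if (key == null || key.getWidth() < 2 || key.getHeight() < 2) {
      return;
    }

    // Forward the decoded image unchanged; a real mapper would compute something here
    context.write(new IntWritable(1), value);
  }
}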

Note that HelloWorldMapper's map() method emits one record for each image in the HIB via the context.write() method, sending it to the reduce processing stage. Each record consists of an IntWritable key (always equal to 1) and a FloatImage value that contains the image's computed average pixel color. The MapReduce framework collects these records and delivers them to the reduce() method as an Iterable of FloatImage objects, where they are added together and normalized to obtain the final result:
  public static class HelloWorldReducer extends Reducer<IntWritable, FloatImage, IntWritable, Text> {

    public void reduce(IntWritable key, Iterable<FloatImage> values, Context context)
        throws IOException, InterruptedException {

      // Create FloatImage object to hold final result
      FloatImage avg = new FloatImage(1, 1, 3);

      // Initialize a counter and iterate over IntWritable/FloatImage records from mapper
      int total = 0;
      for (FloatImage val : values) {
        avg.add(val);
        total++;
      }

      if (total > 0) {
        // Normalize sum to obtain average
        avg.scale(1.0f / total);
        // Assemble final output as string
        float[] avgData = avg.getData();
        String result = String.format("Average pixel value: %f %f %f", avgData[0], avgData[1], avgData[2]);
        // Emit output of job which will be written to HDFS
        context.write(key, new Text(result));
      }

    } // reduce()

  } // HelloWorldReducer
	
Next, build helloWorld.jar and run it using the HIB we created at the beginning:
$> gradle jar
:core:compileJava UP-TO-DATE
:core:processResources UP-TO-DATE
:core:classes UP-TO-DATE
:core:jar UP-TO-DATE
:examples:helloWorld:compileJava
:examples:helloWorld:processResources UP-TO-DATE
:examples:helloWorld:classes
:examples:helloWorld:jar

BUILD SUCCESSFUL

Total time: 0.855 secs
  
$> hadoop jar build/libs/helloWorld.jar sampleimages.hib sampleimages_average
...
        
If everything goes as planned the directory sampleimages_average will contain two files:
$> hadoop fs -ls sampleimages_average
Found 2 items
-rw-r--r--   1 user group        0 2015-03-13 09:52 sampleimages_average/_SUCCESS
-rw-r--r--   1 user group       50 2015-03-13 09:52 sampleimages_average/part-r-00000
        
Whenever a MapReduce program successfully finishes, it creates the file _SUCCESS in the output directory along with a part-r-XXXXX file for each reduce task. The average pixel value can be retrieved using the cat command:
$> hadoop fs -cat sampleimages_average/part-r-00000
1 Average pixel value: 0.321921 0.224995 0.150284
        
Feel free to play around with different image sets and see how they affect the average pixel color. (Note: before running the program a second time, you will need to remove the output directory with the command hadoop fs -rm -R sampleimages_average.)
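If you prefer to handle this cleanup from inside the program, a minimal sketch (an optional addition to the run() method, not part of the original example) is to delete the output path with Hadoop's FileSystem API before the job is submitted. This requires one extra import, org.apache.hadoop.fs.FileSystem, at the top of HelloWorld.java:

    // Inside run(), after the Job has been created and before waitForCompletion():
    Path outputPath = new Path(args[1]);
    FileSystem fs = FileSystem.get(job.getConfiguration());
    if (fs.exists(outputPath)) {
      // Recursively delete the output of a previous run so the new job can write to the same location
      fs.delete(outputPath, true);
    }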

Next

Read the descriptions of the other tools and example programs or jump into the documentation to learn more about HIPI.