HIPI - Hadoop Image Processing Interface

tools/hibDump

hibDump is a simple MapReduce program that illustrates the basics of HIPI. It takes a HipiImageBundle (HIB) as input and writes a single text file to the HDFS listing several properties of each image in the HIB: its width, its height, the value of its "source" meta data record, and the camera model stored in its EXIF data. hibDump could easily be extended to query other information and, in this way, may serve as a natural starting point for custom HIPI programs.

Compiling

Compile hibDump by executing the following command in the HIPI tools directory (see our general notes on setting up HIPI on your system):
$> cd tools
$> gradle hibDump:jar
      

Usage

Run hibDump by executing the hibDump.sh script located in the tools directory. As with all of the tools scripts, running it without any arguments shows its usage:
$> ./hibDump.sh 
Usage: hibDump <input HIB> <output directory>
      
hibDump takes two arguments. The first is the path to a HIB on the HDFS. The second is the HDFS path to the output directory, which will be created once the program has finished. The resulting image information will be stored as a text file named part-r-00000 in this directory.

Example

If the file tigers.hib exists in the current working directory on the HDFS, then the following command would produce a text file at tigers/part-r-00000 that contains basic information about its contents:
$> ./hibDump.sh tigers.hib tigers
...
[output omitted]
      
$> hadoop fs -ls tigers
Found 2 items
-rw-r--r--   1 user group          0 2015-03-11 20:46 tigers/_SUCCESS
-rw-r--r--   1 user group        249 2015-03-11 20:46 tigers/part-r-00000
      
$> hadoop fs -cat tigers/part-r-00000
1 3210x2500 (/Users/hipiuser/Desktop/Tigers/4.PNG)   null
1 3810x2540 (/Users/hipiuser/Desktop/Tigers/3.jpg)   Canon EOS 450D
1 3210x2500 (/Users/hipiuser/Desktop/Tigers/2.JPG)   Canon EOS 450D
1 640x480   (/Users/hipiuser/Desktop/Tigers/1.jpg)   null
	    
Note that the text file part-r-00000 contains four lines, one for each image in tigers.hib (the order in which the images are listed may vary). The first column is always 1, the second column reports the resolution of each image as width x height, the third column is the value of the "source" key/value pair in the image meta data, and the fourth column lists the camera model stored in the image EXIF data (if available).

How hibDump works

One of the reasons we developed hibDump was to help illustrate how HIPI works. hibDump is implemented in the file tools/hibDump/src/main/java/org/hipi/tools/HibDump.java (relative to the HIPI root directory). As with any MapReduce program, this file defines a driver class, a mapper class, and a reducer class.

The HibDump class is the driver class. This class is responsible for setting up the MapReduce job (e.g., specifying configuration parameters) and launching the job.

The HibDumpMapper class defined within the HibDump class is the mapper class. This class defines the operation of the map tasks, which are executed in parallel to process the input HIB file image by image. As in any MapReduce program, this class must define a map() method that receives key/value pairs (aka "records") from the underlying record reader. This is often where the bulk of the high-level algorithm is implemented, so it's well worth studying the definition of this method:
public void map(HipiImageHeader header, ByteImage image, Context context) throws IOException, InterruptedException
	    
Note that the key/value pair received by this method consists of a HipiImageHeader and a ByteImage, respectively. Mapper classes that use HIPI (technically, MapReduce programs that use the HibInputFormat) must define a map() method whose first two arguments are a HipiImageHeader followed by an object derived from the abstract base class HipiImage. Specifying a ByteImage as the second argument, as is done here, causes HIPI to decode the image pixel data into a flat array of Java bytes. HIPI takes care of all the low-level (but very important) details of how to efficiently read and decode the image data stored in the HIB and deliver it to the map() method in the form of this requested high-level Java object, freeing the developer to focus on the high-level image processing algorithms.
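
For example, a mapper could instead request HIPI's FloatImage class, which delivers the pixel data as a flat array of Java floats. The following minimal sketch (hypothetical, not part of hibDump; imports omitted, as in the snippets below) shows that only the type in the signature needs to change:
public static class FloatDumpMapper extends Mapper<HipiImageHeader, FloatImage, IntWritable, Text> {

  @Override
  public void map(HipiImageHeader header, FloatImage image, Context context) throws IOException, InterruptedException {
    // The same header queries used by HibDumpMapper would work here;
    // only the pixel representation that HIPI decodes has changed.
  }
}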

Finally, the HibDumpReducer class is the reducer class. It defines the operation of the reduce tasks, which receive their input from the map tasks and often (though not always) consolidate and further process this data before writing their output to the HDFS.

Now let's look at each of these key classes in detail.

The Driver Class: HibDump

This class defines a main() method, which is the entry point of the MapReduce program. This method simply calls the run() method in the HibDump driver class using the standard Hadoop ToolRunner class:
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(new HibDump(), args);
  System.exit(res);
}
	
The run() method is responsible for configuring the MapReduce program. It first specifies the driver, mapper, and reducer classes:
public int run(String[] args) throws Exception {
  ...
  Configuration conf = new Configuration();

  Job job = Job.getInstance(conf, "hibDump");

  job.setJarByClass(HibDump.class);
  job.setMapperClass(HibDumpMapper.class);
  job.setReducerClass(HibDumpReducer.class);
	
Next, the run() method specifies the input format type and the types of objects that will be passed between the mapper and reducer:
  ...
  job.setInputFormatClass(HibInputFormat.class);
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(Text.class);
  ...
	
The first line indicates that the task of constructing key/value pairs for the mapper will be handled by the HibInputFormat class, which is a key part of the HIPI library. The second and third lines specify that the output of this MapReduce program (as well as the output of the individual map tasks) will be key/value pairs consisting of IntWritable objects and Text objects, respectively. These classes are used in place of the Java types int and String because Hadoop requires keys and values to be serializable and comparable; the IntWritable and Text Hadoop classes encapsulate these Java types while providing this added functionality.
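
To make this concrete, here is a small illustration (not part of hibDump) of wrapping and unwrapping these types:
IntWritable key = new IntWritable(1);   // wraps a Java int
Text value = new Text("500x375");       // wraps a Java String

int k = key.get();                      // recover the plain int
String v = value.toString();            // recover the plain String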

Important: If your mapper's output types differ from your job's output types, then you must specify them with two additional calls:
  ...
  job.setMapOutputKeyClass(SomeClass.class);
  job.setMapOutputValueClass(SomeClass.class);
  ...
	
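For example, a hypothetical job whose map tasks emit Text/IntWritable records but whose final output consists of IntWritable/Text records would configure both sets of classes:
  ...
  job.setMapOutputKeyClass(Text.class);           // key type emitted by map()
  job.setMapOutputValueClass(IntWritable.class);  // value type emitted by map()
  job.setOutputKeyClass(IntWritable.class);       // key type emitted by reduce()
  job.setOutputValueClass(Text.class);            // value type emitted by reduce()
  ...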
The last few lines of code in the run() method set the input and output paths, set the number of reduce tasks (in this case only one), and execute the job:
  ...
  FileInputFormat.setInputPaths(job, new Path(inputPath));
  FileOutputFormat.setOutputPath(job, new Path(outputPath));

  job.setNumReduceTasks(1);

  return job.waitForCompletion(true) ? 0 : 1;

}
	
Having only one reduce task forces the output to be written to a single file.
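For comparison, a hypothetical configuration with more reduce tasks would split the output across multiple part files:
  job.setNumReduceTasks(4);  // hypothetical: would yield part-r-00000 through part-r-00003
In hibDump this would gain nothing: because every record is emitted with the same key, all of the records would still be routed to a single one of the four files, leaving the other three empty.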

The Mapper Class: HibDumpMapper

A particularly important class in the HIPI library is the HibInputFormat class. This class is responsible for delivering parsed and decoded key/value pairs to the mapper in the form of a HipiImageHeader and a concrete object derived from the abstract base class HipiImage. The second argument in the map() method determines the type of object that HIPI uses to transfer the image pixel data. In the case of the map() method in HibDumpMapper, the ByteImage class is being used to represent the decoded image. This class stores the image pixel data as a flat array of 8-bit Java bytes in raster-scan interleaved order (RGBRGBRGB, etc.).
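
As a concrete illustration of this layout (a sketch, not code from HibDump), the flat-array index of any pixel value can be computed from the image width and band count:
// Band c of pixel (x, y) in a b-band image of width w sits at (y*w + x)*b + c.
public static int interleavedIndex(int w, int b, int x, int y, int c) {
  return (y * w + x) * b + c;
}
// Example: the green value (c = 1) of pixel (10, 4) in a 500-pixel-wide
// RGB image lives at index (4*500 + 10)*3 + 1 = 6031.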

The map() method in HibDumpMapper is pretty simple. It queries the HipiImageHeader object to obtain the spatial dimensions of the image, a string with the camera model from the image EXIF data (if available), and the value of the "source" key in the image meta data. These values are then assembled into a string. The only thing it does with the ByteImage argument is check that it is not null, which verifies that the image was successfully decoded.
public static class HibDumpMapper extends Mapper<HipiImageHeader, ByteImage, IntWritable, Text> {

  public void map(HipiImageHeader header, ByteImage image, Context context) throws IOException, InterruptedException  {

    String output = null;

    if (header == null) {
      output = "Failed to read image header.";
    } else if (image == null) {
      output = "Failed to decode image data.";
    } else {
      int w = header.getWidth();
      int h = header.getHeight();
      String source = header.getMetaData("source");
      String cameraModel = header.getExifData("Model");
      output = w + "x" + h + "\t(" + source + ")\t  " + cameraModel;
    }
    ...
        
The final step in the map() method is to emit this string, at which point it becomes input to one of the reduce tasks. As with any MapReduce program, the map() method technically emits a key/value pair (or record) by calling the write() method on the context object. In hibDump, the key is simply an IntWritable that is always set to 1 and the value is a Text object that wraps the output string. Using a single key ensures that all of the records are sent to the same reduce task. Since there is only one reduce task in this job, this ensures that a single output file will be produced that contains all of the image information:
    ...
    context.write(new IntWritable(1), new Text(output));
  }
	

The Reducer Class: HibDumpReducer

The reducer class must implement the reduce() method. In hibDump this method is very simple: it passes each key/value pair it receives from the map tasks through to the output of the entire job. The underlying MapReduce framework handles the final step of writing the key/value pairs output by each reduce task to the HDFS.
 
public static class HibDumpReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    
  @Override
  public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value);
    }
  }
    
}
	
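Because it simply forwards every record, HibDumpReducer is effectively an identity reducer. For contrast, here is a hypothetical variant (not part of hibDump) that counts the images instead of listing them:
public static class CountingReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {

  @Override
  public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    int count = 0;
    for (Text value : values) {
      count++;
    }
    context.write(key, new IntWritable(count));  // a single record: the total image count
  }
}
Because this variant's output value type (IntWritable) differs from the value type emitted by the map tasks (Text), the driver would also need the setMapOutputValueClass(Text.class) call described in the Important note above.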
Here is an example output file produced by running hibDump on a set of ten images downloaded from Flickr:
1 333x500 (http://farm7.staticflickr.com/6043/5903761694_73925517b5.jpg)    Canon EOS REBEL T2i
1 500x356 (http://farm8.staticflickr.com/7101/7165128731_656467c69e.jpg)    null
1 333x500 (http://farm1.staticflickr.com/184/375410166_f66bb309c6.jpg)      null
1 500x375 (http://farm4.staticflickr.com/3210/3666686294_8fd14356e2.jpg)    null
1 333x500 (http://farm4.staticflickr.com/3657/3620338550_c3b0213b9f.jpg)    null
1 500x333 (http://farm4.staticflickr.com/3526/5787850880_c28221457b.jpg)    Canon EOS 5D Mark II
1 500x334 (http://farm5.staticflickr.com/4053/4224177264_87a841e2b6.jpg)    null
1 500x332 (http://farm6.staticflickr.com/5461/9703646635_e7d37aa989.jpg)    null
1 500x334 (http://farm7.staticflickr.com/6125/5975790170_5d63ed0e92.jpg)    NIKON D60
1 500x375 (http://farm8.staticflickr.com/7123/7455047810_ea5b10a7a9.jpg)    null
	
Note that the first column contains the value of the key emitted by the reduce task, which, in this case, is always 1.

Next

Read about tools/hibDownload, a useful program for downloading a set of images from the Internet and storing them in a HIB.