HIPI - Hadoop Image Processing Framework


HIPI is a library for the Apache Hadoop programming framework that provides an API for performing image processing tasks in a distributed computing environment. Here is a system overview: The main input type used in HIPI is a HipiImageBundle (HIB). A HIB is a set of images combined into one large file along with some metadata describing the layout of the images. A HIB can be created from an existing set of images already located on the Hadoop Distributed File System (HDFS) or from a remote source (e.g. our Distributed Downloader example).

In order to improve efficiency, HIPI adds a culling step to the conventional MapReduce workflow. This allows discarding images that do not meet a set of criteria based on the image metadata (e.g., the image must be less than 10MP). Culling an image avoids the expensive step of decompressing the full pixel buffer into main memory. The user-specified CullMapper is invoked on each image that survives the culling stage. Images are presented to this class as a FloatImage object with an associated ImageHeader object. Although HIPI does not modify any of the default job scheduling behavior in MapReduce, you can modify execution parameters specific to image processing tasks through the HipiJob object during setup.
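The idea behind culling can be sketched in plain Java (this is not the HIPI API; the Header class and cull method below are hypothetical stand-ins for ImageHeader and CullMapper::cull): test cheap header metadata first, and only decode pixel data for images that survive.

```java
import java.util.ArrayList;
import java.util.List;

public class CullSketch {
    // Hypothetical stand-in for HIPI's ImageHeader: only the fields we test.
    static class Header {
        final int width, height;
        Header(int w, int h) { width = w; height = h; }
    }

    // Returns true when the image should be discarded (same convention as
    // CullMapper::cull): here, discard anything at or above 10 megapixels.
    static boolean cull(Header h) {
        return (long) h.width * h.height >= 10_000_000L;
    }

    public static void main(String[] args) {
        List<Header> headers = new ArrayList<>();
        headers.add(new Header(2592, 1944)); // ~5 MP, survives
        headers.add(new Header(4000, 3000)); // 12 MP, culled
        int kept = 0;
        for (Header h : headers) {
            if (!cull(h)) kept++; // expensive pixel decode would happen here
        }
        System.out.println("kept=" + kept); // prints kept=1
    }
}
```

The point is that the decision uses only the small header record, so culled images never pay the cost of decompressing their pixel buffers.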

Great care has been taken to ensure that the implementation of key steps in the Hadoop MapReduce workflow are efficient for image processing tasks. Please see our Experiments section for a more detailed analysis including performance benchmarks.

The Four Main Classes

The classes most frequently used by a practitioner are the HipiImageBundle, FloatImage, CullMapper, and HipiJob classes. These classes provide most of the functionality an average user will need and are described briefly below. The full API for HIPI can be found on our documentation page.

HIPI Image Bundle (HIB)

A HIPI Image Bundle (HIB) is a collection of images that are stored together in one file, somewhat analogous to a .tar archive in UNIX. HIBs are implemented via the HipiImageBundle class and can be used directly in the Hadoop MapReduce framework. Several common operations can be performed on HIBs.
Important Note:
HIBs must be opened in either FILE_MODE_READ or FILE_MODE_WRITE mode (see AbstractImageBundle::open), and once they have been opened in one mode, the mode cannot be switched.

Important Note:
When creating a HIB through the HipiImageBundle::create method, you can specify an optional parameter called blockSize. This parameter is very important because it controls how a HIB will be distributed to the map tasks. The ImageBundleInputFormat splits a HIB into sections based on the block size of the HIB itself in order to distribute the job most effectively. If the block size is set too large, however, only a small number of machines will end up processing all of the images in the HIB. It is therefore advised that the block size be chosen so that the number of blocks the entire HIB spans is roughly the number of nodes in the cluster.
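The advice above amounts to simple arithmetic. As a sketch (the helper and the numbers are illustrative, not part of the HIPI API), a reasonable block size is the total HIB size divided by the node count, rounded up so the nodes cover the whole bundle:

```java
public class BlockSizeSketch {
    // Pick a block size so the HIB spans roughly one block per cluster node.
    static long suggestBlockSize(long hibBytes, int numNodes) {
        // Round up so numNodes blocks are enough to cover the whole HIB.
        return (hibBytes + numNodes - 1) / numNodes;
    }

    public static void main(String[] args) {
        long hibBytes = 12L * 1024 * 1024 * 1024; // a 12 GB bundle
        int nodes = 16;                           // a 16-node cluster
        // Suggests 768 MB blocks: 16 splits, one per node.
        System.out.println(suggestBlockSize(hibBytes, nodes));
    }
}
```

In practice you would round the result to something sensible for HDFS (a multiple of the filesystem block size), but the goal is the same: enough splits to keep every node busy without fragmenting the bundle.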

You can create a HIPI Image Bundle from a set of files using the HipiImageBundle operations, or from some external source (e.g. Flickr) via our Distributed Downloader example. Please see the HIB javadoc page for more details on all of the HIB operations, and the experiments page for information regarding the design of HIBs.

Float Image

The primary input to Map tasks in HIPI is the FloatImage class. It is a simple representation of an image file as a set of pixels specified with single floating-point precision. Several common operations can be performed on a FloatImage.

A FloatImage is really just a three-dimensional set of floating-point values, so it can also be treated as a matrix (or tensor) and used accordingly. The function FloatImage::getData converts the FloatImage into its array representation, which can be used directly with f2j, the popular Java implementation of BLAS/LAPACK, to perform standard matrix operations. A FloatImage can likewise be created from an array of pixels via a special form of the FloatImage constructor.
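To make the "three-dimensional set of values backed by a flat array" idea concrete, here is a minimal self-contained sketch. It is not the FloatImage class itself, and the (row, column, band) layout with the band index varying fastest is an assumption for illustration:

```java
public class FlatImageSketch {
    final int width, height, bands;
    final float[] data; // flat backing array, like FloatImage::getData returns

    FlatImageSketch(int width, int height, int bands) {
        this.width = width;
        this.height = height;
        this.bands = bands;
        this.data = new float[width * height * bands];
    }

    // Map (x, y, band) to a flat index; band varies fastest in this layout.
    int index(int x, int y, int band) {
        return (y * width + x) * bands + band;
    }

    void setPixel(int x, int y, int band, float v) { data[index(x, y, band)] = v; }

    float getPixel(int x, int y, int band) { return data[index(x, y, band)]; }

    public static void main(String[] args) {
        FlatImageSketch img = new FlatImageSketch(4, 3, 3); // 4x3 RGB image
        img.setPixel(2, 1, 0, 0.5f);                        // red channel at (2, 1)
        System.out.println(img.getPixel(2, 1, 0));          // prints 0.5
    }
}
```

Because the pixels live in one contiguous float array, the same buffer can be handed to dense linear-algebra routines (e.g. f2j) without copying.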

As mentioned in the Overview section, HIPI takes a HIPI Image Bundle as input and sends records (key-value pairs) to user-specified Mapper functions, with the key being an object of type ImageHeader and the value being an object of type FloatImage. For each image in the HIB that passes the optional culling step, one such record is generated.

Cull Mapper

HIPI provides a way for users to discard, or cull, images that do not meet a set of criteria. The CullMapper class defines a special type of Hadoop Mapper that contains an extra function CullMapper::cull that can be used to test an ImageHeader associated with a FloatImage for arbitrary criteria.

In the example below, a CullMapper (which inherits directly from Mapper) is used to process only images taken with a Canon PowerShot S500 digital camera at dimensions 2592 by 1944. The function CullMapper::cull returns a boolean indicating whether the image described by the ImageHeader should be discarded (true) or processed (false):

    public static class MyMapper extends CullMapper<ImageHeader, FloatImage, NullWritable, FloatImage> {

        public boolean cull(ImageHeader key) throws IOException, InterruptedException {
            if (key.getEXIFInformation("Model").equals("Canon PowerShot S500")
                    && key.width == 2592 && key.height == 1944) {
                return false; // keep this image
            }
            return true; // discard everything else
        }

        public void map(ImageHeader key, FloatImage value, Context context)
                throws IOException, InterruptedException {
            // process each surviving image here
        }
    }


HIPI Job

HIPI provides a useful extension of the standard Hadoop Job class that allows a user to set parameters common to scenarios where HIPI is used. The two main operations that can be performed using a HipiJob are controlling speculative execution and compressing Map output. The former (HipiJob::set{Map,Reduce}SpeculativeExecution) controls whether Hadoop should run multiple instances of the same Map task to potentially increase performance. Whether this is a good idea is application-specific, but in general it should be enabled. This ensures that if a particular node in the Hadoop cluster is experiencing performance degradation, the entire job will not be affected: Hadoop automatically kills the slower task and uses the output from a different instance. Of course, this incurs the overhead of spawning more tasks than are actually needed.

The second operation (HipiJob::setCompressMapOutput) enables or disables compression of the output records from the Map tasks before they are sent to the Reduce tasks. This option should be enabled if a significant amount of data is being transferred between the two sets of tasks. Note that the efficacy of this approach depends on how well the records can be compressed and on the relationship between the size of the records, the bandwidth of the cluster, and the time spent compressing and decompressing the data.
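That tradeoff can be sketched with a rough back-of-envelope model: compression pays off when compressing, shipping the smaller payload, and decompressing together take less time than shipping the raw records. All the rates and sizes below are made up for illustration:

```java
public class CompressTradeoffSketch {
    // Time (seconds) to ship raw map output over the network.
    static double rawTime(double bytes, double bandwidth) {
        return bytes / bandwidth;
    }

    // Time (seconds) to compress, ship the smaller payload, and decompress.
    // ratio: compressed size / raw size; codecRate: codec throughput (B/s).
    static double compressedTime(double bytes, double bandwidth,
                                 double ratio, double codecRate) {
        double compressed = bytes * ratio;
        return bytes / codecRate        // compress on the Map side
             + compressed / bandwidth   // transfer fewer bytes
             + compressed / codecRate;  // decompress on the Reduce side
    }

    public static void main(String[] args) {
        double bytes = 1e9;       // 1 GB of Map output
        double bandwidth = 100e6; // 100 MB/s effective network bandwidth
        double ratio = 0.4;       // records compress to 40% of raw size
        double codecRate = 500e6; // 500 MB/s compress/decompress throughput
        System.out.printf("raw=%.1fs compressed=%.1fs%n",
                rawTime(bytes, bandwidth),
                compressedTime(bytes, bandwidth, ratio, codecRate));
    }
}
```

With these example numbers the compressed path wins (roughly 6.8 s versus 10 s); with incompressible records (ratio near 1.0) or a slow codec, the inequality flips and compression only adds overhead.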