HIPI - Hadoop Image Processing Interface

Overview

HIPI is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image processing with MapReduce-style parallel programs typically executed on a cluster. It provides a means of storing large collections of images on the Hadoop Distributed File System (HDFS) and making them available for efficient distributed processing. HIPI is developed and maintained by a growing number of developers from around the world.

The latest release of HIPI has been tested with Hadoop 2.6.0.

View HIPI Repository on GitHub

System Design

This diagram shows the organization of a typical MapReduce/HIPI program:

The primary input object to a HIPI program is a HipiImageBundle (HIB). A HIB is a collection of images represented as a single file on the HDFS. The HIPI distribution includes several tools for creating HIBs, including a MapReduce program that builds a HIB from a list of images downloaded from the Internet.
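A HIB can also be built programmatically with the HipiImageBundle class. The sketch below shows one plausible way to append local JPEG files to a new bundle on HDFS; the class and method names follow the HIPI distribution, but exact signatures (such as the open-mode constant and the addImage parameters) vary between releases, so treat this as illustrative rather than definitive.

```java
// Sketch (assumed HIPI API): build a HIB by appending local JPEG files.
import hipi.image.ImageHeader.ImageType;
import hipi.imagebundle.HipiImageBundle;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import java.io.FileInputStream;

public class CreateHib {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    HipiImageBundle hib = new HipiImageBundle(new Path("images.hib"), conf);
    hib.open(HipiImageBundle.FILE_MODE_WRITE, true); // create a new bundle

    // Append each JPEG passed on the command line to the bundle.
    for (String filename : args) {
      FileInputStream fis = new FileInputStream(filename);
      hib.addImage(fis, ImageType.JPEG_IMAGE);
      fis.close();
    }
    hib.close(); // flushes the bundle data file and its index to HDFS
  }
}
```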

The first processing stage of a HIPI program is a culling step that allows filtering the images in a HIB based on a variety of user-defined conditions like spatial resolution or criteria related to the image metadata. This functionality is achieved through the CullMapper HIPI class. Images that are culled are never fully decoded and decompressed, saving valuable processing time.
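A culling condition might look like the following sketch, which rejects images below a minimum spatial resolution. The CullMapper class and its cull() hook are part of HIPI, but the exact accessor names on ImageHeader and the type parameters are assumptions here; the resolution threshold is a hypothetical example.

```java
// Sketch (assumed HIPI API): cull images below a minimum resolution.
// cull() inspects only the ImageHeader, so rejected images are never
// decoded or decompressed.
import hipi.image.FloatImage;
import hipi.image.ImageHeader;
import hipi.imagebundle.mapreduce.CullMapper;

import org.apache.hadoop.io.IntWritable;

public class SmallImageFilter
    extends CullMapper<ImageHeader, FloatImage, IntWritable, FloatImage> {

  // Return true to cull (skip) the image. The 640x480 cutoff is a
  // hypothetical user-defined condition.
  @Override
  public boolean cull(ImageHeader header) {
    return header.getWidth() < 640 || header.getHeight() < 480;
  }
}
```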

The images that survive the culling stage are assigned to individual map tasks in a way that attempts to maximize data locality, a cornerstone of the Hadoop MapReduce programming model. This functionality is achieved through the ImageBundleInputFormat class. Images are presented to map tasks as a FloatImage object with an associated ImageHeader object. The FloatImage class includes a number of useful methods like cropping, color conversion, addition, and scaling.
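As a concrete illustration of the map-stage interface, the sketch below reduces each decoded image to its average pixel value, emitted as a 1x1 FloatImage under a shared key so that a single reducer can combine the results. The ImageHeader/FloatImage key-value signature matches the description above; the individual FloatImage accessor names are assumptions based on the HIPI API and may differ by release.

```java
// Sketch (assumed HIPI API): map each image to a 1x1 average-color image.
import hipi.image.FloatImage;
import hipi.image.ImageHeader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class AveragePixelMapper
    extends Mapper<ImageHeader, FloatImage, IntWritable, FloatImage> {

  @Override
  public void map(ImageHeader header, FloatImage image, Context context)
      throws IOException, InterruptedException {
    if (image == null || image.getWidth() == 0 || image.getHeight() == 0)
      return;

    // Accumulate the mean value of each color band.
    float[] avg = new float[image.getBands()];
    for (int y = 0; y < image.getHeight(); y++)
      for (int x = 0; x < image.getWidth(); x++)
        for (int b = 0; b < image.getBands(); b++)
          avg[b] += image.getPixel(x, y, b);
    int numPixels = image.getWidth() * image.getHeight();
    for (int b = 0; b < avg.length; b++)
      avg[b] /= numPixels;

    // Emit under a single shared key so one reducer sees every result.
    context.write(new IntWritable(1), new FloatImage(1, 1, avg.length, avg));
  }
}
```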

The records emitted by the map stage are routed to reduce tasks according to the MapReduce shuffle algorithm, which attempts to minimize network traffic. Finally, the user-defined reduce tasks are executed in parallel and their output is collected and written to the HDFS.
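A matching reduce task might average the per-image results using FloatImage's addition and scaling methods. This sketch assumes a mapper that emits 1x1 average-color images under a shared key; the FloatImage constructor and the add/scale method names are assumptions based on the HIPI API.

```java
// Sketch (assumed HIPI API): average the 1x1 images emitted by the mappers.
import hipi.image.FloatImage;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class AveragePixelReducer
    extends Reducer<IntWritable, FloatImage, IntWritable, FloatImage> {

  @Override
  public void reduce(IntWritable key, Iterable<FloatImage> values,
      Context context) throws IOException, InterruptedException {
    FloatImage sum = null;
    int count = 0;
    for (FloatImage img : values) {
      if (sum == null)
        sum = new FloatImage(1, 1, img.getBands()); // zero-initialized
      sum.add(img);   // element-wise addition
      count++;
    }
    if (sum != null) {
      sum.scale(1.0f / count); // divide by the number of images
      context.write(key, sum); // mean pixel value over the whole HIB
    }
  }
}
```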

The goal of HIPI is to provide a simple and clean interface for high-throughput distributed image processing on the MapReduce platform. To this end, we performed a series of experiments showing that HIPI compares favorably to several common alternatives for representing and processing large collections of binary data in MapReduce.