Note: the Vietnamese version of this article is available at the link below.

https://duongnt.com/hdf5-with-h5py-vie


All things being equal, the more training data we have, the more accurate a model we can train. But at some point, the training data will become too big to fit into memory. Although it is possible to write a generator to read data directly from disk while training, doing so will incur a massive performance penalty due to all those I/O operations.

Today, we will take a look at the HDF5 format and see how it can help us improve processing time. All the code in this article is written in Python, using the h5py package. You can download it from the link below.

https://gist.github.com/duongntbk/8f5828f74b082d6c5136790498ab8023

Our sample dataset

Keep in mind that the actual content of our test data is not important. We only care about how to store and read them using the HDF5 format. Because of that, you can use whatever dataset you have lying around.

If you don’t have a dataset ready, you can download the Pokemon Image Dataset from here. It consists of 819 images in JPEG and PNG format. The dimensions of each image are 256×256 pixels. Today, we will process the JPEG images in the pokemon_jpg folder.

To simulate a training dataset, we will split those images into two classes. Please create two new folders inside pokemon_jpg and move half of the images into each one, as shown in the tree below. A short script that automates the split follows the tree.

.
└── pokemon_jpg
    ├── 0  (put 400 images in this folder)
    │   ├── 1.jpg
    │   ├── ...
    │   └── x.jpg
    └── 1  (put the rest in this folder)
        ├── y.jpg
        ├── ...
        └── z.jpg
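If you would rather not move 800-odd files by hand, here is a minimal sketch that automates the split. It assumes the images sit directly inside pokemon_jpg with .jpg extensions; the 400/419 split matches the tree above.

import os
import shutil

src = 'pokemon_jpg'
images = sorted(f for f in os.listdir(src) if f.endswith('.jpg'))

for label in ('0', '1'): # Create the two label folders
    os.makedirs(os.path.join(src, label), exist_ok=True)

# Move the first 400 images into folder 0 and the rest into folder 1
for i, name in enumerate(images):
    label = '0' if i < 400 else '1'
    shutil.move(os.path.join(src, name), os.path.join(src, label, name))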

A baseline solution

Keras supports reading images directly from disk via the image_dataset_from_directory method.

from tensorflow.keras.preprocessing import image_dataset_from_directory

dataset = image_dataset_from_directory(
    directory='pokemon_jpg',
    color_mode='rgb',
    batch_size=32,
    image_size=(256,256)
)

We can pass dataset directly to the fit method when training a model.

dataset = dataset.repeat() # Loop back to the start when we reach the end of the dataset
model.fit(dataset, steps_per_epoch=26, epochs=50) # 819 images / 32 per batch ≈ 26 steps; batch_size must not be passed to fit when training on a dataset

Or we can iterate through the content of dataset.

for samples, labels in dataset:
    print(samples.shape, labels.shape)
    break # Stop after the first batch, since the repeated dataset is infinite

Use the HDF5 format with the h5py package

HDF5 is a format designed to store and organize large amounts of data while keeping access to that data as fast and efficient as possible. You can learn more about the HDF5 format here. We will use a package called h5py to read and write HDF5 files. You can install h5py with the following command.

pip install h5py

Convert training data to HDF5 format

The first step is to open a new .hdf5 file for writing and create two datasets: one for the data and one for the labels.

import h5py

db = h5py.File('pokemon_jpeg.hdf5', 'w')
data = db.create_dataset('data', (819, 256, 256, 3), dtype='float32') # We have 819 images, each one 256x256 pixels with 3 color channels
labels = db.create_dataset('labels', (819,), dtype='int') # Each of those 819 images has a corresponding label

Then we need to read all the images and convert their data to tensors.

from tensorflow.keras.preprocessing.image import img_to_array, load_img

image_path_1 = 'pokemon_jpg/0/1.jpg' # The label for this image is 0
image_1 = load_img(image_path_1, target_size=(256,256), interpolation='bilinear')
image_1 = img_to_array(image_1, data_format='channels_last')

image_path_2 = 'pokemon_jpg/1/401.jpg' # The label for this image is 1
image_2 = load_img(image_path_2, target_size=(256,256), interpolation='bilinear')
image_2 = img_to_array(image_2, data_format='channels_last')

# ...

And we can write the tensors and their labels into our HDF5 file.

data[0] = image_1
labels[0] = 0
data[1] = image_2
labels[1] = 1
# ...

After that, we need to close the HDF5 file.

db.close()
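As a side note, h5py files also support Python's context-manager protocol, so we can let a with block handle the closing for us.

with h5py.File('pokemon_jpeg.hdf5', 'w') as db:
    data = db.create_dataset('data', (819, 256, 256, 3), dtype='float32')
    labels = db.create_dataset('labels', (819,), dtype='int')
    # ... write tensors and labels here ...
# The file is closed automatically when the block exits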

Note that the HDF5 file is much larger than the original images. The pokemon_jpg folder is only 33MB, but our pokemon_jpeg.hdf5 is 614MB. This is expected: the JPEG compression is gone, and we are storing raw float32 tensors instead, which take 819 × 256 × 256 × 3 × 4 bytes ≈ 614MB.

Note: Obviously, we won't write each tensor one by one into the HDF5 file. Instead, we will use a buffer to reduce the number of write accesses. Please see the sample code linked above for the full details.
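Below is a minimal sketch of that buffering idea, assuming the data and labels datasets created above are still open. The buffer size of 128 and the load_all_images helper are placeholders of my own; the linked sample code may structure this differently.

import numpy as np

buffer_size = 128 # Flush to disk every 128 images; an arbitrary choice for illustration
image_buffer, label_buffer = [], []
write_index = 0 # Position in the HDF5 datasets where the next flush starts

def flush():
    global write_index
    count = len(image_buffer)
    # A single slice assignment writes the whole buffer in one access
    data[write_index:write_index + count] = np.stack(image_buffer)
    labels[write_index:write_index + count] = label_buffer
    write_index += count
    image_buffer.clear()
    label_buffer.clear()

for image, label in load_all_images(): # Hypothetical helper yielding (tensor, label) pairs
    image_buffer.append(image)
    label_buffer.append(label)
    if len(image_buffer) >= buffer_size:
        flush()

if image_buffer: # Write whatever is left over at the end
    flush()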

Read data from an HDF5 file

Reading from an HDF5 file is pretty straightforward. The first step is to open the file.

db = h5py.File('pokemon_jpeg.hdf5', 'r') # Open the file in read-only mode

Then we can access all the sample data and labels via the db object.

images = db['data'][0:10] # Retrieve the first 10 image tensors
labels = db['labels'][0:10] # And retrieve the first 10 labels

Note that h5py does not load the whole HDF5 file into memory; it only reads the parts we request.
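We can see this lazy behavior for ourselves: indexing db by a dataset name returns a lightweight Dataset handle, and bytes are only read from disk when we slice it. A small sketch for illustration:

data = db['data'] # Just a handle to the on-disk dataset; no image data is read yet
print(type(data)) # <class 'h5py._hl.dataset.Dataset'>

batch = data[100:132] # Only now are these 32 tensors read from disk
print(type(batch), batch.shape) # <class 'numpy.ndarray'> (32, 256, 256, 3)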

Create a generator from an HDF5 file

To feed the data from the HDF5 file into the fit method of Keras, we need to define a generator to yield data in batches. Below is a simple implementation.

import h5py
import numpy as np

def create_hdf5_generator(db_path, batch_size):
    db = h5py.File(db_path, 'r')
    db_size = db['data'].shape[0]

    while True: # loop through the dataset indefinitely
        for i in np.arange(0, db_size, batch_size):
            images = db['data'][i:i+batch_size]
            labels = db['labels'][i:i+batch_size]

            yield images, labels

We can test the generator like this.

db_path = 'pokemon_jpeg.hdf5'
batch_size = 32
hdf5_gen = create_hdf5_generator(db_path, batch_size)

samples, labels = next(hdf5_gen)
print(samples.shape) # Prints (32, 256, 256, 3)
print(labels.shape) # Prints (32,)

We can also use this generator to train models.

model.fit(hdf5_gen, steps_per_epoch=26, epochs=50) # 819 images / 32 per batch ≈ 26 steps; the generator loops forever on its own

Benchmark results

Finally, let’s compare the performance of our new solution with the baseline. We will use the timeit module to measure execution time.

import timeit

dataset = image_dataset_from_directory(
    directory='pokemon_jpg',
    color_mode='rgb',
    batch_size=32,
    image_size=(256,256)
)
normal_gen = iter(dataset.repeat())

db_path = 'pokemon_jpeg.hdf5'
batch_size = 32
hdf5_gen = create_hdf5_generator(db_path, batch_size)

rs_normal = timeit.timeit(lambda: next(normal_gen), number=1000)
rs_hdf5 = timeit.timeit(lambda: next(hdf5_gen), number=1000)

print(f'Baseline: {rs_normal}')
print(f'HDF5 benchmark: {rs_hdf5}')

Below are the results on my machine. We can see that the HDF5 solution is more than 50% faster than the baseline.

Baseline: 19.117717
HDF5 benchmark: 12.2482132

Conclusion

There is a famous saying about optimization: "There is no such thing as a free lunch." The HDF5 format allows us to quickly access tensor data from disk, at the expense of extra disk space. Personally, I find the training time saved to be well worth that cost.
