Note: the Vietnamese version of this article is available at the link below.
https://duongnt.com/hdf5-with-h5py-vie
All things being equal, the more training data we have, the more accurate a model we can train. But at some point, the training data becomes too big to fit into memory. Although we can write a generator to read data directly from disk while training, doing so incurs a massive performance penalty because of all the extra I/O operations.
Today, we will take a look at the HDF5 format and see how it can help us improve processing time. All the code in this article is written in Python, using the h5py package. You can download the full sample code from the link below.
https://gist.github.com/duongntbk/8f5828f74b082d6c5136790498ab8023
Our sample dataset
Keep in mind that the actual content of our test data is not important. We only care about how to store and read them using the HDF5 format. Because of that, you can use whatever dataset you have lying around.
If you don’t have a dataset ready, you can download the Pokemon Image Dataset from here. It consists of 819 images in JPEG and PNG format. The dimensions of each image are 256×256 pixels. Today, we will process the JPEG images in the pokemon_jpg folder.
To simulate a training dataset, we will split those images into 2 labels. Please create two new folders inside pokemon_jpg and move half of the images into each one, as shown in the tree below (a script to automate this follows the tree).
pokemon_jpg
├── 0 (put 400 images in this folder)
│   ├── 1.jpg
│   ├── ...
│   └── x.jpg
└── 1 (put the rest in this folder)
    ├── y.jpg
    ├── ...
    └── z.jpg
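If you'd rather not move the files by hand, here is a minimal sketch that performs the split programmatically. The folder layout and the 400-image split come from the article; the use of pathlib and shutil, and the sorted order, are just one way to do it.

import shutil
from pathlib import Path

root = Path('pokemon_jpg')
(root / '0').mkdir(exist_ok=True)
(root / '1').mkdir(exist_ok=True)

# Move the first 400 images into folder 0 and the rest into folder 1
images = sorted(root.glob('*.jpg'))
for i, image in enumerate(images):
    target = '0' if i < 400 else '1'
    shutil.move(str(image), str(root / target / image.name))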
A baseline solution
Keras supports reading images directly from disk via the image_dataset_from_directory method.
from tensorflow.keras.preprocessing import image_dataset_from_directory

dataset = image_dataset_from_directory(
    directory='pokemon_jpg',
    color_mode='rgb',
    batch_size=32,
    image_size=(256, 256)
)
We can use dataset when training a model.
dataset = dataset.repeat() # Loop back to the start when we reach the end of the dataset
model.fit(dataset, steps_per_epoch=819 // 32, epochs=50) # The dataset already yields batches, and a repeated dataset is infinite, so we set steps_per_epoch instead of batch_size
Or we can iterate through the content of dataset.
for i, (samples, labels) in enumerate(dataset):
    print(samples.shape, labels.shape)
    if i >= 5:
        break # The repeated dataset is infinite, so we must stop iterating ourselves
Use the HDF5 format with the h5py package
HDF5 is a format designed to store and organize large amounts of data while ensuring that the data can be accessed as quickly and efficiently as possible. You can learn more about the HDF5 format here. We will use a package called h5py to read and write HDF5 files. You can install h5py with the following command.
pip install h5py
Convert training data to HDF5 format
The first step is to open a new .hdf5 file for writing and create two datasets, one for the sample data and one for the labels.
import h5py

db = h5py.File('pokemon_jpeg.hdf5', 'w')
data = db.create_dataset('data', (819, 256, 256, 3), dtype='float32') # We have 819 images, each one 256x256 pixels with 3 color channels
labels = db.create_dataset('labels', (819,), dtype='int') # Each of those 819 images has a corresponding label
Then we need to read all the images and convert their data to tensors.
from tensorflow.keras.preprocessing.image import img_to_array, load_img

image_path_1 = 'pokemon_jpg/0/1.jpg' # The label for this image is 0
image_1 = load_img(image_path_1, target_size=(256, 256), interpolation='bilinear')
image_1 = img_to_array(image_1, data_format='channels_last')

image_path_2 = 'pokemon_jpg/1/401.jpg' # The label for this image is 1
image_2 = load_img(image_path_2, target_size=(256, 256), interpolation='bilinear')
image_2 = img_to_array(image_2, data_format='channels_last')
# ...
And we can write the tensors and their labels into our HDF5 file.
data[0] = image_1
labels[0] = 0

data[1] = image_2
labels[1] = 1
# ...
After that, we need to close the HDF5 file.
db.close()
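As an aside, h5py File objects support Python's context-manager protocol, so you can also let a with block close the file for you. A minimal sketch, using the same datasets as above:

import h5py

# The file is closed automatically when the block exits, even on error
with h5py.File('pokemon_jpeg.hdf5', 'w') as db:
    data = db.create_dataset('data', (819, 256, 256, 3), dtype='float32')
    labels = db.create_dataset('labels', (819,), dtype='int')
    # ... write the tensors and labels here ...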
Note that the HDF5 file is much larger than the original images. The pokemon_jpg folder is only 33MB, but our pokemon_jpeg.hdf5 file is 614MB. This is because the JPEG compression is gone: we are storing raw float32 tensors, and 819 × 256 × 256 × 3 values at 4 bytes each comes out to roughly 614MB.
Note: Obviously, we won’t write each tensor one-by-one into the HDF5 file. Instead, we will use a buffer to reduce the number of write operations. Please see the sample code for the details; a minimal sketch of the idea follows.
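The sketch below only illustrates the buffering idea. The buffer size is an arbitrary example, and load_all_images is a hypothetical stand-in for the loading code above; the gist linked at the top has the full implementation.

import h5py
import numpy as np

BUFFER_SIZE = 128 # Example value; flush to disk every 128 images

db = h5py.File('pokemon_jpeg.hdf5', 'w')
data = db.create_dataset('data', (819, 256, 256, 3), dtype='float32')
labels = db.create_dataset('labels', (819,), dtype='int')

buffer_data, buffer_labels, index = [], [], 0
for image, label in load_all_images(): # Hypothetical helper yielding (tensor, label) pairs
    buffer_data.append(image)
    buffer_labels.append(label)
    if len(buffer_data) >= BUFFER_SIZE:
        # One slice assignment writes the whole buffer in a single operation
        data[index:index + len(buffer_data)] = np.array(buffer_data)
        labels[index:index + len(buffer_labels)] = np.array(buffer_labels)
        index += len(buffer_data)
        buffer_data, buffer_labels = [], []

# Flush whatever is left in the buffer
if buffer_data:
    data[index:index + len(buffer_data)] = np.array(buffer_data)
    labels[index:index + len(buffer_labels)] = np.array(buffer_labels)

db.close()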
Read data from an HDF5 file
Reading from an HDF5 file is pretty straightforward. The first step is to open the HDF5 file.
db = h5py.File('pokemon_jpeg.hdf5', 'r') # Open in read-only mode
Then we can access all the sample data and labels via the db object.
images = db['data'][0:10] # Retrieve the first 10 image tensors
labels = db['labels'][0:10] # And retrieve the first 10 labels
Note that h5py does not load the whole HDF5 file into memory, but only reads the necessary parts.
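To see this lazy behavior for yourself, note that indexing db by name only returns a lightweight Dataset handle; the actual disk read happens when you slice it. A quick illustration (the variable names are mine):

images_dset = db['data'] # An h5py Dataset handle; no image data has been read yet
print(images_dset.shape) # Prints (819, 256, 256, 3), taken from the file's metadata
first_batch = images_dset[0:32] # Only now are these 32 tensors read from disk
print(type(first_batch)) # Prints <class 'numpy.ndarray'>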
Create a generator from an HDF5 file
To feed the data from the HDF5 file into the fit method of Keras, we need to define a generator that yields data in batches. Below is a simple implementation.
import h5py
import numpy as np

def create_hdf5_generator(db_path, batch_size):
    db = h5py.File(db_path, 'r')
    db_size = db['data'].shape[0]

    while True: # Loop through the dataset indefinitely
        for i in np.arange(0, db_size, batch_size):
            images = db['data'][i:i+batch_size]
            labels = db['labels'][i:i+batch_size]
            yield images, labels
We can test the generator like this.
db_path = 'pokemon_jpeg.hdf5'
batch_size = 32

hdf5_gen = create_hdf5_generator(db_path, batch_size)
samples, labels = next(hdf5_gen)

print(samples.shape) # Prints (32, 256, 256, 3)
print(labels.shape) # Prints (32,)
We can also use this generator to train models.
model.fit(hdf5_gen, steps_per_epoch=819 // batch_size, epochs=50) # The generator already yields batches, so we set steps_per_epoch instead of batch_size
Benchmark results
Finally, let’s compare the performance of our new solution with the baseline. We will use the timeit module to measure execution time.
import timeit

dataset = image_dataset_from_directory(
    directory='pokemon_jpg',
    color_mode='rgb',
    batch_size=32,
    image_size=(256, 256)
)
normal_gen = iter(dataset.repeat())

db_path = 'pokemon_jpeg.hdf5'
batch_size = 32
hdf5_gen = create_hdf5_generator(db_path, batch_size)
rs_normal = timeit.timeit(lambda: next(normal_gen), number=1000)
rs_hdf5 = timeit.timeit(lambda: next(hdf5_gen), number=1000)
print(f'Baseline: {rs_normal}')
print(f'HDF5 benchmark: {rs_hdf5}')
Below are the results on my machine. The HDF5 solution is roughly 56% faster than the baseline, which makes sense: the baseline has to open and decode a JPEG file for every image, while the HDF5 file already stores decoded tensors.
Baseline: 19.117717
HDF5 benchmark: 12.248213200000002
Conclusion
There is a famous saying about optimization: "There is no such thing as a free lunch." The HDF5 format lets us access tensor data from disk quickly, at the expense of consuming more disk space. Personally, I find that the time saved is well worth the extra storage. And if the disk usage does bother you, h5py offers a knob for that too, sketched below.
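h5py can transparently compress a dataset when you create it, which shifts the speed/space trade-off back the other way at the cost of somewhat slower reads. A minimal sketch, where the chunk shape and compression level are example values of my own choosing, not something from this article:

import h5py

db = h5py.File('pokemon_jpeg_compressed.hdf5', 'w')
# Chunked storage is required for compression; here, one chunk per 32-image batch
data = db.create_dataset(
    'data', (819, 256, 256, 3), dtype='float32',
    chunks=(32, 256, 256, 3), compression='gzip', compression_opts=4
)
db.close()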