Last Updated on August 6, 2022
When you build and train a Keras deep learning model, you can provide the training data in several different ways. Presenting the data as a NumPy array or a TensorFlow tensor is common. Another way is to write a Python generator function and let the training loop read data from it. Yet another way of providing data is to use a tf.data dataset.
In this tutorial, you will see how you can use a tf.data dataset with a Keras model. After finishing this tutorial, you will learn:
- How to create and use the tf.data dataset
- The benefit of doing so compared to a generator function
Let’s get started.

A gentle introduction to the tensorflow.data API
Photo by Monika MG. Some rights reserved.
Overview
This article is divided into four sections; they are:
- Training a Keras Model with NumPy Array and Generator Function
- Creating a Dataset Using tf.data
- Creating a Dataset from Generator Function
- Dataset with Prefetch
Training a Keras Model with NumPy Array and Generator Function
Before you see how the tf.data API works, let’s review how you might usually train a Keras model.
First, you need a dataset. An example is the fashion MNIST dataset that comes with the Keras API. This dataset has 60,000 training samples and 10,000 test samples of 28×28 pixels in grayscale, and the corresponding classification label is encoded with integers 0 to 9.
The dataset is a NumPy array. You can then build a Keras model for classification and, with the model’s fit() function, provide the NumPy array as data.
The complete code is as follows:
```python
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
print(train_image.shape)
print(train_label.shape)
print(test_image.shape)
print(test_label.shape)

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")
history = model.fit(train_image, train_label,
                    batch_size=32, epochs=50,
                    validation_data=(test_image, test_label), verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history["val_sparse_categorical_accuracy"])
plt.show()
```
Running this code will print out the following:
```
(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)
313/313 [==============================] - 0s 392us/step - loss: 0.5114 - sparse_categorical_accuracy: 0.8446
[0.5113903284072876, 0.8446000218391418]
```
It also creates the following plot of validation accuracy over the 50 epochs you trained your model:
The other way of training the same network is to provide the data from a Python generator function instead of a NumPy array. A generator function is one with a yield statement to emit data while the function runs in parallel with the data consumer. A generator for the fashion MNIST dataset can be created as follows:
```python
def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0
```
This function is meant to be called with the syntax batch_generator(train_image, train_label, 32). It will scan the input arrays in batches indefinitely. Once it reaches the end of the array, it restarts from the beginning.
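As a quick sanity check (not part of the original listing, but assuming the fashion MNIST arrays from above are loaded), you can pull a single batch from the generator and inspect its shape:
```python
# Create the generator and take one batch from it
gen = batch_generator(train_image, train_label, 32)
images, labels = next(gen)
print(images.shape)  # (32, 28, 28)
print(labels.shape)  # (32,)
```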
Training a Keras model with a generator is similar, using the fit() function:
```python
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50, validation_data=(test_image, test_label), verbose=0)
```
Instead of providing the data and labels separately, you just need to provide the generator, as it will give out both. When data is presented as a NumPy array, Keras can tell how many samples there are by looking at the length of the array and can complete one epoch when the entire dataset has been used once. However, your generator function will emit batches indefinitely, so you need to tell Keras when an epoch ends, using the steps_per_epoch argument to the fit() function.
In the above code, the validation data was provided as a NumPy array, but you can use a generator instead and specify the validation_steps argument, as in the sketch below.
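A minimal sketch of that variation (not from the original listing), assuming the same batch_generator defined above is reused for the test set:
```python
# Supply a generator for the validation data as well, and tell Keras
# how many validation batches make up one pass over the test set
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50,
                    validation_data=batch_generator(test_image, test_label, 32),
                    validation_steps=len(test_image)//32,
                    verbose=0)
```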
The following is the complete code using a generator function, in which the output is the same as the previous example:
```python
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
print(train_image.shape)
print(train_label.shape)
print(test_image.shape)
print(test_label.shape)

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])

def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50, validation_data=(test_image, test_label), verbose=0)
print(model.evaluate(test_image, test_label))

plt.plot(history.history["val_sparse_categorical_accuracy"])
plt.show()
```
Creating a Dataset Using tf.data
Given that you have the fashion MNIST data loaded, you can convert it into a tf.data dataset, like the following:
```python
...
dataset = tf.data.Dataset.from_tensor_slices((train_image, train_label))
print(dataset.element_spec)
```
This prints the dataset’s spec as follows:
```
(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.uint8, name=None))
```
You can see the data is a tuple (as a tuple was passed as an argument to the from_tensor_slices() function), where the first element has shape (28,28) and the second element is a scalar. Both elements are stored as 8-bit unsigned integers.
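If you want to double-check, you can pull the first element out of the dataset and compare it against the printed spec (an assumed quick check, not in the original code):
```python
# A dataset is iterable, so take the first (image, label) pair and inspect it
image, label = next(iter(dataset))
print(image.shape, image.dtype)    # (28, 28), uint8
print(label.numpy(), label.dtype)  # a single integer label, uint8
```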
If you do not present the data as a tuple of two NumPy arrays when you create the dataset, you can also combine them later. The following creates the same dataset, but first creates separate datasets for the image data and the labels before combining them:
```python
...
train_image_data = tf.data.Dataset.from_tensor_slices(train_image)
train_label_data = tf.data.Dataset.from_tensor_slices(train_label)
dataset = tf.data.Dataset.zip((train_image_data, train_label_data))
print(dataset.element_spec)
```
This will print the same spec:
```
(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.uint8, name=None))
```
The zip() function on datasets is like the zip() function in Python: it matches elements one by one from multiple datasets into a tuple, as the toy example below illustrates.
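Here is a small toy illustration (not from the fashion MNIST example) of how zip() pairs elements from two datasets:
```python
# Pair elements of two small datasets, much like Python's built-in zip()
a = tf.data.Dataset.range(3)                          # 0, 1, 2
b = tf.data.Dataset.from_tensor_slices([10, 20, 30])
for x, y in tf.data.Dataset.zip((a, b)):
    print(x.numpy(), y.numpy())   # prints the pairs (0, 10), (1, 20), (2, 30)
```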
One benefit of using a tf.data dataset is the flexibility in handling the data. Below is the complete code for training a Keras model with a dataset, in which the batch size is set on the dataset:
```python
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
dataset = tf.data.Dataset.from_tensor_slices((train_image, train_label))

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])

# The model must be compiled before fit() can be called
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")

history = model.fit(dataset.batch(32),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history["val_sparse_categorical_accuracy"])
plt.show()
```
This is the simplest use case of a dataset. If you dive deeper, you can see that a dataset is just an iterator. Therefore, you can print out each sample in a dataset using the following:
```python
for image, label in dataset:
    print(image)  # array of shape (28,28) in tf.Tensor
    print(label)  # integer label in tf.Tensor
```
The dataset has many built-in functions. The batch() used before is one of them. If you create batches from a dataset and print them, you have the following:
```python
for image, label in dataset.batch(32):
    print(image)  # array of shape (32,28,28) in tf.Tensor
    print(label)  # array of shape (32,) in tf.Tensor
```
Here, each item drawn from the batched dataset is not a single sample but a batch of samples. You also have functions such as map(), filter(), and reduce() for sequence transformation, and concatenate() and interleave() for combining with another dataset. There are also repeat(), take(), take_while(), and skip(), much like their familiar counterparts in Python’s itertools module. A full list of the functions can be found in the API documentation. A brief example of map() and filter() follows.
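As an assumed illustration of two of these functions (not part of the original tutorial), the snippet below uses map() to scale the images to floating point in [0, 1] and filter() to keep only the samples with labels below 5:
```python
# map(): cast the uint8 images to float32 and scale them to [0, 1]
def normalize(image, label):
    return tf.cast(image, tf.float32) / 255.0, label

# filter(): keep only the samples whose label is 0..4
transformed = dataset.map(normalize).filter(lambda image, label: label < 5)
print(transformed.element_spec)
```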
Creating a Dataset from Generator Function
So far, you have seen how a dataset can be used in place of a NumPy array when training a Keras model. Indeed, a dataset can also be created from a generator function. But instead of a generator function that generates a batch, as you saw in one of the examples above, you now make a generator function that generates one sample at a time. The following is the function:
```python
import numpy as np
import tensorflow as tf

def shuffle_generator(image, label, seed):
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[train_image, train_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(28,28), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))
print(dataset.element_spec)
```
This function randomizes the input array by shuffling an index vector, then yields one sample at a time. Unlike the previous example, this generator ends when the samples from the array are exhausted.
You create a dataset from the function using from_generator(). You need to provide the name of the generator function (instead of an instantiated generator) and also the output signature of the dataset. This is required because the tf.data.Dataset API cannot infer the dataset spec before the generator is consumed.
Running the above code will print the same spec as before:
```
(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.uint8, name=None))
```
Such a dataset is functionally equivalent to the dataset you created previously. Hence you can use it for training as before. The following is the complete code:
```python
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()

def shuffle_generator(image, label, seed):
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[train_image, train_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(28,28), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])

# The model must be compiled before fit() can be called
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")

history = model.fit(dataset.batch(32),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history["val_sparse_categorical_accuracy"])
plt.show()
```
Dataset with Prefetch
The real benefit of using a dataset is prefetch().
Using a NumPy array for training is probably the best option for performance. However, it means you need to load all the data into memory. Using a generator function for training lets you prepare one batch at a time, so the data can be loaded from disk on demand, for example. However, using a generator function to train a Keras model means either the training loop or the generator function is running at any given time. It is not easy to make the generator function and Keras’s training loop run in parallel.
Dataset is the API that allows the generator and the training loop to run in parallel. If you have a generator that is computationally expensive (e.g., doing image augmentation in real time), you can create a dataset from such a generator function and then use it with prefetch(), as follows:
```python
...
history = model.fit(dataset.batch(32).prefetch(3),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)
```
The number argument to prefetch() is the size of the buffer. Here, the dataset is asked to keep three batches in memory ready for the training loop to consume. Whenever a batch is consumed, the dataset API resumes the generator function to refill the buffer asynchronously in the background. Therefore, the training loop and the data preparation algorithm inside the generator function can run in parallel.
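As a side note, newer versions of TensorFlow also accept tf.data.AUTOTUNE as the buffer size, which lets the runtime tune the amount of prefetching for you; a minimal sketch:
```python
# Let tf.data pick the prefetch buffer size dynamically
history = model.fit(dataset.batch(32).prefetch(tf.data.AUTOTUNE),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)
```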
It is worth mentioning that, in the previous section, you created a shuffling generator for the dataset API. Indeed, the dataset API also has a shuffle() function to do the same, but you may not want to use it unless the dataset is small enough to fit in memory.
The shuffle() function, like prefetch(), takes a buffer-size argument. The shuffle algorithm fills the buffer with elements from the dataset and draws one element randomly from it. The consumed element is then replaced with the next element from the dataset. Hence you need a buffer as large as the dataset itself to get a truly random shuffle. This limitation is demonstrated with the following snippet:
```python
import tensorflow as tf
import numpy as np

n_dataset = tf.data.Dataset.from_tensor_slices(np.arange(10000))
for n in n_dataset.shuffle(10).take(20):
    print(n.numpy())
```
The output from the above looks like the following:
```
9
6
2
7
5
1
4
14
11
17
19
18
3
16
15
22
10
23
21
13
```
Here you can see that the numbers are only shuffled within a small neighborhood, and you never see large numbers in the output.
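For comparison (a follow-up sketch, not in the original snippet), if the buffer is as large as the dataset, shuffle() can draw any element at any time, and the output is no longer confined to a small neighborhood:
```python
# With a buffer covering the whole dataset, the shuffle is uniform
for n in n_dataset.shuffle(10000).take(20):
    print(n.numpy())
```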
Further Reading
More about the tf.data dataset can be found in its API documentation.
Summary
In this post, you have seen how you can use the tf.data dataset and how it can be used in training a Keras model.
Specifically, you learned:
- How to train a model using data from a NumPy array, a generator function, and a dataset
- How to create a dataset using a NumPy array or a generator function
- How to use prefetch with a dataset to make the generator and training loop run in parallel