This post is part of a series of posts explaining how to structure a deep learning project in TensorFlow. Here, we explain how to easily define a deep learning model in TensorFlow using tf.layers, and how to train it. The complete code examples can be found in our GitHub repository.
This tutorial is among a series explaining how to structure a deep learning project:
- installation, get started with the code for the projects
- (TensorFlow only): explain the global structure of the code
- (TensorFlow only): how to feed data into the model using tf.data
- this post: how to create the model and train it
Goals of this tutorial
- learn more about TensorFlow
- learn how to easily build models using tf.layers
- …
Defining the model
Great, now that we have this input dictionary containing the Tensors corresponding to the data, let's explain how we build the model.
Introduction to tf.layers
This high-level Tensorflow API lets you build and prototype models in a few lines. You can have a look at the official tutorial for computer vision, or at the list of available layers. The idea is quite simple, so we'll just give an example.
Let's get an input Tensor with a mechanism similar to the one explained in the previous part. Remember that None corresponds to the batch dimension.
# shape = [None, 64, 64, 3]
images = inputs["images"]
Now, let’s apply a convolution, a relu activation and a max-pooling. This is as simple as
out = images
out = tf.layers.conv2d(out, 16, 3, padding='same')
out = tf.nn.relu(out)
out = tf.layers.max_pooling2d(out, 2, 2)
Finally, let's use this output to predict the labels of the images (6 classes). After one 2x2 max-pooling on the 64x64 input, out has shape [None, 32, 32, 16], so we first need to reshape it into a vector
# First, reshape the output into [batch_size, flat_size]
out = tf.reshape(out, [-1, 32 * 32 * 16])

# Now, logits is [batch_size, 6]
logits = tf.layers.dense(out, 6)
Note the use of -1: Tensorflow will infer the corresponding dimension (here, the batch size) so that the total number of elements is preserved.
The logits will be unnormalized scores for each example.
In the code examples, the transformation from inputs to logits is done in the build_model function.
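Putting the pieces above together, a minimal sketch of such a function could look like this (the exact signature and arguments in the repository may differ):

def build_model(inputs):
    """Compute the logits from the input images (illustrative sketch)."""
    out = inputs["images"]                     # shape = [None, 64, 64, 3]
    out = tf.layers.conv2d(out, 16, 3, padding='same')
    out = tf.nn.relu(out)
    out = tf.layers.max_pooling2d(out, 2, 2)
    out = tf.reshape(out, [-1, 32 * 32 * 16])  # flatten to [batch_size, flat_size]
    logits = tf.layers.dense(out, 6)           # [batch_size, 6]
    return logits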
Training ops
At this point, we have defined the logits of the model. We still need to define our predictions, our loss, etc. You can have a look at the model_fn in model/model_fn.py.
# Get the labels from the input data pipeline
labels = inputs['labels']
labels = tf.cast(labels, tf.int64)

# Define the prediction as the argmax of the scores
predictions = tf.argmax(logits, 1)

# Define the loss
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
The 1 in tf.argmax tells Tensorflow to take the argmax along axis 1 (remember that axis 0 is the batch dimension).
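For instance, on a hypothetical batch of two examples with three classes:

# Hypothetical scores: 2 examples, 3 classes
scores = tf.constant([[0.1, 2.0, 0.3],
                      [1.5, 0.2, 0.1]])
# argmax over axis 1 gives one predicted class per example: [1, 0]
preds = tf.argmax(scores, 1)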
Now, let's use Tensorflow's built-in functions to create the nodes and operators that will train our model at each iteration!
# Create an optimizer that will take care of the gradient descent
optimizer = tf.train.AdamOptimizer(0.01)

# Create the training operation
train_op = optimizer.minimize(loss)
All these nodes are created by model_fn, which returns a dictionary model_spec containing all the necessary nodes and operators of the graph. This dictionary will later be used to actually run the training operations.
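As a rough sketch (the actual model_fn in model/model_fn.py takes more arguments and returns more entries, such as metrics and summaries), the packaging could look like:

def model_fn(mode, inputs, params, reuse=False):
    # Rough sketch only: build the graph, then package the useful nodes
    labels = tf.cast(inputs['labels'], tf.int64)
    with tf.variable_scope('model', reuse=reuse):
        logits = build_model(inputs)
    predictions = tf.argmax(logits, 1)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

    model_spec = dict(inputs)   # keep the input nodes (e.g. the iterator init op)
    model_spec['predictions'] = predictions
    model_spec['loss'] = loss
    model_spec['train_op'] = train_op
    return model_spec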
And that's all! Our model is ready to be trained. Remember that all the objects we defined so far are nodes or operators that are part of the Tensorflow graph. To evaluate them, we need to execute them in a session. Simply run
with tf.Session() as sess:
    for i in range(num_batches):
        _, loss_val = sess.run([train_op, loss])
Notice how we don't need to feed data to the session: the tf.data nodes automatically iterate over the dataset! At every iteration of the loop, the session moves to the next batch (remember the tf.data part), computes the loss, and executes the train_op, which performs one update of the weights!
For more details, have a look at the model/training.py file that defines the train_and_evaluate function.
Putting input_fn and model_fn together
To summarize the different steps, we just give a high-level overview of what needs to be done in train.py
# 1. Create the iterators over the training and evaluation datasets
train_inputs = input_fn(True, train_filenames, train_labels, params)
eval_inputs = input_fn(False, eval_filenames, eval_labels, params)

# 2. Define the model
logging.info("Creating the model...")
train_model_spec = model_fn('train', train_inputs, params)
eval_model_spec = model_fn('eval', eval_inputs, params, reuse=True)

# 3. Train the model (where a session will actually run the different ops)
logging.info("Starting training for {} epoch(s)".format(params.num_epochs))
train_and_evaluate(train_model_spec, eval_model_spec, args.model_dir, params, args.restore_from)
The train_and_evaluate function performs a given number of epochs (i.e. full passes over the train_inputs). At the end of each epoch, it evaluates the performance on the development set (dev or train-dev in the course material).
Remember the discussion about different graphs for training and evaluation. Here, notice how the eval_model_spec is given the reuse=True argument. It makes sure that the nodes of the evaluation graph which must share weights with the training graph do share their weights.
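Concretely, this weight sharing relies on wrapping the model in a variable scope with the reuse flag, as in the model_fn sketch above (the scope name 'model' is illustrative):

with tf.variable_scope('model', reuse=reuse):
    # With reuse=False (training graph), the layers create their variables.
    # With reuse=True (evaluation graph), they look up and share the existing
    # variables created by the training graph, instead of creating new ones.
    logits = build_model(inputs)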
Evaluation and tf.metrics
So far, we've explained how we input data into the graph and how we define the different nodes and training ops, but we don't know (yet) how to compute some metrics on our dataset. There are basically two possibilities:
- [run evaluation outside the Tensorflow graph] Evaluate the predictions over the dataset by running sess.run(predictions) and use them to evaluate your model (without Tensorflow, with pure python code). This option can also be used if you need to write a file with all the predictions and use an external script (distributed by a conference, for instance) to evaluate the performance of your model.
- [use Tensorflow] As the above method can be quite complicated for simple metrics, Tensorflow luckily has some built-in tools to run evaluation. Again, we are going to create nodes and operations in the graph. The concept is simple: we will use the tf.metrics API to build those, the idea being that we update the metric on each batch. At the end of the epoch, we can just query the updated metric!
We'll cover method 2, as this is the one we implemented in the code examples (but you can definitely go with option 1 by modifying model/evaluation.py). Like most of the nodes of the graph, we define these metrics nodes and ops in model/model_fn.py.
# Define the different metrics
with tf.variable_scope("metrics"):
    metrics = {'accuracy': tf.metrics.accuracy(labels=labels, predictions=predictions),
               'loss': tf.metrics.mean(loss)}
# Group the update ops for the tf.metrics, so that we can run only one op to update them all
update_metrics_op = tf.group(*[op for _, op in metrics.values()])

# Get the op to reset the local variables used in tf.metrics, for when we restart an epoch
metric_variables = tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope="metrics")
metrics_init_op = tf.variables_initializer(metric_variables)
Notice that we define the metrics, a grouped update op and an initializer. The * in tf.group unpacks the list of update ops, passing each one as a separate positional argument to the function.
Notice also how we define the metrics in a dedicated variable_scope, so that we can query the variables by name when we create the initializer! When you create nodes, their variables are added to some pre-defined collections of variables (TRAINABLE_VARIABLES, etc.). The variables we need to reset for tf.metrics are in the tf.GraphKeys.LOCAL_VARIABLES collection. Thus, to query them, we get the collection of variables in the right scope!
Now, to evaluate the metrics on a dataset, we’ll just need to run them in a session as we loop over our dataset
with tf.Session() as sess:
    # Run the initializer to reset the metrics to zero
    sess.run(metrics_init_op)

    # Update the metrics over the dataset
    for _ in range(num_steps):
        sess.run(update_metrics_op)

    # Get the values of the metrics
    metrics_values = {k: v[0] for k, v in metrics.items()}
    metrics_val = sess.run(metrics_values)
And that's all! If you want to compute a new metric for which you can find a Tensorflow implementation, you can define it in model/model_fn.py (add it to the metrics dictionary). It will automatically be updated during training and displayed at the end of each epoch.
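For instance, a hypothetical addition could track the accuracy averaged over the 6 classes:

# Hypothetical extra entry in the metrics dictionary
metrics['per_class_accuracy'] = tf.metrics.mean_per_class_accuracy(
    labels=labels, predictions=predictions, num_classes=6)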
Tensorflow Tips and Tricks
Be careful with initialization
So far, we mentioned three different initializer ops.
# 1. For all the variables (the weights, etc.)
tf.global_variables_initializer()

# 2. For the dataset, so that we can choose to move the iterator back to the beginning
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
iterator_init_op = iterator.initializer

# 3. For the metrics variables, so that we can reset them to zero at the beginning of each epoch
metrics_init_op = tf.variables_initializer(metric_variables)
During train_and_evaluate, we perform the following schedule, all in one session:
- Loop over the training set, updating the weights and computing the metrics
- Loop over the evaluation set, computing the metrics
- Go back to step 1.
We thus need to run:
- tf.global_variables_initializer() at the very beginning (before the first occurrence of step 1)
- iterator_init_op at the beginning of every loop (step 1 and step 2)
- metrics_init_op at the beginning of every loop (step 1 and step 2), to reset the metrics to zero (we don't want to compute the metrics averaged over different epochs or different datasets!)
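Schematically, and using the op names defined above (train_iterator_init_op, eval_iterator_init_op, num_train_steps and num_eval_steps are hypothetical placeholders), the schedule looks like:

with tf.Session() as sess:
    # Once, at the very beginning: initialize the model variables
    sess.run(tf.global_variables_initializer())

    for epoch in range(num_epochs):
        # Step 1: loop over the training set
        sess.run(train_iterator_init_op)  # hypothetical name for the train iterator's init op
        sess.run(metrics_init_op)
        for _ in range(num_train_steps):
            sess.run([train_op, update_metrics_op])

        # Step 2: loop over the evaluation set
        sess.run(eval_iterator_init_op)   # hypothetical name for the eval iterator's init op
        sess.run(metrics_init_op)
        for _ in range(num_eval_steps):
            sess.run(update_metrics_op)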
You can indeed check that this is what we do in model/evaluation.py and model/training.py when we actually run the graph!
Saving
Training a model and evaluating it is fine, but what about re-using the weights? Also, maybe at some point during training, performance started to get worse on the validation set, and we want to use the best weights we got during training.
Saving models is easy in Tensorflow. Look at the outline below
# We need to create an instance of saver
saver = tf.train.Saver()

for epoch in range(10):
    for batch in range(10):
        _ = sess.run(train_op)
    # Save weights at the end of each epoch
    save_path = os.path.join(model_dir, 'last_weights', 'after-epoch')
    saver.save(sess, save_path, global_step=epoch + 1)
There is not much to say, except that the saver.save() method takes a session as input. In our implementation, we use two savers: a last_saver = tf.train.Saver() that keeps the weights of the last 5 epochs, and a best_saver = tf.train.Saver(max_to_keep=1) that only keeps one checkpoint, corresponding to the weights that achieved the best performance on the validation set!
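A sketch of how the two savers could work together (best_save_path, last_save_path and the eval_acc comparison are illustrative; see model/training.py for the actual logic):

last_saver = tf.train.Saver()               # keeps the 5 most recent checkpoints by default
best_saver = tf.train.Saver(max_to_keep=1)  # keeps only the single best checkpoint

best_eval_acc = 0.0
for epoch in range(num_epochs):
    # ... train for one epoch, then evaluate on the dev set ...
    last_saver.save(sess, last_save_path, global_step=epoch + 1)
    if eval_acc > best_eval_acc:
        best_eval_acc = eval_acc
        best_saver.save(sess, best_save_path, global_step=epoch + 1)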
Later on, to restore the weights of your model, you need to reload them through a saver instance, as in
with tf.Session() as sess:
    # Get the latest checkpoint in the directory
    restore_from = tf.train.latest_checkpoint("model/last_weights")
    # Reload the weights into the variables of the graph
    saver.restore(sess, restore_from)
You can look at the files model/training.py and model/evaluation.py for more details.
Tensorboard and summaries
Tensorflow comes with an excellent visualization tool called Tensorboard that enables you to plot different scalars (and much more) in real-time, as you train your model.
The mechanism of Tensorboard is the following:
- define some summaries (nodes of the graph) that will tell Tensorflow which values we want to plot
- evaluate these nodes in the session
- write the output to a file thanks to a tf.summary.FileWriter
Then, you only need to launch Tensorboard from a terminal, for instance with

tensorboard --logdir="experiments/base_model"

Navigate to http://127.0.0.1:6006/ in your web browser and you'll see the different plots.
In the code examples, we add the summaries in model/model_fn.py.
# Compute different scalars to plot
loss = tf.reduce_mean(losses)
accuracy = tf.reduce_mean(tf.cast(tf.equal(labels, predictions), tf.float32))

# Summaries for training
tf.summary.scalar('loss', loss)
tf.summary.scalar('accuracy', accuracy)
Note that we don't use the metrics we defined earlier. The reason is that tf.metrics returns a running average, while Tensorboard already takes care of the smoothing, so we don't want to add any additional smoothing. It's actually rather the opposite: we are interested in the real-time progress.
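These scalar summaries then have to be merged into the single summary_op that is evaluated in the session below; tf.summary.merge_all() gathers all the summaries defined in the graph:

# Merge all the summaries defined in the graph into one op
summary_op = tf.summary.merge_all()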
Once these nodes are added to the model_spec dictionary, we need to evaluate them in a session. In our implementation, this is done every params.save_summary_steps, as you'll notice in the model/training.py file.
if i % params.save_summary_steps == 0:
    # Perform a mini-batch update
    _, _, loss_val, summ, global_step_val = sess.run([train_op, update_metrics, loss,
                                                      summary_op, global_step])
    # Write summaries for Tensorboard
    writer.add_summary(summ, global_step_val)
else:
    _, _, loss_val = sess.run([train_op, update_metrics, loss])
You'll notice that we have two different writers:
train_writer = tf.summary.FileWriter(os.path.join(model_dir, 'train_summaries'), sess.graph)
eval_writer = tf.summary.FileWriter(os.path.join(model_dir, 'eval_summaries'), sess.graph)
They write summaries for both the training and the evaluation, letting you plot both curves on the same graph!
A note about the global_step
In order to keep track of how far along in training we are, we use one of Tensorflow's training utilities, the global_step. Once initialized, we give it to optimizer.minimize() as shown below. Thus, each time we run sess.run(train_op), the global_step is incremented by 1. This is very useful for summaries (notice how in the Tensorboard part we give the global step to the writer).
global_step = tf.train.get_or_create_global_step()
train_op = optimizer.minimize(loss, global_step=global_step)