Debugging your tensorflow code right (without so many painful mistakes)

Galina Olejnik · Feb 9

When it comes to discussing writing code on tensorflow, it’s always about comparing it to PyTorch, talking about how complex the framework is and why some parts of tf.contrib work so badly.

Moreover, I know a lot of data scientists who use the tensorflow library only as part of a pre-written Github repo that can be cloned and then successfully used.

The reasons for such an attitude to this framework are very different and they’re definitely worth another long-read, but today let’s focus on more pragmatic problems: debugging code written in tensorflow and understanding its main peculiarities.

Core abstractions

computational graph. The first abstraction, which enables the framework’s lazy evaluation paradigm (as opposed to eager execution, the way “traditional” imperative Python programs run), is the computational graph, tf.Graph.

In fact, this approach allows the programmer to create tf.Tensor objects (edges) and tf.Operation objects (nodes), which are not evaluated immediately, but only when the graph is executed with a feed_dict passed as an argument.
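To make this concrete, here is a minimal sketch (the tensor names x, y and total are made up for illustration) of how a graph is built lazily and only computes values once it is run inside a session with a feed_dict:

```python
import tensorflow as tf

# Building the graph: these calls only add nodes and edges, nothing is computed yet.
x = tf.placeholder(tf.float32, shape=(None,), name='x')
y = tf.placeholder(tf.float32, shape=(None,), name='y')
total = tf.add(x, y, name='total')

print(total)  # Tensor("total:0", shape=(?,), dtype=float32) -- a handle, not a value

with tf.Session() as sess:
    # Only here is the graph actually executed, with feed_dict supplying the inputs.
    print(sess.run(total, feed_dict={x: [1.0, 2.0], y: [3.0, 4.0]}))  # [4. 6.]
```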

In fact, such an approach to constructing machine learning models is quite common for different frameworks (for instance, a similar idea is used in Apache Spark) and has different pros and cons, which become very obvious when it comes to actually writing and running the code.

The main advantage is that dataflow graphs enable parallelism and distributed execution quite easily, without explicitly forcing the code to run in parallel with the multiprocessing module.

In fact, well-written tensorflow code uses the resources of all of the cores as soon as it is launched without any additional configuration.

However, an obvious disadvantage of this workflow is that as long as you’re constructing your graph without running it on some input, you can never be sure that it won’t crash.

It definitely may crash.

Also, until you’ve executed the graph, you can’t estimate its running time either.

The main components of the computational graph worth talking about are graph collections and graph structure.

Strictly speaking, the graph structure is the set of nodes and edges discussed earlier, and graph collections are sets of variables, which can be grouped logically.

For instance, the common way to retrieve trainable variables of the graph is tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES).
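A tiny illustration of the difference (the variable names here are arbitrary): the graph structure is the full set of operations, while a collection is just a logical group of variables you can query by key.

```python
import tensorflow as tf

tf.reset_default_graph()
w = tf.get_variable('w', shape=[3, 3])
b = tf.get_variable('b', shape=[3])

graph = tf.get_default_graph()
print(len(graph.get_operations()))                          # structure: every node in the graph
print(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))  # collection: just [w, b]
```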

session. The second abstraction is highly correlated with the first one and has a slightly more complex idea behind it: the tensorflow session, tf.Session, represents the connection between the client program and the C++ runtime (remember that under the hood, tensorflow is all about C++).

Why C++? Because the mathematical operations implemented in this language can be very well optimized and, as a result, the computational graph operations can be processed with great performance.

When using the low-level tensorflow API (which most Python developers actually use), the tensorflow session is invoked as a context manager: the with tf.Session() as sess: syntax is used.

A session with nothing passed to the constructor (as in the previous example) uses only the resources of the local machine and the default tensorflow graph, but it can also access remote devices via the distributed tensorflow runtime.

In practice, a graph can’t exist without a session (without session it can’t be executed) and the session always has a pointer to the global graph.

Diving deeper into the details of running the session, the main point worth noticing is the syntax: tf.Session.run().

It can take a fetch (or a list of fetches) as an argument, which can be tensors, operations or tensor-like objects.

In addition, a feed_dict can be passed (this optional argument is a mapping (dictionary) of tf.placeholder objects to their values) together with a set of options.
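A short sketch of these arguments in action (the graph itself is made up for the example): fetches can be a single operation, a tensor or a list of them, while feed_dict supplies values for the placeholders.

```python
import tensorflow as tf

tf.reset_default_graph()
x = tf.placeholder(tf.float32, shape=(None,), name='x')
w = tf.get_variable('w', initializer=tf.ones([3]))
y = x * w
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)                                        # fetch: a single operation
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))       # fetch: a tensor, fed via feed_dict
    print(sess.run([y, w], feed_dict={x: [1.0, 2.0, 3.0]}))  # fetch: a list of fetches
```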

Possible issues to experience and their presumable solutions

session loading and making predictions with a pre-trained model.

This is the bottleneck, which took me a few weeks to understand, debug and fix.

I would like to concentrate on this issue and describe 2 possible flows for re-loading a pre-trained model (its graph and session) and actually using it.

First of all, what do we actually mean when talking about loading the model? To do so, we of course need to train and save it first.

It is usually done with the tf.train.Saver.save functionality and, as a result, we have 3 binary files with .index, .meta and .data-00000-of-00001 extensions, which contain all of the data needed to restore the session and graph.
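A minimal saving sketch, assuming a toy model with a single variable and an arbitrary checkpoint prefix ./my_model:

```python
import tensorflow as tf

tf.reset_default_graph()
char_embeddings = tf.get_variable('char_embeddings', shape=[100, 50])

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would normally happen here ...
    saver.save(sess, './my_model')
    # produces my_model.index, my_model.meta and my_model.data-00000-of-00001
```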

To actually load a model saved like this, one needs to restore the graph with tf.train.import_meta_graph() (the argument is the file with the .meta extension).

After following the steps described in the previous pipeline, all of the variables (including the so-called “hidden” ones, which will be discussed later) will be ported into the current graph.

To actually retrieve a tensor by its name (remember that the name differs depending on the scope where the tensor was created and the operation it is the result of), graph.get_tensor_by_name() should be executed.
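Here is what the first flow looks like end to end; it assumes the checkpoint files and the char_embeddings variable from the saving sketch above:

```python
import tensorflow as tf

tf.reset_default_graph()
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('./my_model.meta')   # restores the graph structure
    saver.restore(sess, './my_model')                        # restores the variable values
    graph = tf.get_default_graph()
    char_embeddings = graph.get_tensor_by_name('char_embeddings:0')
    print(sess.run(char_embeddings).shape)                   # (100, 50)
```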

This is the first way.

The second way is a bit more explicit and harder to implement (for the specific architecture of the model I’ve been working on, I haven’t managed to execute the graph successfully when using it) and is all about saving graph edges (tensors) explicitly into .npy or .npz files and then loading them back into the graph (and assigning proper names according to the scope where they’ve been created).
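A rough sketch of this second flow, with a single made-up variable (as noted above, I couldn’t make it work for my own architecture, so treat it as an outline rather than a recipe):

```python
import numpy as np
import tensorflow as tf

# Exporting: dump every trainable variable into a dict keyed by its full name.
tf.reset_default_graph()
with tf.variable_scope('model'):
    w = tf.get_variable('w', shape=[3, 3])
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    weights = sess.run({v.name: v for v in tf.trainable_variables()})
np.save('weights.npy', weights)

# Importing: rebuild the variables under the same scope and assign the saved values back.
tf.reset_default_graph()
with tf.variable_scope('model'):
    w = tf.get_variable('w', shape=[3, 3])
with tf.Session() as sess:
    saved = np.load('weights.npy', allow_pickle=True).item()
    sess.run(w.assign(saved['model/w:0']))
```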

This approach works, but it has 2 huge cons: first of all, when the model architecture becomes significantly complex, it gets quite hard to control and keep track of all of the weight matrices.

Secondly, there are “hidden” tensors, which are created without being explicitly initialized.

For instance, when you create a tf.nn.rnn_cell.BasicLSTMCell, it creates all the required weights and biases to implement an LSTM cell under the hood.

All of these variables are assigned names automatically.

By that I mean that the kernel and bias tensors are created automatically when the tf.nn.rnn_cell.BasicLSTMCell is initialized.

It may sound okay (since these 2 tensors are weights, and it seems pretty useful not to create them manually but rather have the framework handle it), but in fact, in many cases, it is not.

The main problem with such an approach is that when you look at the collections of the graph and see a bunch of variables whose origin you don’t know, you don’t actually know what you should save or where to load it.

Strictly speaking, it’s very hard to put hidden variables to the right place in the graph and appropriately operate them.

Harder than it should be.
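You can see these auto-created, auto-named tensors by listing the trainable variables after building a small RNN (the shapes here are arbitrary, and the exact names may vary between tensorflow versions):

```python
import tensorflow as tf

tf.reset_default_graph()
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=64)
inputs = tf.placeholder(tf.float32, shape=[None, 10, 8])
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

for v in tf.trainable_variables():
    print(v.name)
# typically something like:
# rnn/basic_lstm_cell/kernel:0
# rnn/basic_lstm_cell/bias:0
```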

creating a tensor with the same name twice without any warning (by automatically adding an _index ending).

I don’t consider this issue to be as important as the previous one, but it definitely bothers me because it results in a lot of graph execution errors.

To explain the problem better, let’s give an example of the issue.

For instance, you create a tensor using tf.get_variable(name=’char_embeddings’, dtype=…) in the graph, then save it and load it back in a new session (and a new script, of course).

You’ve forgotten that this variable was a trainable one and have created it one more time in the same fashion with the tf.get_variable() functionality.

When running the graph, the error will look like: FailedPreconditionError (see above for traceback): Attempting to use uninitialized value char_embeddings_2.

The reason is that, of course, you’ve created a new, empty variable instead of pre-loading the existing one into the appropriate place of the model, even though it could have been pre-loaded, since it is already contained within the graph.

However, no error or warning was raised about the programmer having created a tensor with the same name twice (even Windows would do so).
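A stripped-down illustration of the silent renaming itself (not the full restore scenario described above), using tf.Variable so that both creations go through without complaint:

```python
import tensorflow as tf

tf.reset_default_graph()
a = tf.Variable(tf.zeros([5, 10]), name='char_embeddings')
b = tf.Variable(tf.zeros([5, 10]), name='char_embeddings')   # no error, no warning

print(a.name)  # char_embeddings:0
print(b.name)  # char_embeddings_1:0 -- silently renamed, and uninitialized until you run an init op
```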

Maybe this is the point that is crucial to me, but this is the peculiarity of tensorflow and its behavior I don’t really enjoy.

resetting the graph manually when writing unit tests, and other testing problems.

Testing code written in tensorflow is always hard, for a lot of reasons.

The first — most obvious one — is already mentioned at the beginning of this paragraph and may sound quite silly, but for me, it was at least irritating.

Because there’s only one default tensorflow graph for all of the tensors of all of the modules accessed during runtime, it is impossible to test the same functionality with, for instance, different parameters without resetting the graph.

Actually, it is only one line of code, tf.reset_default_graph(), but knowing that it should be performed at the top of the majority of the methods, this becomes some kind of monkey job and, of course, an obvious example of code duplication.

I haven’t found any way to handle this issue, because all of the tensors (no matter the scope where they’re created) are always linked to the same graph and there’s no way to isolate them (except for creating a separate tensorflow graph for each method, which, from my point of view, is not the best practice).

Another thing about unit tests that also bothers me a lot is that when some part of the constructed graph should not be executed (it has uninitialized tensors inside because the model hasn’t been trained yet), one doesn’t really know what to actually test.

I mean that the arguments to self.assertEqual() are not clear (should we test the names of the output tensors or their shapes? what if the shapes are None? what if the tensor name or shape is not enough to actually conclude that the code works appropriately?).

In my case, I simply end up asserting tensor names, shapes, and dimensions, but I’m sure that when the graph is not executed, checking only this part of the functionality is not sufficient.
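For reference, this is roughly what such a test ends up looking like; build_embeddings() is a hypothetical model-building helper, and the assertions only cover names and shapes, exactly because the graph is never executed:

```python
import unittest
import tensorflow as tf


def build_embeddings(vocab_size, dim):
    # hypothetical helper standing in for a real model-building function
    return tf.get_variable('char_embeddings', shape=[vocab_size, dim])


class EmbeddingsTest(unittest.TestCase):
    def setUp(self):
        tf.reset_default_graph()   # has to be repeated before every single test

    def test_name_and_shape(self):
        emb = build_embeddings(vocab_size=100, dim=50)
        self.assertEqual(emb.name, 'char_embeddings:0')
        self.assertEqual(emb.shape.as_list(), [100, 50])


if __name__ == '__main__':
    unittest.main()
```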

confusing tensor names.

I would rather call this comment on tensorflow’s behavior an extraordinary way of whining, but one can’t really tell what the name of the resulting tensor will be after performing some kind of operation.

I mean, is the name bidirectional_rnn/bw/bw/while/Exit_4:0 clear to you? As for me, it is absolutely not.

I get that this tensor is the result of some kind of operation on the backward cell of a dynamic bidirectional RNN, but without explicitly debugging the code with breakpoints, one can’t find out which operations were performed and in what order.

Also, the index endings are not understandable either: to realize where the number 4 came from, one needs to read the tensorflow docs and dive deep into the computational graph operations applied to this specific tensor.

The situation is the same for the “hidden” variables discussed earlier: why do we have bias and kernel names there? Maybe this is a problem of my qualifications and level of skill, but such debugging cases are quite complicated for me.

tf.AUTO_REUSE, trainable variables, re-compiling the library and other naughty stuff.

The last point of this list is all about small details I had to learn by trial and error.

The first thing is the reuse=tf.AUTO_REUSE parameter of the variable scope, which automatically handles already created variables and doesn’t create them twice if they already exist.

In fact, in many cases, it can solve the issue described in the second point of this list.

However, in practice, this parameter should be used with care and only when the programmer knows that some part of the code needs to be run two or more times.
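A small sketch of the behaviour, with a made-up scope and variable name: calling the same building code twice returns the existing variable instead of failing or silently creating a renamed copy.

```python
import tensorflow as tf

tf.reset_default_graph()

def get_char_embeddings():
    with tf.variable_scope('embeddings', reuse=tf.AUTO_REUSE):
        return tf.get_variable('char_embeddings', shape=[100, 50])

first = get_char_embeddings()
second = get_char_embeddings()   # reused, not recreated
print(first is second)           # True
```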

The second point is dedicated to trainable variables, and the most important note here is: all variables are trainable by default.

Sometimes this can cause a headache, because not every variable is meant to be trainable, and it is very easy to forget that all of them can be trained.
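The fix is simply to mark such variables explicitly, as in this small sketch (global_step is just an illustrative example of something you usually don’t want the optimizer to touch):

```python
import tensorflow as tf

tf.reset_default_graph()
weights = tf.get_variable('weights', shape=[10, 10])                          # trainable by default
global_step = tf.get_variable('global_step', initializer=0, trainable=False)  # excluded explicitly

print(tf.trainable_variables())   # only 'weights' will receive gradient updates
```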

The third thing is just an optimization trick, which I recommend everybody apply: in almost every case when you’re using the package installed with pip, you receive a warning like: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2.

If you see this kind of message, the best idea is to uninstall tensorflow and then re-compile it with bazel using the options you’d like.

The main benefit you’ll receive after doing so is the speed of calculations: general performance will increase significantly (sometimes even 3–4 times).

Conclusion

I hope that this long-read will be useful for data scientists who are developing their first tensorflow models and struggling with some of the non-obvious parts of the framework, which are hard to understand and quite complicated to debug.

The main point I wanted to make is that making a lot of mistakes when working with this library is perfectly fine (as it is for anything else), and that asking questions, diving deep into the docs and debugging every goddamn line is very much okay too.

As with dancing or swimming, it all comes with practice, and I hope I was able to make this practice a bit more pleasant and interesting.
