Working with TensorFlow’s TensorBoard tool
Author: Ranjeev Menon, Quant Platform Software Engineer at G-Research
What is TensorBoard?
TensorBoard, developed by the Google Brain team, is an application which visualises the various metrics output from TensorFlow, known as summaries. During the training cycle of a neural network, a researcher might want to visualise how the loss function varies with each iteration. Furthermore, they may wish to see how other metrics (such as accuracy, precision, or recall) improve over time. TensorBoard is an important component of such research, as it enables analysis to happen concurrently with the model-fitting process.
Here are some examples of TensorBoard plugins currently in use:
- Scalars plugin – visualises metrics plotted against an iteration count (per epoch, per training batch, etc.)
- Graph plugin – visualises and monitors all the operations in the neural network computation graph for CPU usage, memory consumption, etc.
- Histogram plugin – visualises distributional data output from TensorFlow
At G-Research, we are embracing open-source technologies more and more. In this spirit, I have contributed some useful TensorBoard plugins as a natural extension of my time spent in the Data Science Enablement team. In this article, I’ll discuss the process of writing these plugins and what they do.
I implemented two plugins: one to extend TensorBoard’s visualisation capabilities, and one to help users deal with TensorBoard’s tendency to load large amounts of data into memory. The motivation was as follows:
- Deep learning research often requires exploration of a large range of hyperparameter values for each model. Having a visualisation of the hyperparameter surface can aid researchers in determining the optimal set. Such a search space can have many dimensions and have non-trivial geometry, but having an interpretable representation can still be useful
- TensorBoard is quite inefficient at managing its memory usage. This is problematic for a standard research process which could easily have hundreds of runs; UI updates become painfully slow and even startup of the application on a large log directory is cumbersome. GitHub features some interesting discussions on this matter
I aimed to use the plugin system of TensorBoard to provide some solutions to the above problems (or at least mitigate current issues), without having to rewrite large parts of the existing system.
To deploy a plugin once it is developed, the TensorBoard source code must be built with the plugin code integrated since plugins are, in fact, just Python classes with an HTML front-end. Building the source code requires Bazel, Google’s open-source build system for multi-language projects. The downside of using an open-source tool here is an unstable feature set; new versions have often introduced incompatibilities while fixing existing bugs. Fortunately, I found a version that worked for me after performing a binary search of Bazel versions, settling on version 19.0.
Setting up a project to build via Bazel requires two kinds of build definition files:
- WORKSPACE files in the top-level directory linking to remote repositories containing the various source code dependencies of TensorBoard (e.g. Polymer, Node, TypeScript etc.)
- BUILD files in each subdirectory which define the code-module structure in a code repository, with additional metadata such as visibility and module dependencies
Plugin development: visualising hyperparameter search
My custom hyperparameter visualisation plugin was implemented to read configuration files from each run directory, with every such file containing a map from hyperparameter names to their values. In addition to reading the scalar data, the plugin would aggregate them into series data and then group the series data so that the variation of the metric could be seen with respect to varying only a single parameter at a time.
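The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the config file name `hparams.json`, its JSON format, and the helper names `load_hparams`, `aggregate_run` and `group_by_single_param` are all assumptions made for the example.

```python
import json
import os
from collections import defaultdict


def load_hparams(run_dir, filename="hparams.json"):
    """Read a run's hyperparameter map (file name and JSON format are assumed)."""
    with open(os.path.join(run_dir, filename)) as f:
        return json.load(f)


def aggregate_run(values, how="last"):
    """Collapse one run's scalar series into a single number."""
    if how == "last":
        return values[-1]
    if how == "min":
        return min(values)
    if how == "mean":
        return sum(values) / len(values)
    raise ValueError(f"unknown aggregation: {how}")


def group_by_single_param(runs, varying, how="last"):
    """Group runs into series in which only `varying` changes.

    `runs` maps run name -> (hparams dict, list of scalar values).
    The result is keyed by the fixed hyperparameters; each value is a list
    of (varying value, aggregated metric) pairs sorted by the varying value.
    """
    series = defaultdict(list)
    for hparams, values in runs.values():
        # Everything except the varying parameter identifies the series.
        fixed = tuple(sorted((k, v) for k, v in hparams.items() if k != varying))
        series[fixed].append((hparams[varying], aggregate_run(values, how)))
    return {fixed: sorted(points) for fixed, points in series.items()}
```

Each resulting series can then be drawn as one line on a plot, with the varying hyperparameter on the x-axis and the aggregated metric on the y-axis.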
Additional features included:
- The ability to create and delete groups of plots at the click of a button
- Using a regex to filter on metric name and parameter name for each plot group
- Selecting how to aggregate metric data per run and how to group / aggregate series data within an individual plot
All this was possible to do within the Polymer framework, using existing components implemented for other plugins. However, there were a few challenges to overcome:
- TensorBoard’s own implementation of charts is quite over-engineered and specific to plotting of scalar values against time within a single run. Hyperparameter plots encode data from many runs, so using their charts required implementing abstractions over the TensorBoard concept of a run to instead be a grouping of data into a series (with a maximum of one varying hyperparameter)
Developing this plugin taught me how I believe the TensorBoard UI library should be structured to make plugin development simpler. There should be UI components which are suitably decoupled from the internal model, with any plugin-specific components simply wrapping these base components with model-related behaviour (e.g. relating to runs). The current convention is to implement UI components for each specific plugin, with little compatibility in the data contracts between them.
Plugin development: mitigating large runs behaviour
Providing a UI that allows fine-grained control over which runs TensorBoard loads data from presented a few challenges, which were rooted mainly in the core architecture. To understand these challenges, some background knowledge is required:
- A run is a subdirectory of the main log directory which contains at least one event file. Multiple event files in the same subdirectory are read into the same run, ordered by their timestamps
- Each run has an associated EventAccumulator, which uses reservoir sampling to store only a small sample of the data in memory. Further explanation can be found here
- An EventMultiplexer contains a mapping from the run name to its associated accumulator.
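To illustrate the idea, the sketch below implements classic reservoir sampling (Algorithm R) and uses a plain dictionary in the role of the multiplexer. It is not TensorBoard's own implementation, which may well differ in detail; the names `Reservoir` and `record` are hypothetical.

```python
import random


class Reservoir:
    """Keeps a uniform random sample of at most `capacity` items from a stream."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self._rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Reservoir not yet full: keep everything.
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen existing slot.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item


def record(multiplexer, run_name, event, capacity=1000):
    """Route an event to its run's reservoir, creating the reservoir on first use."""
    if run_name not in multiplexer:
        multiplexer[run_name] = Reservoir(capacity)
    multiplexer[run_name].add(event)
```

Memory per run is bounded by `capacity`, but the total still grows with the number of runs, which is the problem discussed next.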
The main problem is that, despite the restriction on the amount of data that each accumulator can store, all runs in the log directory are loaded on startup. It is quite easy to generate a dataset of O(100) runs, each containing a moderate amount of data, which can cripple TensorBoard’s performance.
My original workaround involved restricting the number of runs that TensorBoard loaded on startup and providing a plugin to select/deselect runs which TensorBoard has loaded. Additional features included:
- Allowing regex filtering of runs into groups which would be selected or deselected at once
- Fine-grained regex filtering within the enabled groups, so that individual runs can be searched for and toggled
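The group selection logic can be sketched with Python's `re` module. The function names and the run-naming scheme in the usage example are hypothetical; the plugin's real implementation is not shown in this article.

```python
import re


def runs_matching(run_names, pattern):
    """Return the runs whose name matches `pattern` anywhere (re.search semantics)."""
    regex = re.compile(pattern)
    return [name for name in run_names if regex.search(name)]


def toggle_group(enabled, run_names, pattern, on):
    """Enable or disable every run in the group selected by `pattern`.

    `enabled` is the set of currently enabled run names; a new set is returned.
    """
    group = set(runs_matching(run_names, pattern))
    return enabled | group if on else enabled - group
```

For example, with runs named `lr0.1/bs32`, `lr0.1/bs64` and `lr0.01/bs32`, the pattern `bs32$` selects the two runs with batch size 32 as one group, and a second pattern such as `^lr0\.01/` can then switch off a single run within it.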
The UI side of things was relatively straightforward, only requiring formatting of checkboxes and regex text inputs.
The complexity arose when determining how to enable or disable a run from TensorBoard’s view of the data. Fortunately, every plugin has access to the EventMultiplexer object, which provides methods for adding and deleting runs. TensorBoard also allows the user to set the reload interval of the accumulators, which is the time between reloads of the accumulators in the background (the default is five seconds). Two special values are relevant here:
- --reload_interval=0 means that runs are only loaded once on startup and never again. This was problematic because all reloading would then have to be triggered from the plugin rather than happening in the background.
- --reload_interval=-1 means that no runs are loaded on startup, but runs that are added later are never reloaded either.
Anything bespoke requires forking the code and manually inserting logic to preload a small set of runs (or to reload runs that are subsequently added). So, rather than effectively forking TensorBoard to change a few internal parameters, I tried a different approach: I added symlinks from a temporary log directory into the actual one. TensorBoard would only see the directory of symlinks, which would hold a subset of the actual runs that could be toggled via the UI. To make this work, the log directory that all the TensorBoard plugins access had to be replaced with this new one before launching the server. This allowed for minimal overhead when deleting runs, which was done asynchronously. Here are a few problems I encountered when using this approach:
- Symlinks are fast to create and delete, so adding and deleting runs was relatively performant, but the method to create them is file system-dependent. Python’s native API for creating symlinks only works for file systems which are local to the machine on which TensorBoard is running. If the log directory resides on a network drive, which would be the case for many research clusters utilising a distributed file system, symlinks cannot be created across the network boundary
- It is possible to copy run directories instead. This still avoids having to load all the runs on startup, but it means that runs must be reloaded asynchronously: on every reload, the code must determine which runs are in the temporary log directory and copy over their data. This was not a problem for the symlink implementation, as runs were just linking to the actual data folder, so TensorBoard could reload runs in the background and the changes would be automatically visible
In the end, I opted for the more complicated run-directory-copying approach, since it circumvented the file system compatibility issues, which were the largest hindrance to the symlink approach.
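The two strategies can be compared in a short sketch using only the Python standard library. The function names and the flat, one-directory-per-run layout are assumptions for illustration; `real_logdir` is the actual log directory and `view_logdir` is the temporary one that TensorBoard is pointed at.

```python
import os
import shutil


def enable_run(real_logdir, view_logdir, run, use_symlinks=True):
    """Expose `run` in the directory that TensorBoard is pointed at."""
    src = os.path.join(real_logdir, run)
    dst = os.path.join(view_logdir, run)
    if use_symlinks:
        # Cheap, and later writes to `src` are visible through the link.
        os.symlink(os.path.abspath(src), dst, target_is_directory=True)
    else:
        # Fallback when links cannot be created: take a snapshot copy.
        shutil.copytree(src, dst)


def disable_run(view_logdir, run):
    """Hide `run` again without touching the real data."""
    dst = os.path.join(view_logdir, run)
    if os.path.islink(dst):
        os.unlink(dst)  # removes the link only, not its target
    else:
        shutil.rmtree(dst)  # removes the copied snapshot


def refresh_copies(real_logdir, view_logdir):
    """Re-copy every copied run so that newly written events become visible."""
    for run in os.listdir(view_logdir):
        dst = os.path.join(view_logdir, run)
        if not os.path.islink(dst):
            shutil.copytree(os.path.join(real_logdir, run), dst, dirs_exist_ok=True)
```

The extra `refresh_copies` step is the cost of the copying approach: it has to run on every reload, whereas symlinked runs pick up new event files without any help.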
Working with TensorBoard: conclusion
Attempting to overcome the memory problems of TensorBoard has taught me a great deal about how TensorBoard works internally, while highlighting some of the key features of its design which made development harder. Had I decided from the beginning to fork TensorBoard to modify its backend, I would have addressed the following points:
- There isn’t much of a need to store runs’ data in memory when we haven’t queried for the data. An experimental feature of TensorBoard and TensorFlow is to write summary data to an SQLite database, having one per run rather than an events file. I would propose having this as a replacement for the accumulators; each accumulator should be an SQLite database file which stores the data for a run we wish to read in TensorBoard and is well optimised for reading data from disk and for queries from the front-end. Data which arrives while running can be asynchronously appended to the database files
- Queries for data from a subset of runs should only load the data they need, with the multiplexer distributing reads across the runs and then aggregating them as necessary
- Components in the backend of TensorBoard should be easily substitutable and wired together via dependency injection so that unit testing can be done. Their interfaces should be well-defined and should only fulfil a single responsibility (e.g. launch a TensorBoard server, abstract over a run database file etc.)
- TensorBoard plugin configuration, as well as component configuration, should ideally be done in code rather than through command-line flags. This would provide a single source of truth regarding run reloading behaviour, which could be modified without forking the code.
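As a rough sketch of the first two proposals, the example below keeps one SQLite file per run and reads data only when a query asks for it. It uses Python's standard `sqlite3` module; the table schema and the class names `RunStore` and `Multiplexer` are hypothetical and do not reflect TensorBoard's experimental database format.

```python
import sqlite3
from contextlib import closing


class RunStore:
    """One SQLite file per run, holding (tag, step, value) scalar rows."""

    def __init__(self, path):
        self.path = path
        with closing(sqlite3.connect(path)) as conn, conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS scalars "
                "(tag TEXT, step INTEGER, value REAL)"
            )

    def append(self, tag, step, value):
        # `with conn` commits the transaction; `closing` closes the connection.
        with closing(sqlite3.connect(self.path)) as conn, conn:
            conn.execute("INSERT INTO scalars VALUES (?, ?, ?)", (tag, step, value))

    def read(self, tag):
        with closing(sqlite3.connect(self.path)) as conn:
            rows = conn.execute(
                "SELECT step, value FROM scalars WHERE tag = ? ORDER BY step",
                (tag,),
            )
            return rows.fetchall()


class Multiplexer:
    """Maps run names to stores; nothing is read until a query asks for it."""

    def __init__(self):
        self._stores = {}

    def add_run(self, name, path):
        self._stores[name] = RunStore(path)

    def append(self, name, tag, step, value):
        self._stores[name].append(tag, step, value)

    def query(self, tag, runs):
        # Only the requested runs are touched; all other files stay on disk.
        return {name: self._stores[name].read(tag) for name in runs}
```

Because the constructor takes a file path and the multiplexer only depends on the store's `append` and `read` methods, a test could substitute a temporary file or a different store class, which is in the spirit of the third proposal.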