
TensorBoard (Web) Tutorial - KENET HPC Cluster

Overview

TensorBoard is TensorFlow's visualization toolkit that provides comprehensive visual insights into machine learning experiments. It allows you to track and visualize metrics like loss and accuracy, visualize model computational graphs, view histograms of weights and biases, project embeddings to lower dimensional spaces, and display images, text, and audio data. TensorBoard works with TensorFlow, PyTorch, and other frameworks through appropriate integrations.

Use Cases: TensorBoard excels at visualizing neural network training progress in real-time, comparing different model architectures and hyperparameters, debugging training issues like vanishing gradients or overfitting, profiling model performance and identifying bottlenecks, analyzing model internals through weight distributions, exploring high-dimensional embeddings with dimensionality reduction, and creating publication-quality visualizations of training metrics.

Access: TensorBoard is available through the KENET Open OnDemand web portal at https://ondemand.vlab.ac.ke

Code Examples: All code examples for this tutorial are available in our GitHub repository at https://github.com/Materials-Modelling-Group/training-examples


Prerequisites

Before using TensorBoard, you should have an active KENET HPC cluster account with access to Open OnDemand. Understanding of deep learning concepts and neural networks is important. Experience with TensorFlow or PyTorch will be very helpful. You should have training logs generated from your machine learning experiments stored in a logs directory on the cluster.


Launching TensorBoard

Step 1: Access Interactive Apps

Log into Open OnDemand at https://ondemand.vlab.ac.ke. Click the Interactive Apps menu and select TensorBoard from the dropdown list.

File:OOD TensorBoard Menu.png
Navigate to Interactive Apps → TensorBoard

Step 2: Configure Job Parameters

Configure the session parameters based on the size of your log files and expected usage.

Parameter     | Description                 | Recommended Value
Partition     | Queue for job execution     | normal for visualization
Walltime      | Maximum runtime in hours    | 4-8 hours for active projects
CPU Cores     | Number of processor cores   | 2-4 cores
Memory        | RAM allocation              | 8-16 GB depending on log size
Log Directory | Path to TensorBoard logs    | /home/username/logs or your logs folder

File:OOD TensorBoard Form.png
TensorBoard configuration form with log directory

Important: Specify the correct log directory where your TensorFlow or PyTorch logging callbacks have saved data. This is typically a directory containing subdirectories for different training runs.

Step 3: Connect to TensorBoard

Once the status shows "Running," click the Connect to TensorBoard button. The TensorBoard web interface opens, displaying visualizations of your training data.

File:TensorBoard Interface.png
TensorBoard web interface

Quick Start Guide

Understanding TensorBoard Tabs

TensorBoard organizes visualizations into several tabs. The Scalars tab shows time series of metrics like loss and accuracy. The Graphs tab visualizes your model's computational graph. The Distributions tab shows how tensor values change over time. The Histograms tab provides a different view of the same distribution data. The Images tab displays image data logged during training. The Projector tab provides interactive visualization of high-dimensional embeddings using dimensionality reduction.

Logging Data for TensorBoard

To use TensorBoard, you must first generate logs during model training. The tensorboard/examples/01_tensorflow_logging.py file demonstrates setting up TensorFlow callbacks to log data, specifying the log directory, running training to generate logs, and launching TensorBoard to view results.
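
The example file itself is not reproduced here, but a minimal sketch of this pattern with Keras looks like the following; the model, dataset, and log path are illustrative placeholders, not the contents of 01_tensorflow_logging.py.

    import datetime
    import tensorflow as tf

    # Small illustrative classifier (not the model used in the example file).
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    (x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
    x_train, x_val = x_train / 255.0, x_val / 255.0

    # Give each run its own timestamped subdirectory so it appears as a
    # separate run in TensorBoard. Replace the path with your own logs folder.
    log_dir = "/home/username/logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

    model.fit(x_train, y_train, epochs=5,
              validation_data=(x_val, y_val),
              callbacks=[tb_callback])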

For PyTorch, the process is similar using the SummaryWriter from torch.utils.tensorboard as shown in tensorboard/examples/02_pytorch_logging.py.
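
A hedged sketch of that pattern (the log path and the toy training loop below are placeholders, not the actual contents of 02_pytorch_logging.py):

    import torch
    from torch.utils.tensorboard import SummaryWriter

    # One writer per run; the directory name is just an example.
    writer = SummaryWriter(log_dir="/home/username/logs/pytorch_run1")

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(100):
        x = torch.randn(32, 10)          # stand-in for a real data loader
        y = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # Anything logged with add_scalar appears under the Scalars tab.
        writer.add_scalar("train/loss", loss.item(), step)

    writer.close()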

Navigating the Interface

Use the left sidebar to select which runs to display and adjust visualization settings. The main panel shows the selected visualization type. Toggle between different metrics using the dropdown menus. Use the smoothing slider to reduce noise in metric plots. Hover over data points to see exact values. Use the download button to export plots as images or data as CSV files.

Comparing Training Runs

One of TensorBoard's most powerful features is comparing multiple training runs. Organize runs in separate subdirectories within your log directory. Each subdirectory appears as a separate run in TensorBoard. Select multiple runs using checkboxes to overlay their metrics on the same plot. This makes it easy to compare different hyperparameters, architectures, or training strategies.
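
For example, a sketch of that layout using PyTorch's SummaryWriter; the paths and the placeholder metric are illustrative only:

    from torch.utils.tensorboard import SummaryWriter

    # One subdirectory per run; TensorBoard lists each as a separate run
    # and can overlay their curves on the same plot.
    for lr in [0.1, 0.01, 0.001]:
        writer = SummaryWriter(log_dir=f"/home/username/logs/lr_study/lr_{lr}")
        for step in range(50):
            fake_loss = 1.0 / (step + 1) + lr   # placeholder for a real training loss
            writer.add_scalar("train/loss", fake_loss, step)
        writer.close()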


Common Tasks

Task 1: Monitoring Training Progress

The most common use of TensorBoard is monitoring model training in real-time. The tensorboard/examples/03_monitoring_training.py file demonstrates logging training and validation loss, recording accuracy metrics, tracking learning rate schedules, monitoring training time per epoch, and viewing updates in real-time as training progresses.

Launch TensorBoard before or during training and refresh the browser to see new data points appear. This helps identify issues like overfitting or convergence problems early.
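
As a hedged illustration of logging custom training-time scalars (learning rate, epoch time) with the tf.summary API; the path, schedule, and loss values below are placeholders rather than the contents of the example file:

    import time
    import tensorflow as tf

    writer = tf.summary.create_file_writer("/home/username/logs/monitoring_demo")

    initial_lr = 0.01
    with writer.as_default():
        for epoch in range(20):
            start = time.time()
            lr = initial_lr * 0.9 ** epoch      # a simple decay schedule
            train_loss = 1.0 / (epoch + 1)      # placeholder for the real loss
            # Scalars written here stream into the Scalars tab; refresh the
            # browser during training to see new points appear.
            tf.summary.scalar("learning_rate", lr, step=epoch)
            tf.summary.scalar("train/loss", train_loss, step=epoch)
            tf.summary.scalar("time_per_epoch_s", time.time() - start, step=epoch)
    writer.flush()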

Task 2: Visualizing Model Architecture

TensorBoard can display your neural network's computational graph. The tensorboard/examples/04_model_graph.py file shows enabling graph logging in TensorFlow, viewing the graph in the Graphs tab, understanding node types and connections, identifying potential bottlenecks, and verifying model architecture matches expectations.

The graph visualization helps understand data flow through the network and can reveal inefficiencies or errors in model design.
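
In PyTorch, for instance, a graph can be written with SummaryWriter.add_graph; the model, example input, and log path below are placeholders:

    import torch
    from torch.utils.tensorboard import SummaryWriter

    model = torch.nn.Sequential(
        torch.nn.Linear(28 * 28, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10),
    )

    writer = SummaryWriter(log_dir="/home/username/logs/graph_demo")
    example_input = torch.randn(1, 28 * 28)
    # add_graph traces the model once and stores the graph for the Graphs tab.
    writer.add_graph(model, example_input)
    writer.close()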

Task 3: Analyzing Weight Distributions

Understanding how weights evolve during training provides insights into learning dynamics. The tensorboard/examples/05_weight_distributions.py file demonstrates logging weight and bias histograms, viewing distributions over time, identifying vanishing or exploding gradients, monitoring batch normalization statistics, and comparing distributions across layers.

Use the Distributions and Histograms tabs to view this data. Healthy training typically shows weights evolving smoothly without sudden changes or saturation.
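
With Keras, setting histogram_freq on the TensorBoard callback records these automatically; in PyTorch a minimal sketch looks like the following, with the model and epoch loop as placeholders:

    import torch
    from torch.utils.tensorboard import SummaryWriter

    model = torch.nn.Linear(64, 10)
    writer = SummaryWriter(log_dir="/home/username/logs/hist_demo")

    for epoch in range(10):
        # ... run one epoch of training here ...
        # Log every weight and bias tensor; each appears in the
        # Distributions and Histograms tabs, indexed by epoch.
        for name, param in model.named_parameters():
            writer.add_histogram(name, param, epoch)
            if param.grad is not None:
                writer.add_histogram(f"{name}.grad", param.grad, epoch)

    writer.close()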

Task 4: Hyperparameter Comparison

When running hyperparameter searches, TensorBoard helps visualize results across all configurations. The tensorboard/examples/06_hyperparameter_comparison.py file shows organizing runs by hyperparameter configuration, using run names to encode hyperparameter values, comparing metrics across all runs simultaneously, identifying optimal hyperparameter ranges, and exporting comparison data for further analysis.

This systematic approach to hyperparameter tuning makes it easy to understand which parameters matter most for model performance.
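
One way to record configurations alongside metrics is the HParams plugin that ships with TensorBoard; the grid, placeholder training function, and paths below are illustrative, not the contents of the example file.

    import tensorflow as tf
    from tensorboard.plugins.hparams import api as hp

    HP_LR = hp.HParam("learning_rate", hp.Discrete([0.001, 0.01]))
    HP_BATCH = hp.HParam("batch_size", hp.Discrete([32, 64]))

    def train_and_evaluate(lr, batch_size):
        # Placeholder for a real training run; returns a fake accuracy.
        return 0.9 - lr + batch_size * 1e-4

    session = 0
    for lr in HP_LR.domain.values:
        for batch_size in HP_BATCH.domain.values:
            run_dir = f"/home/username/logs/hparam_search/run-{session}"
            with tf.summary.create_file_writer(run_dir).as_default():
                hp.hparams({HP_LR: lr, HP_BATCH: batch_size})  # record the configuration
                accuracy = train_and_evaluate(lr, batch_size)
                tf.summary.scalar("accuracy", accuracy, step=1)
            session += 1

The HParams dashboard then presents these runs in a sortable table and a parallel-coordinates view, which makes it easier to spot which settings correlate with good metrics.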

Task 5: Embedding Visualization

For high-dimensional data like word embeddings or image features, TensorBoard provides interactive dimensionality reduction. The tensorboard/examples/07_embeddings.py file demonstrates logging embedding data with metadata, using t-SNE or PCA for visualization, coloring points by labels or attributes, exploring neighborhoods in embedding space, and understanding learned representations.

This is particularly useful for understanding what your model has learned and validating that similar items are close in embedding space.
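
In PyTorch, for example, embeddings can be written with add_embedding; the random feature vectors and labels below are stand-ins for real model outputs.

    import torch
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="/home/username/logs/embedding_demo")

    # Stand-ins for learned feature vectors and their class labels.
    features = torch.randn(500, 64)
    labels = [f"class_{i % 10}" for i in range(500)]

    # The Projector tab lets you explore these points with PCA or t-SNE,
    # coloured by the metadata labels.
    writer.add_embedding(features, metadata=labels, tag="demo_embeddings")
    writer.close()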


Tips & Best Practices

Logging Strategy

Log data at appropriate intervals to balance detail with performance. For training metrics, logging every epoch or every few batches is usually sufficient. Avoid logging too frequently as it creates large log files and slows visualization. Use different log directories for different experiments to keep them organized. Include meaningful names in run directories like "model_v2_lr0.001_batch64" to make identification easy.
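
A small helper along those lines; the naming scheme and path are only a suggestion:

    import datetime

    def run_name(model_tag, lr, batch_size):
        # Encode the key hyperparameters and a timestamp in the run directory
        # name, e.g. "model_v2_lr0.001_batch64_20260110-1440".
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
        return f"{model_tag}_lr{lr}_batch{batch_size}_{stamp}"

    log_dir = "/home/username/logs/project_x/" + run_name("model_v2", 0.001, 64)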

Store logs in fast storage like scratch space during training, then move them to permanent storage for long-term retention.

Organizing Experiments

Create a hierarchical directory structure for logs. Use a top-level directory for the project, subdirectories for different model types or approaches, and individual run directories within those. This organization makes it easy to compare related runs while keeping unrelated experiments separate.

Document your experiment structure in a README file in the logs directory explaining the naming conventions and organization.

Performance Optimization

For very large models or long training runs, log files can become quite large. Reduce logging frequency for less critical metrics. In PyTorch, the SummaryWriter purge_step parameter discards stale events when an interrupted run is resumed, which prevents restarted runs from accumulating duplicate data. Compress old log directories that are no longer actively viewed. When launching TensorBoard, point it to specific experiment directories rather than the entire logs tree if you only need to view certain runs.

TensorBoard loads data lazily, so it should handle large log directories reasonably well, but more focused views will be faster.

Troubleshooting Training

TensorBoard is invaluable for debugging training issues. If loss is not decreasing, check for flat regions in the loss curve suggesting the learning rate is too low, or erratic jumps suggesting it is too high. If validation loss increases while training loss decreases, you are overfitting and need more regularization. If gradients shown in histograms are very small, you may have vanishing gradients. If weight distributions stop changing, the network may have stopped learning.

Compare problematic runs with successful ones to identify differences in behavior.


Example Workflows

Example 1: Complete Training Monitoring

Objective: Monitor a complete deep learning training run from start to finish with comprehensive logging.

Follow the workflow in tensorboard/workflows/01_complete_monitoring.py which demonstrates setting up callbacks for all relevant metrics, monitoring training and validation performance, tracking computational efficiency, visualizing model architecture, analyzing weight evolution, identifying when to stop training, and exporting final results and visualizations.
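
As a hedged illustration of the kind of callback setup such a workflow combines (the details of the actual workflow file may differ, and the log path is a placeholder):

    import datetime
    import tensorflow as tf

    log_dir = "/home/username/logs/full_run/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    callbacks = [
        # Scalars, histograms, and the model graph go to TensorBoard.
        tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1),
        # Stop when validation loss stops improving; the stopping point is
        # easy to confirm afterwards in the Scalars tab.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                         restore_best_weights=True),
    ]

    # Pass the callbacks to your own model's fit call, e.g.:
    # model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #           epochs=100, callbacks=callbacks)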

This comprehensive approach ensures you capture all information needed to understand and reproduce training runs.

Example 2: Hyperparameter Grid Search

Objective: Run a systematic hyperparameter search and use TensorBoard to identify the best configuration.

The tensorboard/workflows/02_grid_search_tensorboard.py workflow demonstrates defining a hyperparameter grid, running training for each configuration, logging all runs to organized directories, launching TensorBoard to compare results, identifying the best performing configurations, analyzing relationships between hyperparameters and performance, and documenting optimal settings.

TensorBoard makes it much easier to understand hyperparameter search results than reviewing raw numbers.

Example 3: Transfer Learning Visualization

Objective: Visualize how fine-tuning affects a pre-trained model.

Follow tensorboard/workflows/03_transfer_learning_viz.py which shows logging pre-training baseline metrics, comparing frozen versus unfrozen layer training, visualizing feature evolution through fine-tuning, monitoring learning in different network layers, and understanding which layers benefit most from fine-tuning.

This helps optimize transfer learning strategies and understand what the model learns during adaptation to new tasks.


Troubleshooting

Problem: TensorBoard shows "No dashboards active"

This message appears when no log data is found in the specified log directory. Verify the log directory path is correct and contains TensorBoard log files. Check that training has actually generated logs by looking for files in the directory. Ensure you have pointed TensorBoard to the parent directory containing run subdirectories, not a specific run directory. Try manually specifying the logdir when launching TensorBoard.

Problem: Plots not updating with new data

TensorBoard should automatically detect new log data, but sometimes requires refresh. Click the refresh button in the top-right of the TensorBoard interface. Check that the training process is actually writing new log data. Verify file permissions allow TensorBoard to read the log files. For very large log directories, TensorBoard may take time to scan for updates.

Problem: Port already in use error

If TensorBoard fails to start because the port is already in use, another TensorBoard instance may be running. Check for existing TensorBoard sessions in Open OnDemand and terminate unused ones. The port is automatically selected when launching through Open OnDemand, so this should be rare.

Problem: Out of memory when viewing logs

Very large log files or many simultaneous runs can exhaust memory. Request more memory when launching the TensorBoard session. Reduce the number of runs displayed simultaneously by using filters. Point TensorBoard to specific experiment subdirectories rather than the entire logs tree. Archive or compress old runs that are no longer needed.

Problem: Slow loading or visualization

Performance issues usually result from very large log files or many runs. Reduce logging frequency in future experiments. Use the reload interval setting to control how often TensorBoard checks for updates. Close unused browser tabs showing TensorBoard. Clear the browser cache if the interface becomes sluggish. For very large experiments, consider splitting logs into separate experiment directories and viewing them in separate TensorBoard sessions.


Additional Resources

The official TensorBoard documentation is available at https://www.tensorflow.org/tensorboard and provides detailed guides for all features. The TensorBoard GitHub repository at https://github.com/tensorflow/tensorboard contains examples and issue discussions. For deep learning best practices, consult the TensorFlow tutorials at https://www.tensorflow.org/tutorials and PyTorch documentation at https://pytorch.org/docs/.

Code Examples Repository: All code examples referenced in this tutorial are available at https://github.com/Materials-Modelling-Group/training-examples

For support, contact KENET at support@kenet.or.ke, consult the documentation at https://training.kenet.or.ke, or access the Open OnDemand portal at https://ondemand.vlab.ac.ke.


Version Information

This tutorial documents TensorBoard version 2.x running on the KENET HPC cluster. Compatible with TensorFlow 2.x and PyTorch 1.x or later. This tutorial was last updated on 2026-01-09 and is currently at version 1.0.


Back to: Easy HPC access with KENET Open OnDemand