MLflow


MLflow (Web) Tutorial - KENET HPC Cluster

Overview

MLflow is an open-source platform for managing the machine learning lifecycle. It provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. The web-based interface allows you to visualize and compare experiments, making it easier to understand model performance and iterate on machine learning projects.

Use Cases: MLflow is particularly valuable for tracking machine learning experiments with parameters and metrics, comparing different model architectures and hyperparameters, versioning and organizing trained models, creating reproducible ML workflows, collaborating with team members on ML projects, and maintaining a history of model development for research documentation.

Access: MLflow is available through the KENET Open OnDemand web portal at https://ondemand.vlab.ac.ke.

Code Examples: All code examples for this tutorial are available in our GitHub repository at https://github.com/Materials-Modelling-Group/training-examples.


Prerequisites

Before using MLflow, you should have an active KENET HPC cluster account with access to Open OnDemand. Basic knowledge of machine learning concepts and Python programming is essential. Familiarity with scikit-learn, TensorFlow, or PyTorch will be helpful. You should also have a JupyterLab or VS Code session running where you will train models and log experiments to MLflow.


Launching MLflow

Step 1: Access Interactive Apps

Log into Open OnDemand at https://ondemand.vlab.ac.ke. Click the Interactive Apps menu and select MLflow from the dropdown list.

File:OOD MLflow Menu.png
Navigate to Interactive Apps → MLflow

Step 2: Configure Job Parameters

MLflow itself is lightweight, but configure resources based on how you will use it.

Parameter           Description                 Recommended Value
Partition           Queue for job execution     normal for the tracking server
Walltime            Maximum runtime in hours    8-24 hours, to cover the project duration
CPU Cores           Number of processor cores   2-4 cores (sufficient for tracking)
Memory              RAM allocation              8 GB for typical usage
Working Directory   Where to store mlruns       Your project directory

File:OOD MLflow Form.png
MLflow configuration form

Step 3: Connect to MLflow UI

Once the session status shows "Running," click the Connect to MLflow button. This opens the MLflow web interface where you can view experiments, compare runs, and manage models.

File:MLflow Interface.png
MLflow tracking UI

Quick Start Guide

Understanding MLflow Components

MLflow consists of several key components. MLflow Tracking logs parameters, metrics, and artifacts from your experiments. The Models component packages ML models in multiple formats for deployment. The Projects component packages code in a reusable, reproducible form. The Model Registry provides a central model store for managing model versions.

For most users, the Tracking component accessed through the web UI is the primary interface. This allows you to visualize experiment history, compare runs, and analyze model performance.

Logging Your First Experiment

From a JupyterLab notebook or Python script, connect to your MLflow tracking server and log a simple experiment. The mlflow/examples/01_basic_tracking.py file demonstrates setting the tracking URI, creating experiments, logging parameters and metrics, and viewing results in the web UI.

The tracking URI points to your MLflow server. Check the connection.yml file in your session directory to find the correct hostname and port.
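
The sketch below illustrates this pattern; it is not the repository example itself, and the tracking URI, experiment name, and logged values are placeholders to replace with your own.

  import mlflow

  # Point the client at your MLflow session. The hostname and port here are
  # placeholders; copy the real values from connection.yml in your session directory.
  mlflow.set_tracking_uri("http://compute-node-01:5000")

  # Creates the experiment on first use and reuses it afterwards.
  mlflow.set_experiment("my-first-experiment")

  with mlflow.start_run(run_name="hello-mlflow"):
      # Parameters are fixed inputs; metrics are measured outputs.
      mlflow.log_param("learning_rate", 0.01)
      mlflow.log_param("batch_size", 32)
      mlflow.log_metric("accuracy", 0.87)
      mlflow.log_metric("loss", 0.35)

After the script finishes, refresh the MLflow UI and the run appears under the chosen experiment.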

Understanding the Web Interface

The MLflow web interface displays experiments in the left sidebar. Click an experiment to see all runs within it. Each run shows logged parameters, metrics, artifacts, and metadata. Use the compare feature to view multiple runs side by side. The charts visualize metric progression over time or across different runs.

Organizing Experiments

Create separate experiments for different projects or model families. Within each experiment, each training run creates a new entry. Tag runs with descriptive names and notes to make them easier to find later. Use the search and filter features to find specific runs based on parameters or metrics.


Common Tasks

Task 1: Tracking Model Training

The most common use of MLflow is tracking model training experiments. The mlflow/examples/02_model_training.py file demonstrates logging hyperparameters before training, recording metrics during and after training, saving model artifacts, logging the trained model with MLflow's model logging functions, and adding tags and notes for organization.

This creates a complete record of each training run that can be reviewed and compared later. The web UI shows all this information in an organized, searchable format.
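
As an illustration of this pattern, here is a minimal sketch (not the repository file) using a small scikit-learn classifier; the experiment name, dataset, and hyperparameters are illustrative only.

  import mlflow
  import mlflow.sklearn
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split

  mlflow.set_experiment("iris-classification")

  params = {"n_estimators": 100, "max_depth": 5}
  X_train, X_test, y_train, y_test = train_test_split(
      *load_iris(return_X_y=True), test_size=0.2, random_state=42
  )

  with mlflow.start_run(run_name="random-forest-baseline"):
      mlflow.log_params(params)                      # hyperparameters, logged before training
      model = RandomForestClassifier(**params).fit(X_train, y_train)
      accuracy = accuracy_score(y_test, model.predict(X_test))
      mlflow.log_metric("test_accuracy", accuracy)   # evaluation metric after training
      mlflow.sklearn.log_model(model, "model")       # serialized model as an artifact
      mlflow.set_tags({"dataset": "iris", "purpose": "baseline"})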

Task 2: Comparing Multiple Runs

When experimenting with different hyperparameters or model architectures, MLflow makes comparison easy. The mlflow/examples/03_hyperparameter_tuning.py file shows running multiple training iterations with different parameters, logging each run to the same experiment, using the web UI to compare results, identifying the best performing configuration, and exporting comparison results.

In the web UI, select multiple runs using checkboxes and click the Compare button to see side-by-side parameter and metric comparisons with visualization of metric differences.
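
A minimal sketch of this loop, assuming a scikit-learn model and an illustrative experiment name:

  import mlflow
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  mlflow.set_experiment("regularisation-comparison")
  X, y = load_iris(return_X_y=True)

  # One MLflow run per candidate value; all runs land in the same experiment,
  # so they can be selected and compared side by side in the UI.
  for C in [0.01, 0.1, 1.0, 10.0]:
      with mlflow.start_run(run_name=f"logreg-C-{C}"):
          mlflow.log_param("C", C)
          score = cross_val_score(LogisticRegression(C=C, max_iter=500), X, y, cv=5).mean()
          mlflow.log_metric("cv_accuracy", score)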

Task 3: Logging Artifacts

Beyond parameters and metrics, you can log any file as an artifact. The mlflow/examples/04_logging_artifacts.py file demonstrates logging plots and visualizations, saving confusion matrices and performance charts, logging dataset samples, saving feature importance information, and storing any supporting files.

Artifacts are stored in the mlruns directory and can be downloaded from the web UI for further analysis or inclusion in reports.
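
The sketch below shows the general idea, assuming matplotlib and scikit-learn are available; the file names and model are illustrative.

  import mlflow
  import matplotlib
  matplotlib.use("Agg")  # headless backend for compute nodes without a display
  import matplotlib.pyplot as plt
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import ConfusionMatrixDisplay
  from sklearn.model_selection import train_test_split

  X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)
  model = LogisticRegression(max_iter=500).fit(X_train, y_train)

  with mlflow.start_run(run_name="artifact-demo"):
      # Log a matplotlib figure directly as an image artifact.
      display = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
      mlflow.log_figure(display.figure_, "confusion_matrix.png")
      plt.close(display.figure_)

      # Any local file can be attached as an artifact as well.
      with open("notes.txt", "w") as notes:
          notes.write("Baseline logistic regression on the iris dataset.\n")
      mlflow.log_artifact("notes.txt")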

Task 4: Model Registry and Versioning

For production workflows, MLflow's model registry tracks model versions and their lifecycle stages. The mlflow/examples/05_model_registry.py file shows registering trained models, promoting models through stages like staging and production, adding descriptions and metadata, downloading registered models for deployment, and managing model versions.

This is particularly useful when working with teams to track which model versions are being used where.
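
A minimal sketch of registering and annotating a model; the model name is illustrative and the run ID is a placeholder to copy from the UI. Stage transitions are available in MLflow 2.x, although newer releases favour model aliases.

  import mlflow
  from mlflow import MlflowClient

  # Register the model artifact logged in an earlier run.
  run_id = "<run_id_from_the_ui>"  # placeholder
  result = mlflow.register_model(f"runs:/{run_id}/model", "iris-classifier")

  client = MlflowClient()
  client.update_model_version(
      name="iris-classifier",
      version=result.version,
      description="Baseline random forest for the iris tutorial.",
  )
  # Promote the version through lifecycle stages (MLflow 2.x).
  client.transition_model_version_stage(
      name="iris-classifier", version=result.version, stage="Staging"
  )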

Task 5: Auto-logging with Frameworks

MLflow provides auto-logging for popular frameworks that automatically captures parameters and metrics. The mlflow/examples/06_autolog.py file demonstrates enabling auto-logging for scikit-learn, using auto-logging with TensorFlow and Keras, auto-logging PyTorch models, and combining auto-logging with custom logging.

Auto-logging reduces boilerplate code and ensures consistent tracking across projects.
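
For example, a short sketch with scikit-learn autologging (dataset and model choice are illustrative):

  import mlflow
  import mlflow.sklearn
  from sklearn.datasets import load_iris
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import train_test_split

  # With autologging enabled, parameters, training metrics, and the fitted model
  # are captured without explicit log_* calls.
  mlflow.sklearn.autolog()

  X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)

  with mlflow.start_run(run_name="autolog-demo"):
      GradientBoostingClassifier(n_estimators=50).fit(X_train, y_train)
      # Custom logging still works alongside autologging.
      mlflow.set_tag("note", "parameters and model captured by autolog")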


Tips & Best Practices

Experiment Organization

Create meaningful experiment names that reflect the project or research question. Use consistent naming conventions for runs within an experiment. Tag runs with relevant metadata like dataset version, preprocessing steps, or model architecture. Add descriptive notes explaining the purpose of each run or any unusual configurations.

Keep your mlruns directory organized and backed up. Consider using version control for your code alongside MLflow tracking for complete reproducibility.

What to Log

Log all hyperparameters that affect model behavior. Record both training and validation metrics to detect overfitting. Save learning curves showing metric progression during training. Log model evaluation metrics on test sets. Include computational metrics like training time and resource usage. Save artifacts like confusion matrices, ROC curves, and feature importance plots.

Avoid logging excessive data as metrics, which can slow down the UI; save detailed data as artifacts instead.

Performance Considerations

MLflow tracking has minimal overhead, but logging very frequently can slow training. Log metrics at reasonable intervals rather than every iteration. Use batch logging when possible. For very long experiments, consider logging to local storage first and syncing to the tracking server periodically.
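
A small sketch of interval and batch logging; the training step here is a dummy stand-in for your own loop.

  import math
  import mlflow

  def train_one_epoch(epoch):
      # Dummy training step; replace with your real training code.
      return math.exp(-epoch / 20), math.exp(-epoch / 25)

  with mlflow.start_run(run_name="interval-logging"):
      for epoch in range(100):
          train_loss, val_loss = train_one_epoch(epoch)
          # Log every fifth epoch, batching both metrics into a single call
          # instead of calling log_metric on every iteration.
          if epoch % 5 == 0:
              mlflow.log_metrics({"train_loss": train_loss, "val_loss": val_loss}, step=epoch)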

Store the mlruns directory in fast storage, such as scratch space, for better performance with large experiments.

Collaboration

When working with a team, agree on experiment naming conventions. Use the model registry to communicate which models are ready for deployment. Add detailed notes to runs explaining decisions and results. Export important comparisons as CSV or images for sharing in presentations or papers.


Example Workflows

Example 1: Complete ML Experiment

Objective: Track a complete machine learning project from data exploration to final model.

Follow the workflow in mlflow/workflows/01_complete_experiment.py which demonstrates creating an experiment for the project, logging data exploration steps, training multiple model types, comparing results in MLflow UI, selecting the best model, registering it in the model registry, and generating a final report with exported metrics.

This workflow shows how to use MLflow throughout the entire ML development process.

Example 2: Hyperparameter Optimization

Objective: Systematically tune model hyperparameters and track all trials.

The mlflow/workflows/02_hyperparameter_optimization.py workflow demonstrates defining hyperparameter search space, running grid search or random search, logging each trial to MLflow, using the UI to identify best parameters, analyzing parameter importance, and documenting optimal configuration.

MLflow's comparison features make it easy to understand which parameters have the most impact on performance.
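
A sketch of this pattern using nested runs, one child run per trial, with an illustrative scikit-learn grid; it is not the workflow file itself.

  import mlflow
  from sklearn.datasets import load_iris
  from sklearn.model_selection import ParameterGrid, cross_val_score
  from sklearn.svm import SVC

  mlflow.set_experiment("svm-grid-search")
  X, y = load_iris(return_X_y=True)
  grid = ParameterGrid({"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1]})

  # Parent run for the whole search; one nested child run per trial keeps the UI tidy.
  with mlflow.start_run(run_name="grid-search"):
      best = {"score": 0.0, "params": None}
      for params in grid:
          with mlflow.start_run(run_name=str(params), nested=True):
              score = cross_val_score(SVC(**params), X, y, cv=5).mean()
              mlflow.log_params(params)
              mlflow.log_metric("cv_accuracy", score)
              if score > best["score"]:
                  best = {"score": score, "params": params}
      # Record the winning configuration on the parent run.
      mlflow.log_params({f"best_{k}": v for k, v in best["params"].items()})
      mlflow.log_metric("best_cv_accuracy", best["score"])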

Example 3: Deep Learning Experiment Tracking

Objective: Track neural network training with TensorFlow or PyTorch including learning curves and model checkpoints.

Follow mlflow/workflows/03_deep_learning_tracking.py which shows setting up auto-logging for deep learning frameworks, logging training and validation metrics per epoch, saving model checkpoints as artifacts, logging architecture diagrams, tracking GPU utilization, and comparing different network architectures.

This workflow leverages MLflow's deep learning integrations for comprehensive experiment tracking.
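
As an illustration, a minimal Keras sketch with TensorFlow autologging on synthetic data (not the workflow file itself; no GPU is required):

  import mlflow
  import numpy as np
  import tensorflow as tf

  # Autologging for TensorFlow/Keras captures per-epoch metrics, optimizer
  # settings, and the final model without explicit log_* calls.
  mlflow.tensorflow.autolog()

  # Small synthetic dataset so the sketch runs anywhere.
  X = np.random.rand(500, 20).astype("float32")
  y = (X.sum(axis=1) > 10).astype("int32")

  model = tf.keras.Sequential([
      tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
      tf.keras.layers.Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

  with mlflow.start_run(run_name="keras-autolog-demo"):
      model.fit(X, y, validation_split=0.2, epochs=5, batch_size=32)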


Troubleshooting

Problem: Cannot connect to MLflow server

If your Python code cannot connect to the MLflow tracking server, verify that the tracking URI is correct by checking the connection.yml file in your MLflow session directory; the URI should have the form http://hostname:port. Ensure your JupyterLab or VS Code session is running on a compute node that can reach the MLflow server. Using localhost in the URI only works if MLflow and your code run on the same node.
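
A quick connectivity check, with a placeholder hostname and port to replace with the values from connection.yml:

  import mlflow
  from mlflow import MlflowClient

  # Placeholder URI; use the hostname and port from connection.yml.
  mlflow.set_tracking_uri("http://compute-node-01:5000")

  # A successful listing confirms the tracking server is reachable from this node.
  for experiment in MlflowClient().search_experiments():
      print(experiment.experiment_id, experiment.name)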

Problem: Experiments not appearing in UI

If experiments do not appear in the web interface, check that you are logging to the correct tracking URI. Verify the mlruns directory exists and has the expected structure. Refresh the web UI to see new experiments. Check file permissions on the mlruns directory. If experiments were logged before the tracking server started, they should appear after refresh.

Problem: Artifacts not uploading

Artifact upload failures usually indicate permission or disk space issues. Check available disk space in your working directory. Verify write permissions on the mlruns directory. For very large artifacts, consider logging them to a shared storage location. Check that the artifact path in your code is correct and the file exists.

Problem: Slow UI performance

The MLflow UI can become slow with thousands of runs. Archive old experiments by moving their directories out of mlruns. Delete unnecessary runs from experiments. Use filtering and search instead of displaying all runs. For very large deployments, consider using a database backend instead of file storage. Limit the number of metrics logged per run.

Problem: Model loading fails

If loading a logged model fails, ensure the same Python environment and package versions are available. Check that the required dependencies were logged with the model. Verify that the model path in mlruns exists and contains the expected files. Some models require specific frameworks to be imported before loading. Check that the flavor (sklearn, pytorch, etc.) matches what you are trying to load.
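
A short loading sketch; the run ID is a placeholder to copy from the run page in the UI.

  import mlflow

  # Placeholder run ID; copy the real one from the MLflow UI.
  model_uri = "runs:/<run_id_from_the_ui>/model"

  # Generic loader that works for any logged flavor (sklearn, pytorch, tensorflow, ...).
  model = mlflow.pyfunc.load_model(model_uri)

  # Flavor-specific loading is also possible when the native object is needed:
  # sk_model = mlflow.sklearn.load_model(model_uri)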


Additional Resources

The official MLflow documentation is available at https://mlflow.org/docs/latest/index.html and provides comprehensive guides and API reference. The MLflow GitHub repository at https://github.com/mlflow/mlflow contains examples and community discussions. For machine learning best practices, consult scikit-learn documentation at https://scikit-learn.org/ and TensorFlow at https://www.tensorflow.org/.

Code Examples Repository: All code examples referenced in this tutorial are available at https://github.com/Materials-Modelling-Group/training-examples

For support, contact KENET at support@kenet.or.ke, consult the documentation at https://training.kenet.or.ke, or access the Open OnDemand portal at https://ondemand.vlab.ac.ke.


Version Information

This tutorial documents MLflow version 2.x running on the KENET HPC cluster. It was last updated on 2026-01-09 and is at document version 1.0.


Back to: Easy HPC access with KENET Open OnDemand