Intermediate Usage: PyTorch and Tensorflow
Revision as of 18:54, 12 April 2025
Modules For Machine Learning
The cluster has ready-made Python environments with Conda, TensorFlow, and PyTorch for machine learning users. Usage differs from a Jupyter notebook interface, since everything has to run in the background: as a user, you place all your training/inference/testing/IO code in a Python script, which is then added as a command in the shell-script section of the Slurm job submission file.
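As a minimal sketch of what "batch-friendly" means (all names and the training logic here are illustrative placeholders, not the cluster's required layout), a script meant for Slurm avoids any interactivity and writes its progress to stdout, which Slurm captures in the job's output file:

```python
# train.py -- illustrative skeleton of a batch-friendly script.
# A real script would import torch or tensorflow and do actual training.
import time

def train_one_epoch(epoch):
    # Placeholder for the real training loop.
    time.sleep(0.01)
    return 1.0 / (epoch + 1)  # pretend "loss" that shrinks each epoch

def main():
    for epoch in range(3):
        loss = train_one_epoch(epoch)
        # Print progress instead of plotting: stdout lands in the job's
        # output file, so you can inspect it after (or during) the run.
        print(f"epoch {epoch}: loss {loss:.3f}", flush=True)

if __name__ == "__main__":
    main()
```

The same structure works for inference or testing scripts: no prompts, no GUI windows, results written to stdout or to files.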
Listing available modules
To view all available modules, we can use the module command (provided by Lmod, not Slurm):
$ module av
/usr/share/modulefiles -------------------------------
mpi/openmpi-x86_64
/opt/ohpc/pub/modulefiles ------------------------------
applications/gpu/gromacs/2024.4 applications/gpu/python/conda-25.1.1-python-3.9.21 (D)
applications/gpu/python/base-3.9.21 applications/gpu/qespresso/7.3.1
/usr/share/lmod/lmod/modulefiles/Core -------------------------
lmod settarg
Modules with Tensorflow and PyTorch
The following module from the list above has both TensorFlow and PyTorch installed:
applications/gpu/python/conda-25.1.1-python-3.9.21
Loading The Python Module
We can load the module with:
$ module load applications/gpu/python/conda-25.1.1-python-3.9.21
Listing Conda Environments
The loaded module gives us access to a custom Conda installation, and we can now list the available Conda environments:
$ conda env list
# conda environments:
#
base /opt/ohpc/pub/conda/instdir
python-3.9.21 /opt/ohpc/pub/conda/instdir/envs/python-3.9.21
We can safely ignore the base environment and use the python-3.9.21 environment, which has the two machine learning frameworks, TensorFlow and PyTorch.
$ conda activate python-3.9.21
(python-3.9.21)$
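Before relying on the environment in a batch job, it is worth a quick check that the frameworks are actually importable. This small sketch (an illustrative helper, not part of the cluster's tooling) uses only the standard library to report whether each package can be found; run it inside the activated environment:

```python
# check_frameworks.py -- report whether the ML frameworks are importable
# in the currently active Python environment.
import importlib.util

def available(name):
    """Return True if the named package can be found on the Python path."""
    return importlib.util.find_spec(name) is not None

for pkg in ("torch", "tensorflow"):
    print(f"{pkg}: {'found' if available(pkg) else 'NOT found'}")
```

If either framework reports NOT found, double-check that the module was loaded and the python-3.9.21 environment is active before submitting jobs.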
These two commands, module load and conda activate, are exactly what we will put in the Slurm submission script.
Let's now create the Python code that will run a simple machine learning exercise with PyTorch. We will use the MNIST example from the PyTorch examples repository; run these shell commands to create the working directory and retrieve the file:
$ mkdir ~/mnist # creating a working dir
$ cd ~/mnist # changing directory to the working dir
$ wget https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py
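A quick sanity check (a sketch, not required) confirms the download landed in the working directory before we build the submission script around it:

```shell
# Confirm main.py was downloaded into the current directory.
if [ -f main.py ]; then
    ls -l main.py
else
    echo "main.py not found: re-run the wget command above"
fi
```

With the environment active, the upstream example also accepts a `--dry-run` flag at the time of writing (check `python main.py --help`), which performs a single training pass; that is a cheap way to confirm the script runs before submitting a full job.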
Now we can write the Slurm submission script (saved here as submit.sh) that runs it:
#!/bin/bash
#SBATCH -J gputest               # Job name
#SBATCH -o job.%j.out            # Name of stdout output file (%j expands to jobId)
#SBATCH -e job.%j.err            # Name of stderr output file
#SBATCH --partition=gpu1         # Queue
#SBATCH --nodes=1                # Total number of nodes requested
#SBATCH --gres=gpu:1             # Total number of GPUs requested
#SBATCH --cpus-per-task=1        # CPU cores per task
#SBATCH --time=00:03:00          # Run time (hh:mm:ss) - 3 minutes
cd ~/mnist
module load applications/gpu/python/conda-25.1.1-python-3.9.21
conda activate python-3.9.21
python main.py
Finally, we can submit the script to Slurm (assuming it was saved as submit.sh), which will run the entire process in the background:
$ sbatch submit.sh
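After submission, standard Slurm commands let you track the job. A sketch (guarded so it only calls squeue where Slurm is installed; the job id in the file name is illustrative):

```shell
# List your pending and running jobs; the list is empty once the job ends.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER"
else
    echo "squeue not available: run this on the cluster"
fi

# The job's stdout is captured in a file matching the "#SBATCH -o" pattern,
# e.g. job.12345.out for job id 12345; inspect it with cat or tail -f
# once the job starts running.
```

The training progress printed by main.py will appear in that output file, and any errors in the matching .err file.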