Intermediate Usage: PyTorch and Tensorflow

From KENET Training

Revision as of 15:13, 8 May 2025

Modules For Machine Learning

The cluster provides ready-made Python environments, managed with conda, that include both TensorFlow and PyTorch for machine learning users. Usage differs from a Jupyter notebook interface because everything runs in the background as a batch job. As a user, you place all your training, inference, testing, and I/O code in a Python script, and then add a command that runs that script to the shell section of the Slurm job submission file.

Listing available modules

To view all available modules, we can use the module (Lmod) command:

 $ module av
  ----------------------------------------------------------------- /usr/share/modulefiles -------------------------------
  mpi/openmpi-x86_64
  ----------------------------------------------------------------- /opt/ohpc/pub/modulefiles ------------------------------
  applications/gpu/gromacs/2024.4        applications/gpu/python/conda-25.1.1-python-3.9.21 (D)
  applications/gpu/python/base-3.9.21    applications/gpu/qespresso/7.3.1
  ---------------------------------------------------------- /usr/share/lmod/lmod/modulefiles/Core -------------------------
  lmod    settar

Modules with Tensorflow and PyTorch

The following conda module from the list above has both TensorFlow and PyTorch installed:

 applications/gpu/python/conda-25.1.1-python-3.9.21

Loading The Python (Conda) Module

We can load the module using the module load command:

 $ module load applications/gpu/python/conda-25.1.1-python-3.9.21

Listing Conda Environments

The loaded module gives us access to a custom conda installation, and we can now list the available conda environments:

 $ conda env list

 base                   /opt/ohpc/pub/conda/instdir
 python-3.9.21          /opt/ohpc/pub/conda/instdir/envs/python-3.9.21

We can safely ignore the base environment and use the python-3.9.21 conda environment, which contains the two machine learning frameworks, TensorFlow and PyTorch. We activate it as follows:

 $ conda activate python-3.9.21
 (python-3.9.21)$
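
Before submitting a job, it can be useful to confirm that both frameworks are visible in the activated environment. The following is a minimal sketch using only the Python standard library. The helper name find_frameworks is hypothetical and not part of the cluster setup. It only checks whether the packages can be located and does not import them:

```python
import importlib.util


def find_frameworks(names=("torch", "tensorflow")):
    """Return a dict mapping each top-level package name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one line per framework, e.g. when run inside the python-3.9.21 environment.
    for name, found in find_frameworks().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either package is reported as missing, the most likely cause is that the conda environment was not activated in the current shell.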

The same activation command will go into the Slurm submission script. Let's now get the Python code that will run a simple machine learning exercise with PyTorch. We will use the MNIST example from the PyTorch examples repository. Run these shell commands to create the working directory and retrieve the file:

 $ mkdir -p ~/localscratch/mnist    # creating a working dir
 $ cd  ~/localscratch/mnist       # changing directory to the working dir
 $ wget https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py

Now we can call the Python script from our submission script. Place the following in a plain text file called torch.job:

#!/bin/bash
#SBATCH -J  gputest               # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -e %j.err             # Name of std err
#SBATCH --partition=gpu1    # Queue
#SBATCH --nodes=1             # Total number of nodes requested
#SBATCH --gres=gpu:1             # Total number of gpus requested
#SBATCH --cpus-per-task=1     # Number of CPU cores per task
#SBATCH --time=00:03:00        # Run time (hh:mm:ss) - 3 minutes
  
cd ~/localscratch/mnist 
module load applications/gpu/python/conda-25.1.1-python-3.9.21
conda activate python-3.9.21  
python  main.py
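
When a GPU job behaves unexpectedly, a short diagnostic script run just before main.py can show what the job actually received. The sketch below is an illustration, not part of the original example. The function name describe_allocation is hypothetical, and it assumes that Slurm exports SLURM_JOB_ID and, for GPU allocations, CUDA_VISIBLE_DEVICES, which depends on how the site has configured Slurm:

```python
import os


def describe_allocation(env=os.environ):
    """Summarise the job ID, visible GPUs, and whether PyTorch can use CUDA."""
    job_id = env.get("SLURM_JOB_ID", "not set")
    gpus = env.get("CUDA_VISIBLE_DEVICES", "not set")  # assumption: set by Slurm for GPU jobs
    lines = [f"SLURM_JOB_ID={job_id}", f"CUDA_VISIBLE_DEVICES={gpus}"]
    try:
        import torch  # only available inside the python-3.9.21 environment
        lines.append(f"torch.cuda.is_available()={torch.cuda.is_available()}")
    except ImportError:
        lines.append("torch is not importable in this environment")
    return lines


if __name__ == "__main__":
    print("\n".join(describe_allocation()))
```

Saved under a name of your choice in the working directory, it could be called from the job script on the line before python main.py, so that its report appears at the top of the job's stdout file.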

Finally, we can submit this script to Slurm, which will run the entire process for us in the background.

 $ sbatch torch.job
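
On success, sbatch prints a line of the form "Submitted batch job <id>", and the -o and -e patterns in torch.job determine the log file names for that ID. As a rough sketch, submission and log lookup could be scripted as follows. The helper names parse_job_id, log_paths, and submit are hypothetical:

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)


def log_paths(job_id):
    """Return the stdout and stderr file names implied by '-o job.%j.out' and '-e %j.err'."""
    return f"job.{job_id}.out", f"{job_id}.err"


def submit(script="torch.job"):
    """Submit the job script with sbatch and return the job ID (needs Slurm on PATH)."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

For example, log_paths(submit()) would give the two files to inspect once the job has started. By default these are written relative to the directory from which the job was submitted.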

Watch Demo: https://asciinema.org/a/m8HJLldFQk0SrrpOYOAIQQrYj

Next: Module_system

Up: HPC_Usage