Intermediate Usage: PyTorch and Tensorflow

Modules For Machine Learning

The cluster has ready-made Python environments, managed with conda, that include both TensorFlow and PyTorch for machine learning users. Usage differs from a Jupyter notebook interface, since everything has to run in the background: as a user, you place all of your training/inference/testing/IO code in a Python script, which is then added as a command in the shell-script section of the Slurm job submission file.

Listing available modules

To view all available modules, we can use the module command:

 $ module av

/usr/share/modulefiles -------------------------------
  mpi/openmpi-x86_64

/opt/ohpc/pub/modulefiles ------------------------------
  applications/gpu/gromacs/2024.4        applications/gpu/python/conda-25.1.1-python-3.9.21 (D)
  applications/gpu/python/base-3.9.21    applications/gpu/qespresso/7.3.1

/usr/share/lmod/lmod/modulefiles/Core -------------------------
  lmod    settar
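
If you are looking for a specific package, you can also search the module tree by name. The commands below are standard Environment Modules/Lmod commands (the listing above suggests Lmod is in use); the search term python is just an example:

 $ module avail python
 $ module spider python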

Modules with Tensorflow and PyTorch

The following module from the list above has both TensorFlow and PyTorch installed:

 applications/gpu/python/conda-25.1.1-python-3.9.21

Loading The Python Module

We can load the module with:

 module load applications/gpu/python/conda-25.1.1-python-3.9.21
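
To confirm that the module is loaded, you can list the currently loaded modules (module list is a standard module command; the exact output depends on your session):

 $ module list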

Listing Conda Environments

The loaded module gives us access to a custom conda installation, and we can now list the available conda environments:

 $ conda env list
 # conda environments: 
 #
 base                   /opt/ohpc/pub/conda/instdir
 python-3.9.21          /opt/ohpc/pub/conda/instdir/envs/python-3.9.21

We can safely ignore the base environment and make use of the *python-3.9.21* conda environment; it contains the two machine learning frameworks, TensorFlow and PyTorch.

 $ conda activate python-3.9.21
 (python-3.9.21)$
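
With the environment active, a quick sanity check confirms that both frameworks import and report whether a GPU is visible (on a login node there may be no GPU; these one-liners use the standard torch.cuda.is_available and tf.config.list_physical_devices APIs):

 (python-3.9.21)$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
 (python-3.9.21)$ python -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"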

These are the commands we will place in the Slurm submission script. Let's now create the Python code that will run a simple machine learning exercise with PyTorch. We will use the MNIST example from the PyTorch examples repository; run these shell commands to create the working directory and retrieve the file:

 $ mkdir ~/mnist    # creating a working dir
 $ cd  ~/mnist      # changing directory to the working dir
 $ wget https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py
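
For orientation, here is a minimal sketch of the kind of code main.py contains: a small PyTorch model and training loop on MNIST. This is not the downloaded main.py itself, only an illustration of its structure; the job below still runs the real main.py.

 # mnist_sketch.py - illustrative only, NOT the downloaded main.py
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from torch.utils.data import DataLoader
 from torchvision import datasets, transforms
 
 class Net(nn.Module):
     def __init__(self):
         super().__init__()
         self.fc1 = nn.Linear(28 * 28, 128)    # hidden layer
         self.fc2 = nn.Linear(128, 10)         # one output per digit class
 
     def forward(self, x):
         x = torch.flatten(x, 1)               # (N, 1, 28, 28) -> (N, 784)
         return self.fc2(F.relu(self.fc1(x)))  # raw logits
 
 def main():
     # use the GPU allocated by Slurm if one is visible
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     train_set = datasets.MNIST("data", train=True, download=True,
                                transform=transforms.ToTensor())
     loader = DataLoader(train_set, batch_size=64, shuffle=True)
 
     model = Net().to(device)
     optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
     model.train()
     for epoch in range(1, 3):                 # two short epochs for illustration
         for batch_idx, (data, target) in enumerate(loader):
             data, target = data.to(device), target.to(device)
             optimizer.zero_grad()
             loss = F.cross_entropy(model(data), target)
             loss.backward()
             optimizer.step()
             if batch_idx % 100 == 0:
                 print(f"epoch {epoch} batch {batch_idx} loss {loss.item():.4f}")
 
 if __name__ == "__main__":
     main()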

Now we can call main.py from our submission script:

 #!/bin/bash
 #SBATCH -J gputest            # Job name
 #SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
 #SBATCH -e %j.err             # Name of stderr output file
 #SBATCH --partition=gpu1      # Queue (partition) to submit to
 #SBATCH --nodes=1             # Total number of nodes requested
 #SBATCH --gres=gpu:1          # Total number of GPUs requested
 #SBATCH --cpus-per-task=1     # CPU cores per task
 #SBATCH --time=00:03:00       # Run time (hh:mm:ss) - 3 minutes
 
 cd ~/mnist
 module load applications/gpu/python/conda-25.1.1-python-3.9.21
 conda activate python-3.9.21
 python main.py
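
Optionally, you can add an nvidia-smi line just before the python command in the script above; it prints the GPU(s) that Slurm has made visible to the job, which is a useful sanity check:

 nvidia-smi       # optional: show the GPU allocated to this job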

Finally, we can submit this script to Slurm, which will run the entire process in the background.
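
Assuming the submission script above was saved as submit_mnist.sh (the filename here is just an example), a typical submit-and-monitor sequence looks like this:

 $ sbatch submit_mnist.sh      # submit the job; Slurm prints the job ID
 $ squeue -u $USER             # check the job's state in the queue
 $ cat job.<jobid>.out         # inspect the stdout file named in the script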