Intermediate Usage: PyTorch and Tensorflow

From KENET Training
Revision as of 18:33, 20 May 2026

Modules For Machine Learning

The cluster has ready-made Python environments with conda, TensorFlow, and PyTorch for machine learning users. Usage differs from a Jupyter notebook interface because everything runs in the background as a batch job. As a user, you place all your training, inference, testing, and I/O code in a Python script, which is then added as a command in the shell script section of the Slurm job submission file.

Listing available modules

To view all available modules, we can use the Lmod command:

 $ module av
  ----------------------------------------------------------------- /usr/share/modulefiles -------------------------------
  mpi/openmpi-x86_64
  ----------------------------------------------------------------- /opt/ohpc/pub/modulefiles ------------------------------
  applications/gpu/gromacs/2024.4        applications/eng/gpu/python/conda-26.1.0-python-3.14
  applications/gpu/python/base-3.9.21    applications/gpu/qespresso/7.3.1
  ---------------------------------------------------------- /usr/share/lmod/lmod/modulefiles/Core -------------------------
  lmod    settar


Modules with TensorFlow and PyTorch

The following conda module from the list above has both TensorFlow and PyTorch installed:

 applications/eng/gpu/python/conda-26.1.0-python-3.14

Loading The Python (Conda) Module

We can load the module using this Lmod command:

 module load applications/eng/gpu/python/conda-26.1.0-python-3.14

Listing Conda Environments

The loaded module gives us access to a custom conda installation, and we can now list the available conda environments:

 $ conda env list
 base                     /scratch/lustre/apps/eng/gpu/miniconda3
 octave                   /scratch/lustre/apps/eng/gpu/miniconda3/envs/octave
 python-3.12              /scratch/lustre/apps/eng/gpu/miniconda3/envs/python-3.12
 python-3.14           *  /scratch/lustre/apps/eng/gpu/miniconda3/envs/python-3.14
 qgis                     /scratch/lustre/apps/eng/gpu/miniconda3/envs/qgis

We can safely ignore the base environment and use the *python-3.14* conda environment, which has the two machine learning frameworks, TensorFlow and PyTorch.


 $ conda activate python-3.14
 (python-3.14)$
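
With the environment active, a quick check can confirm that both frameworks are visible to the interpreter before any job is written. The following is a minimal sketch using only the Python standard library; the function name available_frameworks is hypothetical, and the check only tests whether the packages can be found, not whether they work with a GPU.

```python
import importlib.util


def available_frameworks(names=("torch", "tensorflow")):
    """Map each top-level module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one line per framework so missing packages are easy to spot.
    for name, found in available_frameworks().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Running this inside the python-3.14 environment should report both packages as found if the environment matches the description above.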

This is the command we will use in the Slurm submission script. Let's now create the Python code that will run a simple machine learning exercise with PyTorch. We will use the MNIST example from PyTorch. Run these shell commands to create the working directory and retrieve the files:

 $ mkdir -p ~/localscratch/mnist    # creating a working dir
 $ cd  ~/localscratch/mnist       # changing directory to the working dir
 $ wget https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py

Now we can call the Python script from our submission script. Place the following in a plain text file called torch.job:

#!/bin/bash
#SBATCH -J  gputest               # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -e %j.err             # Name of stderr output file
#SBATCH --partition=gpu1    # Partition (queue)
#SBATCH --nodes=1             # Total number of nodes requested
#SBATCH --gres=gpu:1             # Total number of GPUs requested
#SBATCH --cpus-per-task=1     # CPU cores per task
#SBATCH --time=00:03:00        # Run time (hh:mm:ss) - 3 minutes
  
cd ~/localscratch/mnist 
module load applications/eng/gpu/python/conda-26.1.0-python-3.14
conda activate python-3.14
python  main.py
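
Since the job requests a single GPU for only a few minutes, it can help to first confirm that the job actually sees a GPU. The following is a minimal sketch, assuming PyTorch is importable in the activated environment; the file name check_gpu.py and the function describe_gpu are hypothetical and not part of the cluster setup.

```python
def describe_gpu():
    """Return a one-line summary of what PyTorch can see on this node."""
    try:
        import torch  # imported lazily so the script still reports a clear message without it
    except ImportError:
        return "PyTorch is not importable in this environment"
    if not torch.cuda.is_available():
        return "PyTorch found, but no CUDA device is visible"
    count = torch.cuda.device_count()
    return f"{count} CUDA device(s) visible, first: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_gpu())
```

Temporarily replacing the `python  main.py` line in torch.job with `python check_gpu.py` would write the summary to the job's stdout file.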

Finally, we can submit this script to Slurm, which will run the entire process for us in the background.

 $ sbatch torch.job
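
On success, sbatch replies with a line of the form "Submitted batch job <id>", and the `-o job.%j.out` and `-e %j.err` settings in torch.job determine the log file names from that id. The sketch below shows the mapping; the function name log_files and the job id 12345 are made up for illustration.

```python
import re


def log_files(sbatch_output):
    """Return (stdout_file, stderr_file) for the job id found in sbatch's reply."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job id found in: {sbatch_output!r}")
    job_id = match.group(1)
    # Mirrors '#SBATCH -o job.%j.out' and '#SBATCH -e %j.err' from torch.job.
    return f"job.{job_id}.out", f"{job_id}.err"


print(log_files("Submitted batch job 12345"))  # → ('job.12345.out', '12345.err')
```

By default these files appear in the directory from which sbatch was run, so that is where to look for the training output and any error messages.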


Caveat: Downloading Data Ahead of Time

Compute nodes are typically sealed off from the internet, so all data must already be on disk before a batch job is submitted. We can therefore refactor the mixed data download and training in https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py into separate steps, as shown in this repo: https://github.com/Materials-Modelling-Group/training-examples/tree/main/mnist

The download step can be run from the login node, and the training step can be run from the batch script.
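
The split between downloading and training can be sketched as follows. This is an illustration only and may differ from how the linked repo organizes its files: the function names and DATA_ROOT are hypothetical, DATA_ROOT is assumed to match the dataset root used by the training script (check your copy of main.py), and the cache check relies on torchvision keeping raw MNIST files under `<root>/MNIST/raw`.

```python
from pathlib import Path

DATA_ROOT = Path("../data")  # assumption: must match the root the training script uses


def mnist_is_cached(root=DATA_ROOT):
    """True if the raw MNIST folder exists under root and is non-empty."""
    raw = Path(root) / "MNIST" / "raw"
    return raw.is_dir() and any(raw.iterdir())


def fetch_mnist(root=DATA_ROOT):
    """Download both MNIST splits; run this on the login node only."""
    from torchvision import datasets  # imported lazily so the cache check works without it

    for train in (True, False):
        datasets.MNIST(str(root), train=train, download=True)


def load_mnist(root=DATA_ROOT):
    """Load both splits from disk for training; fails fast instead of downloading."""
    if not mnist_is_cached(root):
        raise FileNotFoundError(f"MNIST not found under {root}; run fetch_mnist() first")
    from torchvision import datasets, transforms

    to_tensor = transforms.ToTensor()
    train = datasets.MNIST(str(root), train=True, download=False, transform=to_tensor)
    test = datasets.MNIST(str(root), train=False, download=False, transform=to_tensor)
    return train, test
```

Calling fetch_mnist() once from the login node, and load_mnist() from the script that the batch job runs, keeps all network access out of the compute node.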
