Intermediate Usage: PyTorch and Tensorflow
Revision as of 18:55, 12 April 2025
Modules For Machine Learning
The cluster provides ready-made Python environments with conda, TensorFlow, and PyTorch for machine learning users. Usage differs from a Jupyter notebook interface, since everything has to run in the background: as a user, you place all of your training/inference/testing/IO code in a Python script, which is then added as a command in the shell script section of the Slurm job submission file.
Listing available modules
To view all available modules, we can use the Lmod command:
$ module av

----------------------------------------------------------------- /usr/share/modulefiles -------------------------------
mpi/openmpi-x86_64

----------------------------------------------------------------- /opt/ohpc/pub/modulefiles ------------------------------
applications/gpu/gromacs/2024.4          applications/gpu/python/conda-25.1.1-python-3.9.21 (D)
applications/gpu/python/base-3.9.21      applications/gpu/qespresso/7.3.1

---------------------------------------------------------- /usr/share/lmod/lmod/modulefiles/Core -------------------------
lmod    settar
Modules with Tensorflow and PyTorch
The following module from the list above has both TensorFlow and PyTorch installed:
applications/gpu/python/conda-25.1.1-python-3.9.21
Loading The Python Module
We can load the module with:
module load applications/gpu/python/conda-25.1.1-python-3.9.21
Listing Conda Environments
The loaded module gives us access to a custom conda installation, and we can now list the available conda environments:
$ conda env list
# conda environments:
#
base                     /opt/ohpc/pub/conda/instdir
python-3.9.21            /opt/ohpc/pub/conda/instdir/envs/python-3.9.21
We can safely ignore the base environment and make use of the *python-3.9.21* conda environment, which has the two machine learning frameworks, TensorFlow and PyTorch.
$ conda activate python-3.9.21
(python-3.9.21)$
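With the environment active, a quick sanity check confirms that both frameworks are importable before submitting any jobs. The sketch below is a hypothetical helper (the function name `check_framework` is our own, not part of the cluster setup); it reports each framework's version, or a warning if the import fails:

```python
import importlib


def check_framework(name):
    """Return the module's version string ("unknown" if it has no
    __version__ attribute), or None when the import fails."""
    try:
        mod = importlib.import_module(name)
        return getattr(mod, "__version__", "unknown")
    except ImportError:
        return None


if __name__ == "__main__":
    # "torch" and "tensorflow" are the import names of the two frameworks
    for fw in ("torch", "tensorflow"):
        version = check_framework(fw)
        if version is None:
            print(f"{fw}: NOT importable - check the environment")
        else:
            print(f"{fw}: {version}")
```

Running this inside the activated *python-3.9.21* environment should print a version for each framework; outside it, the imports will typically fail.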
This is what we will have in the Slurm submission script.
Let's now create the Python code that will run a simple machine learning exercise with PyTorch. We will use the MNIST example from the PyTorch examples repository; run these shell commands to create the working directory and retrieve the file:
$ mkdir ~/mnist    # creating a working dir
$ cd ~/mnist       # changing directory to the working dir
$ wget https://raw.githubusercontent.com/pytorch/examples/refs/heads/main/mnist/main.py
And now we can place it in our submission script:
#!/bin/bash
#SBATCH -J gputest            # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -e %j.err             # Name of std err
#SBATCH --partition=gpu1      # Queue
#SBATCH --nodes=1             # Total number of nodes requested
#SBATCH --gres=gpu:1          # Total number of gpus requested
#SBATCH --cpus-per-task=1     #
#SBATCH --time=00:03:00       # Run time (hh:mm:ss) - 3 minutes

cd ~/mnist
module load applications/gpu/python/conda-25.1.1-python-3.9.21
conda activate python-3.9.21
python main.py
Finally, we can submit this script to Slurm, which will run the entire process in the background.
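Submission and monitoring could look like the following sketch, assuming the batch script above was saved as `submit.sh` in the working directory (the filename is our assumption; any name works):

```shell
# Submit the job; --parsable makes sbatch print only the numeric job id
jobid=$(sbatch --parsable submit.sh)
echo "submitted job $jobid"

# Check the job's state while it is queued or running
squeue -j "$jobid"

# After the job finishes, stdout is in job.<jobid>.out per the #SBATCH -o line
cat "job.${jobid}.out"
```

Since the job runs in the background, you can log out and check the output file later; nothing is lost when the terminal session ends.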