AllianceCan: usage notes
Build venv via SSH
- On terminal:
ssh user_name@rorqual.alliancecan.ca
- Copy-paste:
cd /home/user_name/links/projects/def-user_name-ab/user_name
module reset
module load StdEnv/2023 python/3.11
python3.11 -m venv py311
source py311/bin/activate
pip install --no-index --upgrade pip
# Clear the pip cache first
pip cache purge
# Install using a custom temp directory
mkdir -p /scratch/$USER/tmp
export TMPDIR=/scratch/$USER/tmp
# packages you'll need
pip install torch
pip install ipython tqdm transformers optuna triton scikit-learn
pip install --no-cache-dir xgboost
pip install --no-cache-dir scikit-survival
# required to use same kernel in Jupyter Hub
pip install ipykernel
# create a kernel for use in Jupyter Hub
python -m ipykernel install --user --name myenv311 --display-name "py311"
- Optionally, log in to Jupyter Hub: https://jupyterhub.rorqual.alliancecan.ca/
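In a later SSH session the environment has to be activated again before it can be used. A minimal sketch, assuming the venv is the py311 directory created above; the helper name activate_py311 is hypothetical, and the module step is skipped on machines where the module command is not available:

```shell
# Hypothetical helper: re-activate the venv built above in a new session.
activate_py311() {
    venv_dir=${1:-py311}    # path to the venv; defaults to ./py311
    if [ ! -f "$venv_dir/bin/activate" ]; then
        echo "no virtual environment found at $venv_dir" >&2
        return 1
    fi
    # Load the same modules that were used to build the venv (cluster only).
    if command -v module >/dev/null 2>&1; then
        module reset
        module load StdEnv/2023 python/3.11
    fi
    # Source the activation script into the current shell.
    . "$venv_dir/bin/activate"
}
```

Run activate_py311 from the directory that contains py311, or pass the path to the venv as the first argument.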
Sockeye: usage notes for working in the ARC (Advanced Research Computing) environment
Contents
- One-time setup
- Frequent Linux commands
- Modules
- Conda environments
- Running an interactive job (with GPU)
- Running an offline job in Slurm
One-time setup
- Apply for Sockeye allocation: https://flex.redcap.ubc.ca/surveys/?s=7MKJT898LK
- Set up multi-factor authentication. This step is mandatory; without it you will not be able to SSH in.
- Install myVPN:
  - Setup guide for Mac users
  - Windows users may need to email and request a OneDrive link to download an installer, as Lisa did in April 2024
Frequent Linux commands
print_members
Output:
####################################
Allocation members for st-username-1
####################################
sli (Sears Li)
moak1 (Maya Oak)
Modules
To find out which module(s) must be loaded before a piece of software can be used, query the module system first.
For instance, to use Git, one would issue:
$ module spider git
Which gives an output like this:
For detailed information about a specific "git" package (including how to load the modules) use the module's full name. Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider git/2.41.0
Following the suggested query, one is advised to load a version of the gcc module first.
Hence, to use Git, one would issue a command like this:
module load gcc/5.5.0 git/2.41.0
Conda environments
New Python environment with ipython
conda create -n "py3.12" python=3.12 ipython
conda activate py3.12
Replicate the exact environment described in an environment.yml
conda env create -f environment.yml
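For reference, an environment.yml matching the environment created above could look like the sketch below. The file contents are an assumption derived from the conda create command in this section, not a file shipped with these notes:

```shell
# Write a minimal environment.yml (contents are illustrative).
cat > environment.yml <<'EOF'
name: py3.12
dependencies:
  - python=3.12
  - ipython
EOF

# Then build the environment from it:
#   conda env create -f environment.yml
```

An existing environment can also be written out with conda env export, which pins the exact package versions.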
Running an interactive job (with GPU)
salloc --time=10:0:0 --mem=3G --nodes=1 --ntasks=2 --account=st-username-1-gpu --gpus=1
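Once salloc returns a shell, it can be useful to confirm that the session really is inside the allocation and that a GPU is visible. A small sketch; in_slurm_job is a hypothetical helper name, and nvidia-smi is assumed to be installed on the GPU nodes:

```shell
# Hypothetical helper: succeeds only inside a Slurm allocation,
# where Slurm sets SLURM_JOB_ID.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside Slurm job $SLURM_JOB_ID"
    # List the GPUs visible to this session, if the tool is available.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    fi
else
    echo "Not inside a Slurm allocation"
fi
```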
Running an offline job in Slurm
- Create job specification file
- Submit on the job-queue
- Wait for job release and job completion, which should give you log file(s) as specified via the --error and --output switches.
Here's an example job specification:
#!/bin/bash
#SBATCH --account=username # Allocation code
#SBATCH --nodes=1 # Number of nodes for each sub-job.
#SBATCH --ntasks-per-node=1 # Number of tasks per node for each sub-job.
#SBATCH --time=X:00:00 # Estimating X hours of runtime, e.g. X=3 (job will not be complete if actual runtime needed exceeds X)
#SBATCH --mem=YG # Estimating Y GB of memory needed, e.g. Y=8 (will not run successfully if actual memory needed exceeds Y)
#SBATCH --output=logs/array_%A_%a.out # [optional] Redirects output to a unique file for each sub-job.
#SBATCH --error=logs/array_%A_%a.err # [optional] Redirects error logs to a unique file for each sub-job.
#SBATCH --mail-user=your_email_addr@ca # [optional] Email address for job notifications
#SBATCH --job-name=nps_job_array # [optional] Job name
#SBATCH --mail-type=ALL # [optional] Email notifications for ALL job events [other options include BEGIN, END, FAIL]
# resets language
export LC_ALL=C; unset LANGUAGE
# Load necessary modules
module load gcc/5.5.0
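The three steps listed at the start of this section can be sketched end to end as follows. The file name job.sh, the job name, and the 3 hours / 8 GB values are illustrative assumptions, and the account is the same placeholder as above. The logs directory is created first because Slurm does not create it for you:

```shell
# 1. Create the job specification file (placeholder values throughout).
mkdir -p logs

cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=username
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=3:00:00
#SBATCH --mem=8G
#SBATCH --job-name=example_job
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

export LC_ALL=C; unset LANGUAGE
module load gcc/5.5.0
echo "Running on $(hostname)"
EOF

# 2. Submit the job to the queue (only where sbatch is available).
if command -v sbatch >/dev/null 2>&1; then
    sbatch job.sh
else
    echo "sbatch not found; job.sh written but not submitted"
fi

# 3. Check on the job while waiting:
#   squeue -u $USER
```

Here %x and %j expand to the job name and job ID, so each run writes its own pair of log files.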
How much time is left before a job ends
echo "$(squeue -h -j $SLURM_JOBID -o %L)"
For instance, squeue -u $USER shows you the status:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3951233 gpu interact xxx R 9:19:53 1 se353
Then you can query the remaining time like this:
echo "$(squeue -h -j 3951233 -o %L)"
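The %L field typically prints the remaining time as MM:SS, HH:MM:SS or D-HH:MM:SS. If a script needs that value as a number, for example to decide whether there is time for one more iteration, it can be converted to seconds. A minimal sketch; slurm_time_to_seconds is a hypothetical helper name, and values such as UNLIMITED or NOT_SET are rejected with a non-zero exit status:

```shell
# Hypothetical helper: convert MM:SS, HH:MM:SS or D-HH:MM:SS to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk -F: '
        # Reject anything that is not a plain Slurm time string.
        $0 !~ /^([0-9]+-)?[0-9]+(:[0-9]+)(:[0-9]+)?$/ { exit 1 }
        {
            days = 0
            if (index($1, "-") > 0) {      # optional "D-" day prefix
                split($1, d, "-")
                days = d[1]
                $1 = d[2]
            }
            if (NF == 3) secs = $1 * 3600 + $2 * 60 + $3
            else         secs = $1 * 60 + $2
            print days * 86400 + secs
        }'
}
```

Inside a job it could be used as left=$(slurm_time_to_seconds "$(squeue -h -j "$SLURM_JOBID" -o %L)").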
Misc.
Graham has several types of GPUs, some of which are available with less wait:
Count  Model  Details
320    p100   2/node, 12GB, original
70     v100   8/node, 16GB, newer, about 50% faster than P100 and with tensor cores
144    t4     4/node, 16GB, newer, about half a V100, for compute & AI except much slower FP64