Merged
3 changes: 2 additions & 1 deletion src/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -26,10 +26,11 @@
- [Challenges](./chapter2/challenges.md)

- [M3](./chapter3/chapter3.md)
- [Getting Started](./chapter3/start.md)
- [Logging In](./chapter3/login.md)
- [Linux Commands](./chapter3/linux-cmds.md)
- [Compiling](./chapter3/compiling.md)
- [M3's Shared Filesystem](./chapter3/shared-fs.md)
- [Software and Tooling](./chapter3/software-tooling.md)
- [Bash Scripts](./chapter3/bash.md)
- [Job batching & SLURM](./chapter3/slurm.md)
- [Challenges](./chapter3/challenges.md)
41 changes: 41 additions & 0 deletions src/chapter3/bash.md
@@ -1 +1,42 @@
# Bash Scripts

Bash is both a command line interface and a scripting language. The commands you type into a Linux terminal are usually interpreted by Bash. A Bash script is a series of commands that are executed in order. Scripts are useful for automating tasks that you do often, or for running a series of commands that you don't want to type out every time. In our case, Bash scripts are used for running jobs on M3.

In practice, any command you would normally run in the terminal can be placed in a script and easily reused. For example, you could have a script that automatically navigates to a directory and activates a virtual environment, or a script that compiles and runs a C program.

The basic syntax of a bash file is as follows:

```bash
#!/bin/bash

# This is a comment

echo "Hello World"
```

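We can save this file as `hello.sh` and run it using the following command: `bash hello.sh`. This will print `Hello World` to the terminal. (`source hello.sh` also works, but it runs the commands in your current shell rather than in a new one.)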

Let's walk through the file. The first line is `#!/bin/bash`. This is called a shebang, and it tells the system to use Bash to interpret the file when it is executed directly. The next line, starting with `#`, is a comment and is ignored by Bash. The last line is the actual command that we want to run. In this case, we are using the `echo` command to print `Hello World` to the terminal.
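
The shebang matters most when a script is run directly by its path rather than being passed to `bash` or `source`. As a minimal sketch (the temporary directory is only an assumption to avoid touching your own files), the following writes the same `hello.sh`, marks it executable with `chmod +x`, and runs it:

```bash
#!/bin/bash

# Work in a throwaway directory so nothing existing is overwritten
workdir=$(mktemp -d)

# Write the example script from above into hello.sh
cat > "$workdir/hello.sh" <<'EOF'
#!/bin/bash

# This is a comment

echo "Hello World"
EOF

# Give the file execute permission, then run it directly.
# The system reads the shebang to decide which interpreter to use.
chmod +x "$workdir/hello.sh"
output=$("$workdir/hello.sh")
echo "$output"
```

Without the `chmod +x` step, running the file by its path would normally fail with a "Permission denied" error.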

Bash can do a lot more, including basic arithmetic, if statements, loops, and functions. However, these are not really necessary for what we are doing. If you want to learn more about Bash, you can find a good tutorial [here](https://linuxconfig.org/bash-scripting-tutorial).
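
For a quick taste of those features, here is a small illustrative sketch. The function name `add` and the variable `total` are made up for this example and are not used anywhere else in this book:

```bash
#!/bin/bash

# A function that adds two integers using arithmetic expansion
add() {
    echo $(( $1 + $2 ))
}

total=0

# Loop over the numbers 1 to 5 and accumulate their sum
for i in 1 2 3 4 5; do
    total=$(add "$total" "$i")
done

# An if statement comparing integers
if [ "$total" -gt 10 ]; then
    echo "Total is $total, which is greater than 10"
else
    echo "Total is $total"
fi
```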

For our use, the main thing we need to be able to do is run executables and files. We do this in exactly the same way as we would when running them manually in the terminal. For example, if we want to run a Python script, we can do the following:

```bash
#!/bin/bash

# This will run hello.py using the python3 executable
python3 hello.py
```

If we want to compile and then run a C program, we can do the following:

```bash
#!/bin/bash

# This will compile hello.c and then run it
gcc hello.c -o hello
./hello
```

Using bash scripts not only saves a lot of time and effort, but it also makes it easier to run jobs on M3 using SLURM. We will go over how to do this in the next section.
14 changes: 14 additions & 0 deletions src/chapter3/challenges.md
@@ -1 +1,15 @@
# Challenges

## Challenge 1

Something simple to start off. Create a bash script called `hello.sh` that prints "Hello World" to the screen. Submit this job to the queue using `sbatch`. Check the status of the job using `squeue`. Once the job has finished, check the output using `cat`. You can find the output file in the directory you submitted the job from.

## Challenge 2

Something a bit more involved. Clone your [challenges repository](https://github.com/MonashDeepNeuron/HPC-Training-Challenges.git) into your personal folder in the scratch directory. Then, in this same directory, create a submission script that will install Python 3.10 using Miniconda, create a virtual environment, install the necessary dependencies, and clone and run the `alexnet_stl10.py` script in the M3 section. Remember not to load Python directly using `module`. Instead, follow the instructions in the [software tooling](./software-tooling.md#python) chapter.
Once completed, commit and push your changes as well as the output.

## Challenge 3

A continuation of Challenge 2. Edit your submission script so that you get a GPU node, and run the script using the GPU.
Commit and push your changes as well as the output.
6 changes: 6 additions & 0 deletions src/chapter3/chapter3.md
@@ -1 +1,7 @@
# M3

[M3](https://docs.massive.org.au/M3/index.html) is part of [MASSIVE](https://www.massive.org.au/), which is a High Performance Computing facility for Australian scientists and researchers. Monash University is a partner of MASSIVE, and provides a majority of the funding for it. M3 is made up of multiple different types of servers, with a total of 5673 cores, 63.2TB of RAM, 5.6PB of storage, and 1.7 million CUDA cores.

M3 utilises the [Slurm](https://slurm.schedmd.com/) workload manager, which is a job scheduler that allows users to submit jobs to the cluster. We will learn a bit more about this later on.

This book will take you through the basics of connecting to M3, submitting jobs, and transferring data to and from the system. If you want to learn more about M3, you can read the [M3 documentation](https://docs.massive.org.au/M3/index.html). This will give you a more in-depth look at the system, and how to use it.
1 change: 0 additions & 1 deletion src/chapter3/compiling.md

This file was deleted.

Binary file added src/chapter3/imgs/aaf.png
Binary file added src/chapter3/imgs/gurobi.png
Binary file added src/chapter3/imgs/gurobi2.png
Binary file added src/chapter3/imgs/hpcid.png
Binary file added src/chapter3/imgs/join_project.png
Binary file added src/chapter3/imgs/putty_key_not_cached.png
Binary file added src/chapter3/imgs/putty_start.png
46 changes: 46 additions & 0 deletions src/chapter3/linux-cmds.md
@@ -1 +1,47 @@
# Linux Commands

Even if you are already familiar with Linux, please read through all of these commands, as some are specific to M3.

## Basic Linux Commands

| Command | Function |
| --- | --- |
| `pwd` | prints the current directory |
| `ls` | lists the files and directories in the current directory (add `-a` to list everything, including hidden files and directories) |
| `mkdir <directory>` | makes a directory |
| `rm <filename>` | deletes *filename*. Add `-r` to delete a directory. Add `-f` to force deletion (be really careful with that one) |
| `cd <directory>` | changes the current directory |
| `vim` or `nano` | opens a text editor |
| `cat <filename>` | prints the contents of a file to the terminal |
| `echo` | prints whatever you put after it |
| `chmod <mode> <filename>` | changes the permissions of a file |
| `cp <source> <destination>` | copies a file (add `-r` to copy a directory) |
| `mv <source> <destination>` | moves or renames a file or directory |

> Note: `.` and `..` are special directories. `.` is the current directory, and `..` is the parent directory. These can be used when using any command that takes a directory as an argument. Similar to these, `~` is the home directory, and `/` is the root directory. For example, if you wanted to copy something from the parent directory to the home directory, you could do `cp ../<filename> ~/`, without having to navigate anywhere.
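
To see several of these commands and the special directories working together, here is a short illustrative session. It runs entirely inside a throwaway directory created with `mktemp -d`, and the file and directory names are made up for this example:

```bash
#!/bin/bash

# Work inside a throwaway directory so no real files are touched
sandbox=$(mktemp -d)
cd "$sandbox"

# Make a directory with a subdirectory, and put a file in the parent
mkdir -p parent/child
echo "some notes" > parent/notes.txt

# Move into the subdirectory
cd parent/child

# Copy from the parent directory (..) into the current directory (.)
cp ../notes.txt .

# Rename the copy, list everything including hidden entries, and print it
mv notes.txt copy.txt
ls -a
cat copy.txt

# Leave the sandbox before deleting it and everything inside it
cd /
rm -r "$sandbox"
```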

## Cluster Specific Commands

| Command | Function | Flags |
| --- | --- | --- |
| `show_job` | prints information about your jobs | |
| `show_cluster` | prints information about the cluster | |
| `user_info` | prints information about your account | |
| `squeue` | prints information about jobs in the queue | `-u <username>` to print information about a specific user's jobs |
| `sbatch <slurm_script_file>` | submits a job to the cluster | |
| `scontrol show job <JOBID>` | prints information about a specific job | |
| `scancel <JOBID>` | cancels a job | |

## M3 Specific Commands

| Command | Function |
| --- | --- |
| `module load <module_name>` | load a module |
| `module unload <module_name>` | unload a module |
| `module avail` | list available modules |
| `module list` | list loaded modules |
| `module spider <module_name>` | search for a module |
| `module help <module_name>` | get help for a module |
| `module show <module_name>` | show details about a module |
| `module purge` | unload all modules |
| `module swap <old_module> <new_module>` | swap one loaded module for another |
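
A typical sequence of module commands might look like the sketch below. The module name `gcc` is only a placeholder assumption; use `module avail` on M3 to see what is actually installed. The check at the top lets the script exit gracefully on machines where the `module` command does not exist:

```bash
#!/bin/bash

# Hypothetical module name, used only for illustration
MODULE_NAME="gcc"

if command -v module >/dev/null 2>&1; then
    # Start from a clean environment
    module purge
    # List the versions of the module that are installed
    module avail "$MODULE_NAME"
    # Load the default version
    module load "$MODULE_NAME"
    # Confirm what is now loaded
    module list
else
    echo "The module command is not available in this shell"
fi
```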
76 changes: 76 additions & 0 deletions src/chapter3/login.md
@@ -1 +1,77 @@
# Logging In

First, you will need to SSH into one of the cluster's login nodes. How you do this depends on your operating system.

## Windows

If you are using Windows, the best way to SSH into M3 is by using [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).

Once installed and opened, you will see a page like this:

![puTTY config page](./imgs/putty_start.png)

In the Host Name field, type in your M3 username followed by `@m3.massive.org.au` and press Enter or click the Open button.

If it is the first time you are accessing M3 from this client, you may see something like this:

![puTTY auth page](./imgs/putty_key_not_cached.png)

Just click Accept, and PuTTY will add the cluster's SSH host key to its cache.

## Mac / Linux

On macOS or Linux, an SSH client is available from the terminal, so just run the following in your shell, replacing `username` with your M3 username.

```bash
ssh username@m3.massive.org.au
```

You may get a warning about the server's identity, similar to the one in the image above. Just type `yes` to accept it and add the host key to your list of known hosts.

Everything from now on will be the same across whatever computer you are using to access the cluster.

The first thing to pop up will be a request for a password. Don't worry if you don't see your cursor moving while typing. This is just for security, and your password is still being recorded.

Once you have logged in, you will see a welcome message that looks like this:

```txt
+----------------------------------------------------------------------------+
| Welcome to MASSIVE M3 |
| |
| For assistance please contact help@massive.org.au or (03) 9902 4845 |
| The MASSIVE User Guide https://docs.massive.org.au |
+----------------------------------------------------------------------------+

- Useful Slurm Commands:
squeue
sbatch <slurm_script_file>
scontrol show job <JOBID>
scancel <JOBID>

- Massive User Scripts:
show_job
show_job <JOBID>
show_cluster
user_info

- Slurm Sample Scripts are Here:
/usr/local/hpcusr/latest/training/samples/slurm/

- We recommend using smux to compile and test code on compute nodes.
- How to use smux: https://docs.massive.org.au/M3/slurm/interactive-jobs.html

For more details, please see:
https://docs.massive.org.au/M3/slurm/slurm-overview.html
------------------------------------------------------------------------------

Use MASSIVE Helpdesk to request assistance with MASSIVE related computing
questions and problems. Email to help@massive.org.au and this will generate
a ticket for your issue.

------------------------------------------------------------------------------


[jasparm@m3-login2 ~]$
```

Once you are done and want to log out, just type `exit`. This will close the connection.
57 changes: 57 additions & 0 deletions src/chapter3/shared-fs.md
@@ -1 +1,58 @@
# M3's Shared Filesystem

When we talk about a shared filesystem, what we mean is that the filesystem that M3 uses allows multiple users or systems to access, manage, and share files and directories over a network, concurrently. It enables users to collaborate on projects, share resources, and maintain a unified file structure across different machines and platforms. In addition to this, it enables the many different compute nodes in M3 to access data from a single source which users also have access to, simplifying the process of running jobs on M3.

Very simply, the way it works is that the home, project and scratch directories are mounted on every node in the cluster, so they are accessible from any node.

From your point of view, M3's filesystem has three important parts.

## Home Directory

This is each user's personal directory, which only they have access to. This has a ~10GB allocation, and should store any hidden files, configuration files, or other files that you don't want to share with others. This is backed up nightly.

## Project Directory

This is the shared project directory, for all users in MDN to use. This has a ~1TB allocation, and should be used only for project specific files, scripts, and data. This is also backed up nightly, so in the case that you accidentally delete something important, it can be recovered.

## Scratch Directory

This is also shared with all users in MDN, and has a larger allocation (~3TB). You may use this for personal projects, but keep your usage low. In general, it is used for temporary files, larger datasets, and any files that you don't need to keep for a long time. This is not backed up, so if you delete something, it's gone forever.

## General Rules

- Keep data usage to a minimum. If you have a large amount of data, consider moving it to the scratch directory. If it is not necessary to keep it, consider deleting it.
- Keep your home directory clean.
- In general, it is good practice to make a directory for yourself in the shared scratch directory. Name it after your username or name to make it easily identifiable. This is where you should store your files for small projects or personal use.
- The project directory is not for personal use. Do not store files in the project directory that are not related to MDN. Use the scratch directory instead.
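
One simple way to keep an eye on how much space a directory is using is the standard `du` command. The sketch below wraps it in a small helper. The name `dir_usage` is made up for this example, and the demo at the end measures a throwaway directory rather than any real M3 path:

```bash
#!/bin/bash

# Print the total size of a directory in human-readable form.
# With no argument it reports on your home directory.
dir_usage() {
    du -sh "${1:-$HOME}" 2>/dev/null | cut -f1
}

# Example: measure a small throwaway directory
demo=$(mktemp -d)
echo "hello" > "$demo/file.txt"
echo "Usage of $demo: $(dir_usage "$demo")"
rm -r "$demo"
```

Running `dir_usage` on its own, or on your personal folder in the scratch directory, gives a quick check before and after a large job.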

## Copying files to and from M3

### Using scp

You can copy files to M3 using the `scp` command. This is a command line tool that is built into most Linux distributions, and an equivalent tool called `pscp` is available on Windows through [PuTTY](https://www.putty.org/).

#### Linux / Mac

To copy a file to M3, use the following command:

```bash
scp <file> <username>@m3.massive.org.au:<destination>
```

For example, if I wanted to copy a file called `test.txt` to my home directory on M3, I would use the following command:

```bash
scp test.txt jasparm@m3.massive.org.au:~
```

To copy a file from M3 to your local machine, use the following command:

```bash
scp <username>@m3.massive.org.au:<file> <destination>
```

So, to bring that same file back to my local machine, I would use the following command:

```bash
scp jasparm@m3.massive.org.au:~/test.txt .
```
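
The commands above copy single files. To copy a whole directory, `scp` needs the `-r` flag. The following is a sketch only: the helper name `upload_dir`, the `DRY_RUN` switch, and the `./results` directory are all assumptions made for illustration. With `DRY_RUN=1` the command is printed instead of being run, so you can check it before transferring anything:

```bash
#!/bin/bash

M3_HOST="m3.massive.org.au"

# Hypothetical helper: copy a local directory to M3.
# Usage: upload_dir <username> <local_dir> <remote_destination>
# Set DRY_RUN=1 to print the scp command instead of running it.
upload_dir() {
    local user="$1" src="$2" dest="$3"
    # -r tells scp to copy the directory and everything inside it
    ${DRY_RUN:+echo} scp -r "$src" "${user}@${M3_HOST}:${dest}"
}

# Preview the command for a placeholder user and directory
DRY_RUN=1 upload_dir username ./results '~/results'
```

Dropping `DRY_RUN=1` and replacing `username` with your own M3 username would perform the actual copy.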
69 changes: 69 additions & 0 deletions src/chapter3/slurm.md
@@ -1 +1,70 @@
# Job batching & SLURM

Launching and running jobs on M3 is controlled by [SLURM](https://slurm.schedmd.com/). You don't really need to know a lot about it in order to use it, so this section will take you through the basics of what you will need for what we are doing.

If you want a complete guide on SLURM in M3, you can find it [here](https://docs.massive.org.au/M3/slurm/slurm-overview.html).

## Submitting simple jobs

As we discussed in the previous section, we use bash scripts to run jobs on M3. We can submit these jobs using the `sbatch` command. For example, suppose we have a bash script called `hello.sh` that contains the following:

```bash
#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=1MB
#SBATCH --time=0-00:01:00
#SBATCH --job-name=hello
#SBATCH --partition=m3i
#SBATCH --mail-user=jmar0066@student.monash.edu
#SBATCH --mail-type=BEGIN,END,FAIL

echo "Hello World"
```

We can submit this job using the following command:

`sbatch hello.sh`

This will submit the job to the queue, and you will get an email when the job starts, finishes, or fails. You can also check the status of your job using the `squeue` command.
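
Putting submission and checking together, a small helper might look like the sketch below. The function name `submit_and_check` is made up for this example. It relies on two standard SLURM behaviours: `sbatch --parsable` prints only the job ID, and by default a job's output is written to `slurm-<jobid>.out` in the directory it was submitted from. The guard at the top means it simply reports an error on machines without SLURM:

```bash
#!/bin/bash

# Submit a script, then show where to look for its results.
# Usage: submit_and_check <slurm_script_file>
submit_and_check() {
    local script="$1" job_id

    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: run this on an M3 login node" >&2
        return 1
    fi

    # --parsable makes sbatch print the job ID
    # (followed by ";cluster" on multi-cluster setups, which we strip)
    job_id=$(sbatch --parsable "$script") || return 1
    job_id=${job_id%%;*}

    # List your own jobs in the queue
    squeue -u "$USER"

    echo "When the job finishes, run: cat slurm-${job_id}.out"
}
```

On a login node you would call it as `submit_and_check hello.sh`.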

## Options

You might have noticed the `#SBATCH` lines in the bash script. These are called options, and they tell SLURM how to run the job. The options we used in the example above are:

- `ntasks`: The number of tasks or processes to run.
- `mem`: The amount of memory to allocate to the job.
- `time`: The maximum amount of time the job can run for.
- `job-name`: The name of the job. Up to 15 characters.
- `partition`: The partition to run the job on.
- `mail-user`: The email address to send job status emails to.
- `mail-type`: The types of emails to send.

> Note: In the case of M3, a task is essentially the same as a process. This is **not** the same as a CPU core. You can have a task that uses one or multiple cores. You can also have multiple tasks comprising the same job, each using one or multiple cores. It can get quite confusing, so if you are unsure about what you need, just ask. There is also more information in the M3 docs.

There are a lot more options that you can use, and you can find a more complete list [here](https://docs.massive.org.au/M3/slurm/simple-batch-jobs.html).

In particular, if you want to run multithreaded or multiprocessing jobs, or you need a GPU, there are more options you need to configure.
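
As a rough sketch of what that can look like, the script below requests one task with four CPU cores and one GPU using the standard SLURM options `--cpus-per-task` and `--gres`. The job name and resource sizes are arbitrary, and no partition is set: GPU jobs on M3 may need a specific partition or GPU type, so treat this as an assumption-laden starting point and check the M3 docs for the exact values.

```bash
#!/bin/bash

# One task (process) with four CPU cores for its threads
#SBATCH --job-name=threads
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4GB
#SBATCH --time=0-00:10:00
# Request one GPU; remove this line if you do not need one
#SBATCH --gres=gpu:1

# SLURM sets SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is given.
# Fall back to 1 so the script also runs outside SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${OMP_NUM_THREADS} thread(s)"

# Show the allocated GPU if the NVIDIA tools are available on this node
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
else
    echo "nvidia-smi not found on this node"
fi
```

`OMP_NUM_THREADS` is read by OpenMP programs. Other multithreaded tools have their own ways of setting a thread count, but tying it to `SLURM_CPUS_PER_TASK` keeps the program from using more cores than the job was given.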

## Interactive jobs

Sometimes you might want to actually connect to the node that you are running your job on, in order to see what is happening or to set it up before running the job. You can do this using the `smux` command. Similar to regular batch jobs, you can set options when you start the interactive session. An example of this is:

`smux new-session --ntasks=1 --time=0-00:01:00 --partition=m3i --mem=4GB`

This will start an interactive session on a node with 1 task, a 1 minute time limit, and 4GB of memory. There are again other options available, and you can find a more complete explanation [here](https://docs.massive.org.au/M3/slurm/interactive-jobs.html).

### Connecting to interactive jobs

Typically, when you start an interactive job, it will not start immediately. Instead, it will be queued, and once it has started you will need to connect to it. You can do this by running `smux a`, which will connect you to the session. If you want to disconnect from the session but leave it running, press `Ctrl + b` followed by `d`. You can reconnect to it later using `smux a`. If you want to kill the session while you are connected to it, just run `exit`. Otherwise, from a login node, run `scancel <jobid>`. You can find the job ID using `show_job`.

## Checking the status of jobs, finding out job IDs, and killing jobs

A couple of useful commands for general housekeeping are:

- `squeue`: This will show you the status of all jobs currently queued or running on M3.
- `show_job`: This will show you the status of all jobs you have submitted.
- `squeue -u <username>`: This will show you the status of all queued or running jobs submitted by a particular user.
- `scancel <jobid>`: This will kill a job with a particular job id.
- `scancel -u <username>`: This will kill all jobs submitted by a particular user.
- `show_cluster`: This will show you the status of the cluster, including any nodes that are offline or in maintenance.
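
To find a job ID quickly, `squeue` can be asked for just the fields you need. The sketch below uses its standard `--noheader` and `--format` options, where `%i` is the job ID, `%j` the job name, and `%T` the job state. The function name `my_jobs` is made up for this example:

```bash
#!/bin/bash

# Print "<jobid> <name> <state>" for each of your jobs, without a header line
my_jobs() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found: run this on an M3 login node" >&2
        return 1
    fi
    squeue -u "$USER" --noheader --format="%i %j %T"
}
```

After running `my_jobs`, pass the ID from the first column to `scancel <jobid>` to kill that job.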