- ✓ AIP1 Isambard-AI Phase 1 supported
- ✓ AIP2 Isambard-AI Phase 2 supported
- ✓ I3 Isambard 3 supported
- ✗ BC5 BlueCrystal 5 unsupported
Packaging Datasets with SquashFS
Abstract
This tutorial introduces SquashFS as a method for packaging large datasets on Isambard supercomputers. SquashFS is a compressed, read-only filesystem that is particularly well-suited to storing datasets composed of many small files, which can otherwise cause performance problems. By the end of the tutorial, users will know how to create, inspect, and use SquashFS images in batch jobs and containers.
Prerequisites
Users will need to have accepted an invitation to a project on a BriCS service, and completed the setup tutorial. Familiarity with the Linux command line, conda and basic Slurm job submission (covered in the Installing and Running Software tutorial) is assumed.
Learning Objectives
By the end of this tutorial, users will understand:
- How using SquashFS can make their work run faster
- How to create a SquashFS image
- How to mount and use a SquashFS image, both with and without a container
- Why SquashFS is beneficial for working with datasets containing many files
Table of Contents
- Table of Contents
- Logging in via SSH
- Why SquashFS?
- Defining a simple example
- Setting up the conda environment
- Testing on a single file
- Creating a SquashFS file on Isambard
- Optional: Preparing a SquashFS image locally and uploading to Isambard
- Mounting SquashFS and Testing on a single SquashFS file
- Running on a compute node with and without SquashFS
- Mounting SquashFS in containers
Logging in via SSH
Use the Login Guide if needed to log in. The general format for logging in is:
ssh [PROJECT].[FACILITY].isambard
Why SquashFS?
Parallel filesystems such as those used on BriCS platforms are optimised for large sequential reads and writes. They can perform relatively poorly when working with very large numbers of small files, because each operation generates a separate metadata request to the filesystem servers. This can result in a loss of overall application performance.
This issue is especially relevant to machine learning datasets, which can consist of millions of individual image or text files. Moreover, storage quotas place a limit on the number of files as well as total disk space.
SquashFS addresses these issues by packing an entire directory tree into a single compressed archive file. The dataset is then accessed by mounting this single file, dramatically reducing the metadata load on the filesystem and potentially increasing overall application performance.
SquashFS is also read-only, which suits datasets that are written once and read many times and reduces the risks of accidental editing of the data.
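The effect of per-file overhead can be illustrated with a short, stdlib-only Python sketch that uses a zip archive as a stand-in for a packed single-file dataset. On a local disk the gap may be small; on a busy parallel filesystem, where every open is a metadata round trip to the storage servers, it is much larger.

```python
import os
import tempfile
import time
import zipfile

# Create 1,000 tiny files in a temporary directory, plus one zip
# archive holding the same data (a stand-in for a SquashFS image).
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "data.zip")
payload = b"x" * 100
with zipfile.ZipFile(archive, "w") as zf:
    for i in range(1000):
        name = f"file_{i}.txt"
        with open(os.path.join(tmpdir, name), "wb") as f:
            f.write(payload)
        zf.writestr(name, payload)

# Read every file individually: one open/read/close per file.
start = time.perf_counter()
for i in range(1000):
    with open(os.path.join(tmpdir, f"file_{i}.txt"), "rb") as f:
        f.read()
individual = time.perf_counter() - start

# Read the same data through a single archive handle.
start = time.perf_counter()
with zipfile.ZipFile(archive) as zf:
    for name in zf.namelist():
        zf.read(name)
packed = time.perf_counter() - start

print(f"individual files: {individual:.4f}s, single archive: {packed:.4f}s")
```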
Use the Storage spaces information page to review the storage locations available to you and their intended use before starting the exercises.
Defining a simple example
For this tutorial, we will use the Street View House Numbers (SVHN) dataset as a small but illustrative example of a many-file dataset.
It consists of around 33,000 .png files, each a few kilobytes in size, with the overall dataset using 587 MB.
We will perform a simple resize on each file to create a toy workload that rapidly reads through the dataset.
We will use the Pillow Python module installed via a conda environment, and compare performance with and without using SquashFS.
Setting up the conda environment
Follow the early steps of the Python guide to install Miniforge.
Now we will activate conda. When activated, it starts in the base environment by default:
user.project@login44:~> source ~/miniforge3/bin/activate
(base) user.project@login44:~>
Now we will create a new environment for this tutorial and install Pillow:
(base) user.project@login44:~> conda create -n squashfs_tut python=3.14.0
(base) user.project@login44:~> conda activate squashfs_tut
(squashfs_tut) user.project@login44:~> conda install Pillow
These steps will take a few moments while the packages are downloaded and installed.
You can verify the environment is set up correctly by running the following command, which returns no output if successful:
(squashfs_tut) user.project@login44:~> python -c "import PIL"
Testing on a single file
The dataset has been downloaded and extracted to /projects/brics/public/svhn_dataset/train.
Warning
Please do not create your own copies of the dataset.
We can briefly demonstrate performing the resize on a single file interactively in Python; it completes almost instantly.
(squashfs_tut) user.project@login44:~> python
Python 3.14.0 | packaged by conda-forge | (main, Dec 2 2025, 19:50:15) [GCC 14.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> with Image.open("/projects/brics/public/svhn_dataset/train/20011.png") as img:
... print(img.size)
... resized = img.resize((800, 600))
... print(resized.size)
...
(74, 32)
(800, 600)
>>> exit()
(squashfs_tut) user.project@login44:~>
Creating a SquashFS file on Isambard
Tip
If you are uploading a dataset to Isambard for the first time, you can create the SquashFS image on your local machine before transferring it. See Preparing a SquashFS image locally and uploading to Isambard to skip ahead.
Now we will create a SquashFS image from these data. There are many options available; the example provided here has been chosen to be simple and to minimise the CPU overhead of decompression. We will create the image on the same type of storage as the original data to allow a fair comparison.
There are two key commands when creating and using SquashFS images:
- mksquashfs.static to create the image
- squashfuse to mount the image in your user space so it can be read like a normal directory
For large datasets, the one-off creation of the image can be resource intensive, so we will use a compute node.
Note
For Isambard 3, users should swap --gpus=1 for --nodes=1 due to the differing hardware on each platform.
user.project@login40:~> srun --gpus=1 --time=00:15:00 --pty /bin/bash --login
srun: job 1566942 queued and waiting for resources
srun: job 1566942 has been allocated resources
user.project@login40:~> SOURCE_DIR=/projects/brics/public/svhn_dataset/train
user.project@login40:~> SQUASHFS=$PROJECTDIR/svhn_tutorial.squashfs
user.project@login40:~> mksquashfs.static $SOURCE_DIR $SQUASHFS -no-compression -no-xattrs -processors 72
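For reference, the flags used above can be captured in a small helper (hypothetical, for illustration only) that assembles the mksquashfs.static command line, making it easy to reuse with other datasets:

```python
def mksquashfs_command(source_dir, image_path, processors=72,
                       compression=False, xattrs=False):
    """Build the argv list for mksquashfs.static with the options
    used in this tutorial (no compression, no extended attributes)."""
    cmd = ["mksquashfs.static", source_dir, image_path]
    if not compression:
        cmd.append("-no-compression")
    if not xattrs:
        cmd.append("-no-xattrs")
    cmd += ["-processors", str(processors)]
    return cmd

# The exact command used earlier in this section:
print(" ".join(mksquashfs_command(
    "/projects/brics/public/svhn_dataset/train",
    "svhn_tutorial.squashfs")))
```

Skipping compression avoids decompression overhead at read time, which suits already-compressed data such as .png files.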
To exit the interactive session on the compute node, use Ctrl+D or type logout once.
Repeating will log you out of the login node.
Optional: Preparing a SquashFS image locally and uploading to Isambard
If you are uploading a dataset for the first time you can create the SquashFS image on your local machine before transferring it. This avoids writing individual files to the parallel filesystem at all and is the preferred approach for new data.
If you'd like to try this with the tutorial dataset, download it to your computer and extract it. You can then create the SquashFS image on your computer using the instructions below and upload the correctly named image to Isambard.
macOS
Install squashfs-tools using either Homebrew or MacPorts:
brew install squashfs
sudo port install squashfs-tools
Then create the image:
mksquashfs /path/to/your/dataset mydata.squashfs -no-compression -no-xattrs
Windows (WSL)
The recommended approach on Windows is to use the Windows Subsystem for Linux (WSL).
With an Ubuntu WSL environment, install squashfs-tools:
sudo apt install squashfs-tools
Your Windows drives are accessible under /mnt/ in WSL, so a folder at C:\Users\username\dataset becomes /mnt/c/Users/username/dataset:
mksquashfs /mnt/c/Users/username/dataset mydata.squashfs -no-compression -no-xattrs
See the file transfer guide for instructions on how to transfer the resulting image to Isambard.
Mounting SquashFS and Testing on a single SquashFS file
Now we can adapt the above example to use this image. There are three simple changes:
- Mounting the SquashFS image
- Changing the path used in the Python code
- Unmounting the SquashFS image at the end
(squashfs_tut) user.project@login44:~> mkdir -p $SCRATCH/squashfs_tut
(squashfs_tut) user.project@login44:~> squashfuse $PROJECTDIR/svhn_tutorial.squashfs $SCRATCH/squashfs_tut
(squashfs_tut) user.project@login44:~> python
Python 3.14.0 | packaged by conda-forge | (main, Dec 2 2025, 19:50:15) [GCC 14.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from PIL import Image
>>> with Image.open(f"{os.environ['SCRATCH']}/squashfs_tut/20011.png") as img:
... print(img.size)
... resized = img.resize((800, 600))
... print(resized.size)
...
(74, 32)
(800, 600)
>>> exit()
(squashfs_tut) user.project@login44:~> fusermount -u $SCRATCH/squashfs_tut
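The mount, use, unmount pattern above can be wrapped in a small Python context manager (a sketch, not part of the tutorial's required steps; it assumes squashfuse and fusermount are on your PATH) so the image is unmounted even if the processing step raises an exception:

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def mounted_squashfs(image, mountpoint,
                     mount_cmd="squashfuse", umount_cmd="fusermount"):
    """Mount a SquashFS image at mountpoint for the duration of the
    with-block, unmounting again even if an exception is raised."""
    subprocess.run([mount_cmd, image, mountpoint], check=True)
    try:
        yield mountpoint
    finally:
        subprocess.run([umount_cmd, "-u", mountpoint], check=True)

# Usage on Isambard (sketch):
# with mounted_squashfs(f"{os.environ['PROJECTDIR']}/svhn_tutorial.squashfs",
#                       f"{os.environ['SCRATCH']}/squashfs_tut") as mnt:
#     ...  # read files under mnt
```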
Running on a compute node with and without SquashFS
Now we will run the resize on a large number of files on a compute node using both the original dataset and the SquashFS image we have created.
We will adapt the interactive Python example into a script that loops through the entire directory, using an environment variable to swap easily between the two locations.
Create resize_files.py in a convenient location:
import os
import pathlib

from PIL import Image

# Get the path to the .png files from an environment variable
pngdir = pathlib.Path(os.environ["PNG_FILE_DIR"])

# Loop through every .png file in the directory, resizing each image
for file in pngdir.glob("*.png"):
    with Image.open(file) as img:
        resized = img.resize((800, 600))
Now we can start an interactive session on a compute node and activate the conda environment:
user.project@login44:~> srun --gpus=1 --time=00:15:00 --pty /bin/bash --login
user.project@login44:~> source ~/miniforge3/bin/activate
(base) user.project@nid010581:~> conda activate squashfs_tut
We run this script on the normal files using perf stat to provide a basic measure of the runtime.
(squashfs_tut) user.project@nid010581:~> export PNG_FILE_DIR=/projects/brics/public/svhn_dataset/train
(squashfs_tut) user.project@nid010581:~> perf stat python resize_files.py
Performance counter stats for 'python resize_files.py':
[snip]
143.193323967 seconds time elapsed
88.432107000 seconds user
6.722603000 seconds sys
Then when mounting and using the SquashFS image:
(squashfs_tut) user.project@nid010581:~> mkdir -p $SCRATCH/squashfs_tut
(squashfs_tut) user.project@nid010581:~> squashfuse $PROJECTDIR/svhn_tutorial.squashfs $SCRATCH/squashfs_tut
(squashfs_tut) user.project@nid010581:~> export PNG_FILE_DIR=$SCRATCH/squashfs_tut
(squashfs_tut) user.project@nid010581:~> perf stat python resize_files.py
Performance counter stats for 'python resize_files.py':
[snip]
91.470816951 seconds time elapsed
87.091545000 seconds user
2.109857000 seconds sys
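If perf is not available, a stdlib-only alternative (a sketch, using a hypothetical timed helper) is to measure the work from inside Python with time.perf_counter:

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Trivial stand-in workload; on Isambard you would wrap the
# resize loop from resize_files.py instead.
_, elapsed = timed(sum, range(1_000_000))
print(f"elapsed: {elapsed:.3f}s")
```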
The job completes significantly faster when using SquashFS.
To exit the interactive session on the compute node, use Ctrl+D or type logout once.
Repeating will log you out of the login node.
Mounting SquashFS in containers
It is also possible to mount SquashFS images inside a container just like a normal directory.
To demonstrate this we use the lolcow container, which is covered in more detail in the Installing and Running Software tutorial.
Tip
Keeping your datasets separate from your containers is good practice.
If you don't already have a copy, create a fresh directory, then pull and build the container:
user.project@login41:~> mkdir sif-images
user.project@login41:~> cd sif-images/
user.project@login41:~/sif-images> ls
user.project@login41:~/sif-images> singularity build lolcow.sif docker://sylabsio/lolcow
INFO: Starting build...
INFO: Fetching OCI image...
25.9MiB / 25.9MiB [===============================================================================================================] 100 % 16.5 MiB/s 0s
43.2MiB / 43.2MiB [===============================================================================================================] 100 % 16.5 MiB/s 0s
INFO: Extracting OCI image...
INFO: Inserting Apptainer configuration...
INFO: Creating SIF file...
[============================================================================================================================================] 100 % 0s
INFO: Build complete: lolcow.sif
user.project@login40:~/sif-images> ls
lolcow.sif
Now we can start an interactive session and use the --bind flag to mount the image directly into /tmp. We can confirm this has worked by printing the number of files in the directory, piping the output of the directory listing to word count. As mentioned at the start of the tutorial, there are around 33,000 files:
user.project@login40:~/sif-images> singularity shell --bind $PROJECTDIR/svhn_tutorial.squashfs:/tmp:image-src=/ lolcow.sif
Apptainer> cowsay $(ls /tmp/ | wc -w)
_______
< 33404 >
-------
\ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
Apptainer>
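The --bind argument used above has the form SOURCE:DEST:image-src=PATH. A tiny helper (hypothetical, for illustration) makes the pieces explicit:

```python
def bind_spec(image, dest, image_src="/"):
    """Build a singularity/apptainer --bind specification that mounts a
    SquashFS image at dest inside the container, exposing the path
    image_src from within the image."""
    return f"{image}:{dest}:image-src={image_src}"

print(bind_spec("svhn_tutorial.squashfs", "/tmp"))
# → svhn_tutorial.squashfs:/tmp:image-src=/
```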
For further information on using containers, please see the containers guide.