Introducing Isambard-AI: Interactive Chatbot tutorial¶
Abstract
The workshop aims to present Isambard-AI, a new leadership-class supercomputer funded by the Department for Science, Innovation and Technology (DSIT) and UKRI to rapidly address the lack of national AI-capable supercomputing facilities for open research.
Isambard-AI is one of the new AI Research Resource (AIRR) sites. The full Isambard-AI system will support a significant step-change in AI-enabled research in the UK.
During the workshop the system, the service, best practices to run AI workloads and the mechanism to access resources will be presented.
Prerequisites
We welcome attendees from all domain backgrounds who have at least some experience with AI frameworks (such as PyTorch and TensorFlow), so that they can follow the demonstrations and understand how AI workflows map onto this new supercomputer. High Performance Computing knowledge is not required.
Learning Objectives
Attendees of this tutorial will leave with a better understanding of 3 major points:
- What is Isambard-AI and how to access it;
- How Isambard-AI works and how to run AI workflows; and
- A new perspective on what Isambard-AI can do to improve advancements in AI methods and scientific discovery.
Tutorial Contents¶
- Introducing Isambard-AI: Interactive Chatbot tutorial
- Tutorial Contents
- Tutorial
- Introduction
- Setting Up the Environment (10 min)
- Chat to Isambot (5 min)
- Measuring Token Output Rate (5 min)
- Hardware Monitoring (5 min)
- Mapping the model to a device (5 min)
- Break the Model (10 min)
- Running a Bigger Model (Bonus round)
- Closing the notebook
- Conclusion and Discussion
- Acknowledgements
Tutorial¶
Introduction¶
Welcome to the Isambard-AI Interactive Chatbot Tutorial. This tutorial will guide you through creating and interacting with a language model-powered chatbot using Isambard-AI, one of the UK’s leading AI-focused supercomputing resources.
Setting Up the Environment (10 min)¶
First, please go to https://apps.isambard.ac.uk/jupyter to sign in and launch your JupyterLab session. If you are going through this tutorial as part of a live workshop, please check whether you have been asked to set a reservation name.
The notebook for this session has been placed in your home directory and should be selectable in JupyterLab's file browser (displayed on the left side of the browser tab).
Open the notebook titled isambot-tutorial.ipynb.
There are two ways to go through this workshop.
- Open the notebook in the left-hand-side file browser (or download it using this link) and execute the cells one by one.
- Alternatively, create an empty notebook and copy and paste the content from this page into cells.
Chat to Isambot (5 min)¶
We will begin this tutorial by chatting to an LLM named Isambot. It is based on Microsoft's Phi-3-mini model, available through the Hugging Face (🤗) machine learning framework.
First, let's define MODEL_ID as a variable to hold the model's name. You can see more information about the model on the Hugging Face model card.
- Model Size (Mini, 3.8B): A compact yet powerful model with 3.8 billion parameters, balancing computational efficiency with strong performance in complex language tasks.
- Context Window (128k): Features a 128,000-token context window, allowing it to process and generate very long text sequences, ideal for tasks requiring extensive context.
- Instruct Version: Fine-tuned for following user instructions, making it especially suited for interactive applications like digital assistants and chatbots.
MODEL_ID="microsoft/Phi-3-mini-128k-instruct"
CACHE_DIR = "/projects/public/brics/cache" # Use pre-downloaded models
We then import the required modules and functions, set a seed for random generation in the torch (PyTorch) backend, and set the number of threads to 72, since each user has one Grace CPU (72 cores):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
torch.random.manual_seed(0)
torch.set_num_threads(72)
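If you prefer not to hard-code the thread count, the small sketch below (an optional addition, not from the original notebook) derives it from the cores actually available to your session. It assumes a Linux host where os.sched_getaffinity is available, falling back to os.cpu_count() elsewhere.
import os
import torch
# Prefer the CPUs actually assigned to this process (Linux-only call),
# falling back to the total core count if sched_getaffinity is unavailable.
try:
    available_cores = len(os.sched_getaffinity(0))
except AttributeError:
    available_cores = os.cpu_count()
print(f"Using {available_cores} threads")
torch.set_num_threads(available_cores)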
Next the model and tokenizer are defined:
- AutoModelForCausalLM.from_pretrained(MODEL_ID): loads the pre-trained causal language model.
- torch_dtype="auto": automatically selects the optimal data type for performance.
- trust_remote_code=True: allows execution of custom model code.
- tokenizer: responsible for converting raw text into tokens that the model can process, and for converting model output tokens back into text. The tokenizer is loaded using AutoTokenizer.from_pretrained(MODEL_ID) to ensure it matches the model.
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
cache_dir=CACHE_DIR,
torch_dtype="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
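As an optional sanity check (not part of the original notebook), you can count the loaded parameters and see which data type torch_dtype="auto" actually selected:
# Count parameters and report the dtype chosen for the weights
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e9:.1f}B")  # roughly 3.8B for Phi-3-mini
print(f"Model dtype: {next(model.parameters()).dtype}")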
We then create a workflow pipeline, pipe. We define the pipeline type as "text-generation" and pass the model and tokenizer as arguments.
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
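Before wiring the pipeline into an interactive loop, you can try a single one-off call. This short sketch (an optional addition) passes one user message in the same chat-message format that the isambot function below relies on, and prints the reply:
# A single, non-interactive call to the pipeline using a chat-style message list
test_chat = [{"role": "user", "content": "Say hello in one sentence."}]
test_output = pipe(test_chat, max_new_tokens=30, return_full_text=False)
print(test_output[0]["generated_text"])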
IsamBot uses a predefined pipeline to create an interactive chat experience. It starts with a system prompt that sets the assistant's role and behavior. The bot then enters a loop where it continuously accepts user input via the input() function. Each user input is added to the chat history, and the pipeline generates a response based on this history. The response is printed to the screen, and the conversation continues until the user types 'exit'.
The bot uses PIPELINE_KWARGS to control the generation, such as limiting the number of new tokens to keep the response concise.
Let's add the code for isambot:
def isambot(pipe):
    SYSTEM_PROMPT = "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."
    PIPELINE_KWARGS = {
        "max_new_tokens": 500,
        "return_full_text": False,
    }
    print("🎩 IsamBot chat")
    print("Type 'exit' to end the chat.")
    chat = [{"role": "system", "content": SYSTEM_PROMPT}]
    while True:
        # Get interactive user prompt and append to chat history
        user_input = input("?> ")
        if user_input.lower() == "exit":
            print("Exiting IsamBot chat. Goodbye!")
            break
        chat.append({"role": "user", "content": user_input})
        try:
            # Generate response
            response = pipe(chat, **PIPELINE_KWARGS)
            assert len(response) == 1, "Expected a single response item"
        except Exception as e:
            print(f"An error occurred: {e}")
            continue
        # Output response
        print(f"🎩> {response[0]['generated_text']}\n")
        # Append response to chat history
        chat.append({"role": "assistant", "content": response[0]["generated_text"]})
Acceptable Use
Please remember that by using Isambard-AI, you agreed to the Acceptable Use Policy. It’s crucial that you do not alter the SYSTEM_PROMPT or attempt to prompt the model with inappropriate or unsafe queries. The SYSTEM_PROMPT acts as a vital security measure to ensure the model behaves responsibly. Your cooperation helps maintain the integrity and safety of the system for all users.
Now run Isambot and have a chat! Start by entering "hello" into the text box.
isambot(pipe)
Exiting Isambot
To exit Isambot enter "exit" into the chat.
Measuring Token Output Rate (5 min)¶
We can quantify how fast our model generates text using the following function.
This function, measure_performance, evaluates the speed of the language model by measuring how many tokens it can generate per second.
It tokenizes a given prompt, times the model as it generates a specified number of new tokens, and calculates the rate of token generation.
import time

def measure_performance(prompt, model, tokenizer, device, max_new_tokens=50):
    # Tokenize input prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Start the timer
    start_time = time.time()
    # Generate output tokens
    outputs = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    # End the timer
    end_time = time.time()
    # Calculate the number of newly generated tokens (excluding the prompt tokens)
    num_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    # Calculate tokens per second
    elapsed_time = end_time - start_time
    tokens_per_second = num_tokens / elapsed_time
    print(f"Generated {num_tokens} tokens in {elapsed_time:.4f} seconds")
    print(f"Performance: {tokens_per_second:.2f} tokens per second")
    return tokens_per_second
Now we add an example prompt and measure the performance.
prompt = "This is a test to measure the model's performance in generating tokens."
tokens_per_second_cpu = measure_performance(prompt, model, tokenizer, device="cpu", max_new_tokens=50)
Hardware Monitoring (5 min)¶
When training and deploying machine learning models, it is important to check that you are getting the most out of your hardware.
CPU Monitoring¶
To begin, we need to monitor our CPU. Press the "+" icon next to your tab and choose "Terminal".
Run top as follows:
$ top -u $(whoami)
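If you would rather stay inside the notebook, and assuming the psutil package is available in the JupyterLab environment (it is not part of the original tutorial), a small sketch like this reports the notebook kernel's own CPU and memory usage:
import psutil
# Inspect the current (notebook kernel) process
proc = psutil.Process()
cpu_pct = proc.cpu_percent(interval=1.0)   # sample CPU usage over one second
rss_gb = proc.memory_info().rss / 1e9      # resident memory in GB
print(f"CPU: {cpu_pct:.1f}%  Memory: {rss_gb:.2f} GB")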
GPU Monitoring¶
Next, let's check the GPUs on our machine. There are two methods to do this:
1. nvidia-smi¶
First, you can run nvidia-smi in your terminal. This will show you the number of available GPUs, how much power they are consuming, and the memory utilisation on the GPU.
$ nvidia-smi --list-gpus
# Update the command every 1 second
$ watch -n 1 nvidia-smi
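You can also query the GPUs from Python using PyTorch's standard torch.cuda calls (an optional check, not from the original tutorial):
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU 0: {props.name}, {props.total_memory / 1e9:.0f} GB memory")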
2. nvdashboard¶
Finally, we can use a JupyterLab widget to visually inspect our GPU utilisation, memory, and data transfer speeds. The NVDashboard JupyterLab extension is pre-installed in the JupyterLab session. To start it, click on the third tab on the left-hand side with a GPU symbol.
We recommend you choose "GPU Memory" and "GPU Utilization".
Measure performance¶
Now let's measure the performance again and see how our hardware monitoring tools respond to the model generating a response:
tokens_per_second_cpu = measure_performance(prompt, model, tokenizer, device="cpu", max_new_tokens=50)
Finally, run isambot(pipe) and see how it affects the CPU and GPU.
Use top and nvidia-smi in the terminal, combined with NVDashboard in the JupyterLab web interface, to view the effect of prompting the model.
Is the model using the GPU?
It is always good practice to see how much of the GPU's available memory and performance the model is currently using.
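One quick way to answer this from the notebook (an optional sketch, not part of the original tutorial) is to check which device the model's weights currently live on:
# All of this model's parameters sit on a single device, so checking one is enough
print(f"Model weights are on: {next(model.parameters()).device}")  # 'cpu' here, until device_map is set below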
Mapping the model to a device (5 min)¶
Redefine the pipeline to run on a GPU using the device_map argument.
DEVICE_MAP = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
cache_dir=CACHE_DIR,
device_map=DEVICE_MAP,
torch_dtype="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
Keep an eye on your GPU!
If you are using nvidia-smi, have a look at the Pwr:Usage/Cap and Memory-Usage columns.
- Is the GPU using any power when idle?
- Is the GPU using memory when it is idle?
- How do these change when the model is generating a response?
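To help answer these questions from inside the notebook, here is a small sketch (an addition, assuming the model has been loaded onto the GPU with device_map as above) that prints the GPU memory PyTorch is holding while the model is idle; run it again during a generation to compare:
import torch
allocated_gb = torch.cuda.memory_allocated(0) / 1e9  # memory held by tensors (model weights, KV cache)
reserved_gb = torch.cuda.memory_reserved(0) / 1e9     # memory reserved by PyTorch's caching allocator
print(f"Allocated: {allocated_gb:.2f} GB  Reserved: {reserved_gb:.2f} GB")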
Break the Model (10 min)¶
Even though the GPUs on Isambard-AI are powerful, they are still limited. Let's try to break the model by feeding it lots of small prompts or one large prompt:
isambot(pipe)
Experiments
- Type "hello" into the model repeatedly and watch the memory and power usage. Do they change?
- Copy some code from above into the chat window.
- Go to a lorem ipsum generator and copy a large example text to stress the model and GPU.
Now let's test the model again after device mapping.
prompt = "This is a test to measure the model's performance in generating tokens."
tokens_per_second_gpu = measure_performance(prompt, model, tokenizer, device=DEVICE_MAP, max_new_tokens=50)
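If you still have the earlier CPU measurement in the same session, a one-line comparison (an optional addition; tokens_per_second_cpu comes from the earlier cells and tokens_per_second_gpu from the call above) shows the speedup gained by device mapping:
# Compare the GPU run with the earlier CPU-only measurement
print(f"GPU speedup over CPU: {tokens_per_second_gpu / tokens_per_second_cpu:.1f}x")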
Use top and nvidia-smi in the terminal, combined with NVDashboard in the JupyterLab web interface, to view the effect of prompting the model on CPU, GPU, and memory utilisation.
Running a Bigger Model (Bonus round)¶
Finally, redefine MODEL_ID to point at a bigger model. The previous model, Phi-3-mini, had approximately 4B parameters; this Llama3 model has 70B parameters. Find out more about this version of Llama3 on the Hugging Face model card. Rerun the previous cells to launch IsamBot with the Llama model.
Llama License
By using the Llama model you agree to the Meta Llama license.
MODEL_ID = "nvidia/Llama3-ChatQA-1.5-70B"
Question
- Does the bigger model use more memory or power?
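As a rough guide to the question above: 70 billion parameters in 16-bit precision amount to about 140 GB of weights alone, so the model may not fit in a single GPU's memory. If you hit out-of-memory errors, one option (an assumption, not part of the original instructions) is to let the accelerate integration shard the model across the available memory with device_map="auto":
# Reload the larger model, letting accelerate place layers across the available devices
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    cache_dir=CACHE_DIR,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)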
Closing the notebook¶
To close the notebook, go to the menu bar and select File > Hub Control Panel, and select "Stop My Server".
Conclusion and Discussion¶
In this tutorial, you’ve learned how to create and interact with a chatbot using the Isambard-AI supercomputer. We began by introducing the model, setting up the environment, and creating a simple pipeline to generate text. By measuring the token output rate, we gained insights into model performance, and hardware monitoring showed how CPU and GPU usage react to the chatbot. We explored mapping the model to a GPU to enhance performance, observing significant improvements in tokens per second. Finally, you experimented with larger models, understanding how model size affects computational resources.
Acknowledgements¶
Created: 19/08/2024. Authors: Wahab Kawafi, James Womack, Matt Williams, Karin Sevegnani (Nvidia), Paul Graham (Nvidia).