Accelerating Large Language Models: ONNX Inference and Using the KV Cache

大貓咪 · Filed under Python

2024-12-01 · CC BY-NC 4.0

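This article looks at how to run inference on large language models (LLMs) in ONNX format, explains how to use the KV Cache along the way, and ends with an experiment measuring how the KV Cache changes speed.

Most of you have probably run inference with an LLM before.

But have you tried running one in ONNX format?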

Introduction

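This post shows how to run inference with an ONNX model and, along the way, how to use the KV Cache.

Since I don't have an Nvidia GPU on hand, I use OpenVINO to run on an Intel integrated GPU.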

ONNX Runtime

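ONNX is an open format for storing models, including their trained weights.

Different training frameworks can save to the same format, so a model can be moved to another framework and used there.

The format defines many common operators, and an AI model is built by composing them.

The internal architecture of an ONNX model can be inspected with Netron.

ONNX Runtime, in turn, is a high-performance inference engine for deploying ONNX models to production.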

Packages

First, install the packages below. transformers is used to load the tokenizer.

onnxruntime-openvino and openvino are used to run the model.

Since our tensors are NumPy arrays, numpy is needed as well.

transformers
onnxruntime-openvino==1.19.0
openvino==2024.4.0
numpy

Model

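This time I chose the fairly recent Llama-3.2-1B-Instruct.

Because I'm running on an integrated GPU, I went with the 1B model.

Note that the model to download must be in ONNX format.

It can be downloaded from Hugging Face.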

Inference

Install OpenVINO by pip in Windows

Because I installed OpenVINO with pip on Windows, the following function is needed to add its library paths.

import platform

if platform.system() == "Windows":
    import onnxruntime.tools.add_openvino_win_libs as utils

    utils.add_openvino_libs_to_path()

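Next we load the tokenizer and the model:

The tokenizer is loaded with AutoTokenizer from transformers; just point it to the folder that contains the tokenizer files.
The model is loaded with onnxruntime.InferenceSession, which needs the path to a specific .onnx file.

We explicitly select OpenVINO (OpenVINOExecutionProvider) and tell it to use the integrated GPU (GPU); the CPU (CPUExecutionProvider) can be used instead.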

import os

import onnxruntime
from transformers import AutoTokenizer

MODEL_PATH = "Llama-3.2-1B-Instruct"

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# model
model = onnxruntime.InferenceSession(
    os.path.join(MODEL_PATH, "onnx/model.onnx"),
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "GPU", "precision": "FP32"}],
)
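On a machine without the OpenVINO build of ONNX Runtime, the provider named above may not be present. One possible way to handle that is to filter a preference list against what the runtime reports. The sketch below assumes a hypothetical helper, pick_providers, which is not part of ONNX Runtime; the real list would come from onnxruntime.get_available_providers().

```python
def pick_providers(
    available,
    preferred=("OpenVINOExecutionProvider", "CPUExecutionProvider"),
):
    """Return the preferred providers that are actually available, in order.

    `pick_providers` is a hypothetical helper, not an ONNX Runtime API.
    """
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(
            f"None of {list(preferred)} is available: {list(available)}"
        )
    return chosen


# With ONNX Runtime installed, `available` would come from
# onnxruntime.get_available_providers(); a hard-coded list is used here
# so the sketch runs on its own.
available = ["CPUExecutionProvider"]
print(pick_providers(available))
```

The returned list could then be passed as the providers argument of InferenceSession. Note that provider_options would have to be adjusted to match, since the options shown above are specific to OpenVINO.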

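Next we need some model metadata to work out the size of the KV Cache, so we read the model configuration with AutoConfig. The KV Cache size depends on the number of attention layers, the number of key/value heads, and the dimension of each head, so we compute each of them.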

from transformers import AutoConfig

config = AutoConfig.from_pretrained(MODEL_PATH)

NUM_ATTN_LAYER = config.num_hidden_layers
NUM_HEAD = config.num_key_value_heads
HEAD_DIM = config.hidden_size // config.num_attention_heads
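To get a feel for how these three numbers translate into memory, here is a rough back-of-the-envelope sketch. The function kv_cache_bytes and the concrete values passed to it are assumptions for illustration (roughly the scale of a 1B Llama-style model); the real values should be read from the config as shown above.

```python
def kv_cache_bytes(
    num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=4
):
    """Bytes needed to hold the keys and values of every layer.

    Each layer stores two tensors (key and value) of shape
    (batch_size, num_kv_heads, seq_len, head_dim).
    """
    per_tensor = batch_size * num_kv_heads * seq_len * head_dim
    return 2 * num_layers * per_tensor * bytes_per_elem


# Assumed values for illustration; read the real ones from AutoConfig.
size = kv_cache_bytes(
    num_layers=16,
    num_kv_heads=8,
    head_dim=64,
    seq_len=100,
    batch_size=2,
    bytes_per_elem=4,  # 4 bytes per element = float32
)
print(f"{size / 2**20:.1f} MiB")
```

The cache grows linearly with the sequence length and the batch size, which is why it matters more for long outputs and large batches.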

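Next we decide the prompt to feed the model. To keep things simple, the batch is just two copies of the same prompt, and the maximum number of output tokens is set to 100. Feel free to change it (the full listings below use 10).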

PROMPT = "Today is a good day"
BATCH = [PROMPT, PROMPT]
MAX_OUTPUT_LEN = 100

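With the parameters decided, we can prepare the model inputs:

First, the tokenizer converts the text prompt into token ids.
Next come the position ids, which label each token id with its index in the sentence. Since we always feed the whole sentence, a sequence from 0 to (number of tokens - 1) is enough.
For the attention mask, every value is 1 because there is no padding.
Finally, build the KV Cache inputs. Since the KV Cache is not used this time, the past sequence length dimension is simply 0.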

# token ids
input_data = tokenizer(BATCH, return_tensors="np")
input_ids = input_data["input_ids"].astype(np.int64)
batch_size, sequence_length = input_ids.shape

# positions ids
position_ids = np.tile(
    np.arange(sequence_length, dtype=np.int64), (batch_size, 1)
)

# attention mask
attention_mask = np.ones(input_ids.shape, dtype=np.int64)

# kv cache
past_key_value = np.zeros(
    (batch_size, NUM_HEAD, 0, HEAD_DIM),
    dtype=np.float32,
)

ort_input_data = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "position_ids": position_ids,
}
for k in range(NUM_ATTN_LAYER):
    ort_input_data[f"past_key_values.{k}.key"] = past_key_value
    ort_input_data[f"past_key_values.{k}.value"] = past_key_value

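The inputs can now be fed to the model. The logits are the first output; the remaining outputs are the KV Cache, which we ignore this time. Again for simplicity, we decode greedily and take the token id with the highest probability at the last position.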

# inference and get logit
outputs = model.run(None, ort_input_data)
logits = outputs[0]

# sample the new token (greedy)
next_token_id = np.argmax(logits[:, -1, :], axis=-1)
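The indexing in the greedy step is easier to follow on a small array. The sketch below uses made-up logits (not real model output) with a vocabulary of only 5 tokens to show which slice is taken and what shape comes out.

```python
import numpy as np

# Toy logits with shape (batch_size=2, sequence_length=3, vocab_size=5).
logits = np.array(
    [
        [
            [0.1, 0.2, 0.3, 0.1, 0.0],
            [0.0, 0.9, 0.1, 0.0, 0.0],
            [0.2, 0.1, 0.0, 1.5, 0.3],
        ],
        [
            [0.3, 0.1, 0.0, 0.0, 0.2],
            [0.1, 0.1, 0.8, 0.0, 0.0],
            [2.0, 0.1, 0.0, 0.5, 0.3],
        ],
    ],
    dtype=np.float32,
)

# Only the last position predicts the next token, so slice it out first.
last_step = logits[:, -1, :]  # shape (2, 5)
next_token_id = np.argmax(last_step, axis=-1)  # shape (2,)
print(next_token_id)  # one token id per batch row

# Add a trailing axis so it can be concatenated onto (batch, seq) arrays.
print(next_token_id[:, None].shape)
```

Taking the argmax of the raw logits gives the same token as taking it after a softmax, since softmax preserves ordering.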

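Finally, prepare the input for the next step: append the newly generated token id to the end of input_ids, and also record it in complete_sentence (initialized to the prompt's input_ids), which is decoded back into a sentence at the end.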

# generate next inference input
input_ids = np.concatenate((input_ids, next_token_id[:, None]), axis=-1)

# record full sentence
complete_sentence = np.concatenate(
    (complete_sentence, next_token_id[:, None]), axis=-1
)

Once all tokens have been generated, decode the sentences and print them.

for sentence in complete_sentence:
    print(f"Output: {tokenizer.decode(sentence)}")

Full Code:

from timeit import default_timer as timer
import platform
import os

import onnxruntime
import numpy as np
from transformers import AutoTokenizer, AutoConfig

if platform.system() == "Windows":
    import onnxruntime.tools.add_openvino_win_libs as utils

    utils.add_openvino_libs_to_path()

MODEL_PATH = "Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = onnxruntime.InferenceSession(
    os.path.join(MODEL_PATH, "onnx/model.onnx"),
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "GPU", "precision": "FP32"}],
)
config = AutoConfig.from_pretrained(MODEL_PATH)

PROMPT = "Today is a good day"
BATCH = [PROMPT, PROMPT]
MAX_OUTPUT_LEN = 10

NUM_ATTN_LAYER = config.num_hidden_layers
NUM_HEAD = config.num_key_value_heads
HEAD_DIM = config.hidden_size // config.num_attention_heads

# initial values
input_data = tokenizer(BATCH, return_tensors="np")
input_ids = input_data["input_ids"].astype(np.int64)

for prompt in BATCH:
    print(f"Input: {prompt}")

start_timestamp = timer()

complete_sentence = input_ids
for i in range(MAX_OUTPUT_LEN):

    # set input_ids, attention_mask, position_ids
    batch_size, sequence_length = input_ids.shape
    ort_input_data = {
        "input_ids": input_ids,
        "attention_mask": np.ones(input_ids.shape, dtype=np.int64),
        "position_ids": np.tile(
            np.arange(sequence_length, dtype=np.int64), (batch_size, 1)
        ),
    }

    # set kv cache values
    for k in range(NUM_ATTN_LAYER):
        past_key_value = np.zeros(
            (batch_size, NUM_HEAD, 0, HEAD_DIM),
            dtype=np.float32,
        )
        ort_input_data[f"past_key_values.{k}.key"] = past_key_value
        ort_input_data[f"past_key_values.{k}.value"] = past_key_value

    # inference and get logit
    outputs = model.run(None, ort_input_data)
    logits = outputs[0]

    # sample the new token (greedy)
    next_token_id = np.argmax(logits[:, -1, :], axis=-1)

    # generate next inference input
    input_ids = np.concatenate((input_ids, next_token_id[:, None]), axis=-1)

    # record full sentence
    complete_sentence = np.concatenate(
        (complete_sentence, next_token_id[:, None]), axis=-1
    )

end_timestamp = timer()

for sentence in complete_sentence:
    print(f"Output: {tokenizer.decode(sentence)}")

print(f"Elapsed time: {end_timestamp - start_timestamp} (s)")

Inference with KV Cache

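To use the KV Cache, we have to keep the key/value tensors the model returns on every step. After each run, we take them from outputs[1:] (the key and value of each layer, in order) and append them to the end of the matching entry in past_key_values.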

# store new kv cache
new_kv_cache_length = logits.shape[-2]
for k in range(NUM_ATTN_LAYER * 2):
    new_key_values = outputs[k + 1][:, :, -new_kv_cache_length:, :]
    past_key_values[k] = np.concatenate(
        (past_key_values[k], new_key_values), axis=2
    )
past_seq_length += new_kv_cache_length
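How the cache grows along the sequence axis can be seen without a model. The sketch below uses assumed toy sizes and dummy arrays in place of the model's key/value outputs: one prefill step with 6 tokens, then three single-token steps.

```python
import numpy as np

batch_size, num_kv_heads, head_dim = 2, 8, 64  # assumed toy sizes

# Empty cache: the sequence axis (axis 2) has length 0.
cache = np.zeros((batch_size, num_kv_heads, 0, head_dim), dtype=np.float32)

# Pretend the prompt had 6 tokens, then three single-token decode steps follow.
for new_tokens in (6, 1, 1, 1):
    # Dummy stand-in for the key (or value) tensor of the new tokens.
    new_kv = np.ones(
        (batch_size, num_kv_heads, new_tokens, head_dim), dtype=np.float32
    )
    cache = np.concatenate((cache, new_kv), axis=2)
    print(cache.shape)
```

Only axis 2 changes between steps; the batch, head, and head-dimension axes stay fixed. Because np.concatenate copies the whole array each time, a preallocated buffer may be worth considering for long generations.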

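Because the KV Cache now holds the keys and values of the earlier tokens, there is no need to feed those tokens again: the next inference step takes only the token that was just generated. The next input changes accordingly. position_ids must hold the index of the new token in the sentence, so it can no longer start from 0, and the attention mask becomes a single 1, because the next input is a single token.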

# generate next inference input
input_ids = next_token_id[:, None].astype(np.int64)
position_ids = np.full((batch_size, 1), past_seq_length, dtype=np.int64)
attention_mask = np.ones(input_ids.shape, dtype=np.int64)
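Why feeding only the new token is enough can be checked on a toy example. The sketch below is not the Llama model: it is a single attention head with random weights, and every name in it is made up for illustration. It compares the output for the last token when keys and values are recomputed for the whole sequence against the output when only the new token is projected and appended to a cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attend_last(x):
    """Recompute K and V for every token; return the last token's output."""
    q = x[-1:] @ Wq  # (1, d): only the last token's query is needed
    k, v = x @ Wk, x @ Wv  # (seq, d): recomputed on every call
    return softmax(q @ k.T / np.sqrt(d)) @ v


def attend_cached(x_new, k_cache, v_cache):
    """Project only the new token and append its K/V to the cache."""
    k_cache = np.concatenate((k_cache, x_new @ Wk), axis=0)
    v_cache = np.concatenate((v_cache, x_new @ Wv), axis=0)
    out = softmax((x_new @ Wq) @ k_cache.T / np.sqrt(d)) @ v_cache
    return out, k_cache, v_cache


tokens = rng.standard_normal((5, d))  # stand-in embeddings of 5 tokens
k_cache = np.zeros((0, d))
v_cache = np.zeros((0, d))
for t in range(len(tokens)):
    full = attend_last(tokens[: t + 1])
    cached, k_cache, v_cache = attend_cached(
        tokens[t : t + 1], k_cache, v_cache
    )
    # Both paths give the same output for the newest token.
    assert np.allclose(full, cached)

print(k_cache.shape)
```

The two paths agree because, under causal attention, the keys and values of earlier tokens do not depend on later tokens, so they can be stored once and reused.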

Full Code:

from timeit import default_timer as timer
import platform
import os

import onnxruntime
import numpy as np
from transformers import AutoTokenizer, AutoConfig

if platform.system() == "Windows":
    import onnxruntime.tools.add_openvino_win_libs as utils

    utils.add_openvino_libs_to_path()

MODEL_PATH = "Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = onnxruntime.InferenceSession(
    os.path.join(MODEL_PATH, "onnx/model.onnx"),
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "GPU", "precision": "FP32"}],
)
config = AutoConfig.from_pretrained(MODEL_PATH)

PROMPT = "Today is a good day"
BATCH = [PROMPT, PROMPT]
MAX_OUTPUT_LEN = 10

NUM_ATTN_LAYER = config.num_hidden_layers
NUM_HEAD = config.num_key_value_heads
HEAD_DIM = config.hidden_size // config.num_attention_heads

# initial values
input_data = tokenizer(BATCH, return_tensors="np")
input_ids = input_data["input_ids"].astype(np.int64)
batch_size, sequence_length = input_ids.shape

position_ids = np.tile(
    np.arange(sequence_length, dtype=np.int64), (batch_size, 1)
)

past_seq_length = 0
past_key_values = [
    np.zeros(
        (batch_size, NUM_HEAD, past_seq_length, HEAD_DIM),
        dtype=np.float32,
    )
    for _ in range(NUM_ATTN_LAYER * 2)
]

for prompt in BATCH:
    print(f"Input: {prompt}")

start_timestamp = timer()

complete_sentence = input_ids
for i in range(MAX_OUTPUT_LEN):

    # set input_ids, attention_mask, position_ids
    ort_input_data = {
        "input_ids": input_ids,
        "attention_mask": np.ones(input_ids.shape, dtype=np.int64),
        "position_ids": position_ids,
    }

    # set kv cache values
    for k in range(NUM_ATTN_LAYER):
        ort_input_data[f"past_key_values.{k}.key"] = past_key_values[2 * k]
        ort_input_data[f"past_key_values.{k}.value"] = past_key_values[
            2 * k + 1
        ]

    # inference and get logit
    outputs = model.run(None, ort_input_data)
    logits = outputs[0]

    # store new kv cache
    new_kv_cache_length = logits.shape[-2]
    for k in range(NUM_ATTN_LAYER * 2):
        new_key_values = outputs[k + 1][:, :, -new_kv_cache_length:, :]
        past_key_values[k] = np.concatenate(
            (past_key_values[k], new_key_values), axis=2
        )
    past_seq_length += new_kv_cache_length

    # sample the new token (greedy)
    next_token_id = np.argmax(logits[:, -1, :], axis=-1)

    # generate next inference input
    input_ids = next_token_id[:, None].astype(np.int64)
    position_ids = np.full((batch_size, 1), past_seq_length, dtype=np.int64)

    # record full sentence
    complete_sentence = np.concatenate(
        (complete_sentence, next_token_id[:, None]), axis=-1
    )

end_timestamp = timer()

for sentence in complete_sentence:
    print(f"Output: {tokenizer.decode(sentence)}")

print(f"Elapsed time: {end_timestamp - start_timestamp} (s)")

Speed

Finally, let's test the speed, again using Llama 3.2 1B.

Parameter	Value
Model	Llama 3.2 1B
Format	ONNX
dtype	FP32
Device	CPU
Backend	OpenVINO
Max Output Token	10

As the results show, inference with the KV Cache is faster, but it also consumes more memory: a classic trade of space for time.
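A rough way to see where the saving comes from is to count how many token positions are fed to the model over one generation in each of the two loops above. This is a simplified sketch: tokens_processed is a made-up helper, the prompt length of 6 is an assumed value rather than a measured one, and the count is only a proxy for work, not a prediction of elapsed time.

```python
def tokens_processed(prompt_len, new_tokens, use_kv_cache):
    """Count the token positions fed to the model over a whole generation."""
    if use_kv_cache:
        # The full prompt once, then one token per remaining step.
        return prompt_len + (new_tokens - 1)
    # The input grows by one token each step and is re-fed in full.
    return sum(prompt_len + i for i in range(new_tokens))


# prompt_len=6 is an assumed token count, not a measured one.
print(tokens_processed(6, 10, use_kv_cache=False))
print(tokens_processed(6, 10, use_kv_cache=True))
```

Without the cache the count grows quadratically with the number of generated tokens, while with the cache it grows linearly, so the gap should widen as MAX_OUTPUT_LEN increases.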
