Llama.cpp

llama-cpp-python is a Python binding for llama.cpp. It supports several LLMs.

This document describes how to run llama-cpp-python within LangChain.

Installation

There are several options for installing the llama-cpp package:

CPU only
CPU + GPU (using one of several BLAS backends)
Metal GPU (macOS with an Apple Silicon chip)

CPU only installation

pip install llama-cpp-python

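Installation with OpenBLAS / cuBLAS / CLBlast

llama.cpp supports multiple BLAS backends for faster processing. Set the FORCE_CMAKE=1 environment variable to force the use of cmake, and install the pip package for the desired BLAS backend.

Example installation with the cuBLAS backend: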

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

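Important: if you have already installed the CPU-only version of the package, you need to reinstall it from scratch. Consider the following command: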

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

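Installation with Metal

llama.cpp supports Apple Silicon, with optimizations via ARM NEON and the Accelerate and Metal frameworks. Set the FORCE_CMAKE=1 environment variable to force the use of cmake, and install the pip package with Metal support.

Example installation with Metal support: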

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

Important: if you have already installed a CPU-only version of the package, reinstall it from scratch with the following command:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
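
The GPU-enabled installs above differ only in the CMAKE_ARGS flag. As a minimal sketch, the environment for either build could be assembled in Python before invoking pip. The helper name `build_install_env` and the backend labels ("cpu", "cublas", "metal") are illustrative assumptions; the flag values are the ones shown in this document.

```python
import os

# Flags copied from the install commands in this document; the dictionary
# keys ("cpu", "cublas", "metal") are illustrative labels, not official names.
BACKEND_CMAKE_ARGS = {
    "cpu": None,
    "cublas": "-DLLAMA_CUBLAS=on",
    "metal": "-DLLAMA_METAL=on",
}


def build_install_env(backend, base_env=None):
    """Return a copy of the environment with the variables for `backend` set."""
    if backend not in BACKEND_CMAKE_ARGS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    cmake_args = BACKEND_CMAKE_ARGS[backend]
    if cmake_args is not None:
        # A GPU backend needs both variables, as in the commands above.
        env["CMAKE_ARGS"] = cmake_args
        env["FORCE_CMAKE"] = "1"
    return env


# An empty base environment makes the added variables easy to see.
print(build_install_env("metal", base_env={}))
```

With `base_env` left as None, the returned mapping keeps the current environment and could be passed as `env=` to `subprocess.run` when calling pip.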

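Installation on Windows

The most reliable way to install the llama-cpp-python library is to compile it from source. You can follow most of the instructions in the repository itself, but some Windows-specific instructions may be useful.

Requirements for installing llama-cpp-python:

git
python
cmake
Visual Studio Community (make sure you install it with the following settings)
- Desktop development with C++
- Python development
- Linux embedded development with C++

Clone the git repository recursively so that the llama.cpp submodule is fetched as well: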

git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git

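Open a command prompt (or an Anaconda prompt if you have Anaconda installed) and set the environment variables for the installation. If you do not have a GPU, set both variables as follows.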

set FORCE_CMAKE=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=OFF

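If you have an NVIDIA GPU, you can skip the second variable, which only switches cuBLAS off; to build with GPU support, set CMAKE_ARGS=-DLLAMA_CUBLAS=ON instead, matching the cuBLAS installation above.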

Compiling and installing

In the same command prompt (Anaconda prompt) where you set the variables, cd into the llama-cpp-python directory and run the following commands.

python setup.py clean
python setup.py install
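
After the build finishes, a quick check is whether Python can locate the package. The sketch below uses only the standard library; `is_llama_cpp_installed` is a hypothetical helper name, and `llama_cpp` is the module name that llama-cpp-python installs.

```python
import importlib.util


def is_llama_cpp_installed():
    """Return True if the `llama_cpp` module can be found by the import system."""
    return importlib.util.find_spec("llama_cpp") is not None


if is_llama_cpp_installed():
    print("llama_cpp is importable")
else:
    print("llama_cpp was not found; the install may have failed")
```

This only confirms that the module is discoverable; it does not load a model or verify which backend was compiled in.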

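Usage

Make sure you have followed all the instructions to install the necessary model files.

You do not need an API_TOKEN, because you will run the LLM locally.

It is worth understanding which models are suitable for the machine you intend to use.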
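
One rough way to judge whether a model suits a machine is to compare the size of its weights with the available memory. The sketch below is back-of-the-envelope arithmetic only: the function name is hypothetical, and the figure of about 4.5 bits per weight for q4_0 quantization is an assumption (4-bit weights plus a per-block scale). It ignores the KV cache and runtime overhead, so real usage will be higher.

```python
def approx_weights_gb(n_params_billion, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed figures: 7 billion parameters, ~4.5 bits per weight for q4_0.
print(f"{approx_weights_gb(7, 4.5):.2f} GB")
```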

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

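Consider using a template that suits your model! Check the model's page on Hugging Face or a similar platform for the correct prompt template.

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""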

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# verbose must be set (verbose=True) for output to reach the callback manager
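
To see the text the model actually receives, the template can be filled by hand. This sketch uses plain `str.format` as a stand-in for what `PromptTemplate` is assumed to do with a single `{question}` variable; it does not call LangChain.

```python
template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

# str.format stands in for PromptTemplate.format here (an assumption for illustration).
filled = template.format(question=question)
print(filled)
```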

CPU

Example using a LLaMA 2 7B model

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama/llama-2-7b-ggml/llama-2-7b-chat.ggmlv3.q4_0.bin",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,
)
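
A wrong model_path only surfaces when the model is loaded. As a minimal sketch, the path can be checked first; `check_model_path` is a hypothetical helper, not part of LangChain or llama-cpp-python, and the file name used here is just the placeholder that appears elsewhere in this document.

```python
from pathlib import Path


def check_model_path(model_path):
    """Return the resolved path, or raise FileNotFoundError with a clear message."""
    path = Path(model_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"model file not found: {path}")
    return path.resolve()


try:
    check_model_path("./ggml-model-q4_0.bin")
except FileNotFoundError as err:
    # Report the problem before any expensive model loading is attempted.
    print(err)
```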

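# A separate name keeps the PromptTemplate stored in `prompt` available for the chains below.
rap_prompt = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm(rap_prompt)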

    
    Stephen Colbert:
    Yo, John, I heard you've been talkin' smack about me on your show.
    Let me tell you somethin', pal, I'm the king of late-night TV
    My satire is sharp as a razor, it cuts deeper than a knife
    While you're just a british bloke tryin' to be funny with your accent and your wit.
    John Oliver:
    Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.
    My show is the one that people actually watch and listen to, not just for the laughs but for the facts.
    While you're busy talkin' trash, I'm out here bringing the truth to light.
    Stephen Colbert:
    Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.
    You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.
    While I'm the one who's really makin' a difference, with my sat

    
    llama_print_timings:        load time =   358.60 ms
    llama_print_timings:      sample time =   172.55 ms /   256 runs   (    0.67 ms per token,  1483.59 tokens per second)
    llama_print_timings: prompt eval time =   613.36 ms /    16 tokens (   38.33 ms per token,    26.09 tokens per second)
    llama_print_timings:        eval time = 10151.17 ms /   255 runs   (   39.81 ms per token,    25.12 tokens per second)
    llama_print_timings:       total time = 11332.41 ms





    "\nStephen Colbert:\nYo, John, I heard you've been talkin' smack about me on your show.\nLet me tell you somethin', pal, I'm the king of late-night TV\nMy satire is sharp as a razor, it cuts deeper than a knife\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\nJohn Oliver:\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\nStephen Colbert:\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\nWhile I'm the one who's really makin' a difference, with my sat"

Example using a LLaMA v1 model

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

    
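    1. First, find out when Justin Bieber was born.
    2. We know that Justin Bieber was born on March 1, 1994.
    3. Next, we need to look up when the Super Bowl was played in that year.
    4. The Super Bowl was played on January 28, 1995.
    5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.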

    llama_print_timings:        load time =   434.15 ms
    llama_print_timings:      sample time =    41.81 ms /   121 runs   (    0.35 ms per token)
    llama_print_timings: prompt eval time =  2523.78 ms /    48 tokens (   52.58 ms per token)
    llama_print_timings:        eval time = 23971.57 ms /   121 runs   (  198.11 ms per token)
    llama_print_timings:       total time = 28945.95 ms

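    '\n\n1. First, find out when Justin Bieber was born.\n2. We know that Justin Bieber was born on March 1, 1994.\n3. Next, we need to look up when the Super Bowl was played in that year.\n4. The Super Bowl was played on January 28, 1995.\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'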
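
The timing lines in this log report milliseconds per token but not throughput. As a small sketch, they can be parsed to derive tokens per second; the regular expression and the helper name `tokens_per_second` are assumptions fitted to the log format shown here, not an interface provided by llama.cpp.

```python
import re

# Matches lines such as:
#   llama_print_timings:        eval time = 23971.57 ms /   121 runs
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[\w ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+)\s*(?:runs|tokens))?"
)


def tokens_per_second(log_text):
    """Map each timing name that has a token count to its tokens-per-second rate."""
    rates = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        if count is None:
            continue  # "load" and "total" lines carry no token count
        seconds = float(match.group("ms")) / 1000.0
        rates[match.group("name")] = int(count) / seconds
    return rates


log = """
llama_print_timings:        load time =   434.15 ms
llama_print_timings:      sample time =    41.81 ms /   121 runs   (    0.35 ms per token)
llama_print_timings: prompt eval time =  2523.78 ms /    48 tokens (   52.58 ms per token)
llama_print_timings:        eval time = 23971.57 ms /   121 runs   (  198.11 ms per token)
llama_print_timings:       total time = 28945.95 ms
"""

for name, rate in tokens_per_second(log).items():
    print(f"{name}: {rate:.2f} tokens/s")
```

The "eval" rate is the one to compare between the CPU and GPU runs, since it measures generation speed.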

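GPU

If the installation with a BLAS backend succeeded, you will see a BLAS = 1 indicator in the model properties.

The two most important parameters for use with a GPU are:

n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
n_batch - how many tokens are processed in parallel.

Setting these parameters correctly will dramatically improve evaluation speed (see the wrapper code for more details).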
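
This document's GPU example notes that n_batch should lie between 1 and n_ctx (the context size). A minimal sketch of enforcing that bound before constructing the LLM follows; `clamp_n_batch` is a hypothetical helper, and 512 for n_ctx is an assumed value used only for illustration.

```python
def clamp_n_batch(n_batch, n_ctx):
    """Clamp n_batch into the range [1, n_ctx]."""
    return max(1, min(n_batch, n_ctx))


# n_ctx=512 is an assumed context size used only for this illustration.
print(clamp_n_batch(2048, n_ctx=512))
print(clamp_n_batch(0, n_ctx=512))
```

Clamping silently changes the value; raising an error instead may be preferable when a misconfiguration should be noticed.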

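n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!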
llm = LlamaCpp(
    model_path="./ggml-model-q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

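     We are looking for an NFL team that won the Super Bowl in the year Justin Bieber (born March 1, 1994) was born.
    
    First, let's look up which years are closest to when Justin Bieber was born:
    
    * The year before he was born: 1993
    * The year of his birth: 1994
    * The year after he was born: 1995
    
    We want to know which NFL team won the Super Bowl in the year closest to when Justin Bieber was born. Therefore, we should look up the NFL team that won the Super Bowl in either 1993 or 1994.
    
    Now let's find out which NFL team won the Super Bowl in either of those years:
    
    * In 1993, the San Francisco 49ers won the Super Bowl against the Dallas Cowboys by a score of 20-16.
    * In 1994, the San Francisco 49ers won the Super Bowl again, this time against the San Diego Chargers by a score of 49-26.

    llama_print_timings:        load time =   238.10 ms
    llama_print_timings:      sample time =    84.23 ms /   256 runs   (    0.33 ms per token)
    llama_print_timings: prompt eval time =   238.04 ms /    49 tokens (    4.86 ms per token)
    llama_print_timings:        eval time = 10391.96 ms /   255 runs   (   40.75 ms per token)
    llama_print_timings:       total time = 15664.80 ms

    " We are looking for an NFL team that won the Super Bowl in the year Justin Bieber (born March 1, 1994) was born. \n\nFirst, let's look up which years are closest to when Justin Bieber was born:\n\n* The year before he was born: 1993\n* The year of his birth: 1994\n* The year after he was born: 1995\n\nWe want to know which NFL team won the Super Bowl in the year closest to when Justin Bieber was born. Therefore, we should look up the NFL team that won the Super Bowl in either 1993 or 1994.\n\nNow let's find out which NFL team won the Super Bowl in either of those years:\n\n* In 1993, the San Francisco 49ers won the Super Bowl against the Dallas Cowboys by a score of 20-16.\n* In 1994, the San Francisco 49ers won the Super Bowl again, this time against the San Diego Chargers by a score of 49-26.\n"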

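Metal

If the installation with Metal succeeded, you will see a NEON = 1 indicator in the model properties.

The three most important parameters are:

n_gpu_layers - determines how many layers of the model are offloaded to your Metal GPU; in most cases, setting it to 1 is enough for Metal.
n_batch - how many tokens are processed in parallel; the default is 8, and it can be set to a larger number.
f16_kv - for some reason, Metal only supports True; otherwise you will get an error such as Asserting on type 0 GGML_ASSERT: .../ggml-metal.m:706: false && "not implemented"

Setting these parameters correctly will dramatically improve evaluation speed (see the wrapper code for more details).

n_gpu_layers = 1  # Setting this to 1 is enough for Metal.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM on your Apple Silicon chip.

# Make sure the model path is correct for your system!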
llm = LlamaCpp(
    model_path="./ggml-model-q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # Must be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

The console log will show the following lines to indicate that Metal was enabled properly.

ggml_metal_init: allocating
ggml_metal_init: using MPS
...
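
If the console output has been captured as text, the presence of these lines can be checked programmatically. This is a sketch under that assumption: how the log is captured is left out, and `metal_enabled` is a hypothetical helper that only looks for the `ggml_metal_init:` prefix shown above.

```python
def metal_enabled(log_text):
    """Return True if the captured log contains a ggml_metal_init line."""
    return any(
        line.strip().startswith("ggml_metal_init:")
        for line in log_text.splitlines()
    )


sample_log = "ggml_metal_init: allocating\nggml_metal_init: using MPS\n"
print(metal_enabled(sample_log))
```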

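You can also check Activity Monitor and watch the GPU usage of the process; CPU usage drops dramatically once n_gpu_layers=1 is enabled.

The first call to the LLM may be slow because the model is compiled on the Metal GPU.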
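
To observe the warm-up effect, the first call can be timed against later ones. The sketch below is generic: `time_calls` is a hypothetical helper, and a stand-in function replaces the model so the example runs without one.

```python
import time


def time_calls(fn, prompt, repeats=3):
    """Call fn(prompt) `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(prompt)
        durations.append(time.perf_counter() - start)
    return durations


# A stand-in callable keeps the sketch runnable without a model;
# in practice `fn` would be the LlamaCpp object created earlier.
def fake_llm(prompt):
    return prompt.upper()


print(time_calls(fake_llm, "hello"))
```

With a real model, a noticeably larger first entry than the rest would be consistent with the one-time compilation described above.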
