Using local LLMs

The popularity of projects like PrivateGPT, llama.cpp, and GPT4All underscores the importance of running LLMs locally.

LangChain has integrations with many open source LLMs that can be run locally.

For example, here we show how to run GPT4All or Llama-v2 locally (e.g., on your laptop) using local embeddings and a local LLM.

Document Loading

First, install the packages needed for local embeddings and vector storage.

pip install gpt4all
pip install chromadb

Load and split an example document.

We'll use a blog post on agents as an example.

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

Next, the steps below will download the GPT4All embeddings locally (if you don't already have them).

from langchain.vectorstores import Chroma
from langchain.embeddings import GPT4AllEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())
    Found model file at  /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin

Test that similarity search is working with our local embeddings.

question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
len(docs)
    4
docs[0]
    Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log"})
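
Optionally, the Chroma index can be persisted to disk so the documents don't have to be re-embedded on every run. A minimal sketch, assuming the persist_directory argument and persist() method of this Chroma integration (the directory name is just an example):

# Build the index once and write it to a local directory (path is illustrative).
vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=GPT4AllEmbeddings(),
    persist_directory="./chroma_agent_blog",
)
vectorstore.persist()

# Later, reload the persisted index without re-embedding the documents.
vectorstore = Chroma(
    persist_directory="./chroma_agent_blog",
    embedding_function=GPT4AllEmbeddings(),
)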

Model

Llama-v2

Download a GGML-converted model (e.g., here).
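
If you prefer to script the download, here is a minimal sketch using huggingface_hub; the repo id and filename below are only illustrative assumptions, so substitute whichever GGML model you actually chose:

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Example repo id / filename; not prescribed by this guide.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGML",
    filename="llama-2-13b-chat.ggmlv3.q4_0.bin",
)
print(model_path)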

pip install llama-cpp-python

To enable use of the GPU on Apple Silicon, follow the steps here to use the Python binding with Metal support.

In particular, ensure that conda is using the correct virtual environment that you created (miniforge3).

E.g., for me:

conda activate /Users/rlm/miniforge3/envs/llama

With this confirmed:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

Set the model parameters as noted in the llama.cpp docs.

n_gpu_layers = 1  # Metal set to 1 is enough.
n_batch = 512  # Should be between 1 and n_ctx; consider the amount of RAM on your Apple Silicon chip.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)
    llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 8819.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x76add7460
ggml_metal_init: loaded kernel_mul 0x76add5090
ggml_metal_init: loaded kernel_mul_row 0x76addae00
ggml_metal_init: loaded kernel_scale 0x76adb2940
ggml_metal_init: loaded kernel_silu 0x76adb8610
ggml_metal_init: loaded kernel_relu 0x76addb700
ggml_metal_init: loaded kernel_gelu 0x76addc100
ggml_metal_init: loaded kernel_soft_max 0x76addcb80
ggml_metal_init: loaded kernel_diag_mask_inf 0x76addd600
ggml_metal_init: loaded kernel_get_rows_f16 0x295f16380
ggml_metal_init: loaded kernel_get_rows_q4_0 0x295f165e0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x295f16840
ggml_metal_init: loaded kernel_get_rows_q2_K 0x295f16aa0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x295f16d00
ggml_metal_init: loaded kernel_get_rows_q4_K 0x295f16f60
ggml_metal_init: loaded kernel_get_rows_q5_K 0x295f171c0
ggml_metal_init: loaded kernel_get_rows_q6_K 0x295f17420
ggml_metal_init: loaded kernel_rms_norm 0x295f17680
ggml_metal_init: loaded kernel_norm 0x295f178e0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x295f17b40
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x295f17da0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x295f18000
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x7962b9900
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x7962bf5f0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x7962bc630
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x142045960
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x7962ba2b0
ggml_metal_init: loaded kernel_rope 0x7962c35f0
ggml_metal_init: loaded kernel_alibi_f32 0x7962c30b0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x7962c15b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x7962beb10
ggml_metal_init: loaded kernel_cpy_f16_f16 0x7962bf060
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, (35852.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1026.00 MB, (36878.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, (38480.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 298.00 MB, (38778.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (39290.94 / 21845.34), warning: current allocated size is greater than the recommended max working set size

Note that these indicate that Metal was properly enabled:

ggml_metal_init: allocating
ggml_metal_init: using MPS

prompt = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm(prompt)
    Llama.generate: prefix-match hit



Setting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.

Stephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!
John Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!
The battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:
Stephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have


llama_print_timings: load time = 2201.54 ms
llama_print_timings: sample time = 182.54 ms / 256 runs ( 0.71 ms per token, 1402.41 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 8484.62 ms / 256 runs ( 33.14 ms per token, 30.17 tokens per second)
llama_print_timings: total time = 9000.62 ms





"\nSetting: The Late Show with Stephen Colbert. The studio audience is filled with fans of both comedians, and the energy is electric. The two comedians are seated at a table, ready to begin their epic rap battle.\n\nStephen Colbert: (smirking) Oh, you think you can take me down, John? You're just a Brit with a funny accent, and I'm the king of comedy!\nJohn Oliver: (grinning) Oh, you think you're tough, Stephen? You're just a has-been from South Carolina, and I'm the future of comedy!\nThe battle begins, with each comedian delivering clever rhymes and witty insults. Here are a few lines that might be included:\nStephen Colbert: (rapping) You may have a big brain, John, but you can't touch my charm / I've got the audience in stitches, while you're just a blemish on the screen / Your accent is so thick, it's like trying to hear a speech through a mouthful of marshmallows / You may have"

GPT4All

Similarly, we can use GPT4All.

Download the GPT4All model binary.

The Model Explorer on GPT4All is a great way to choose and download a model.

Then, specify the path that you downloaded to.

E.g., for me, the model lives here:

/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin

from langchain.llms import GPT4All

llm = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin",
    max_tokens=2048,
)
    Found model file at /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin


objc[47842]: Class GGMLMetalClass is implemented in both /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x29f48c208) and /Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x29f970208). One of the two will be used. Which one is undefined.
llama.cpp: using Metal
llama.cpp: loading model from /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/anaconda3/envs/lcn2/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x115fcbfb0
ggml_metal_init: loaded kernel_mul 0x115fcd4a0
ggml_metal_init: loaded kernel_mul_row 0x115fce850
ggml_metal_init: loaded kernel_scale 0x115fcd700
ggml_metal_init: loaded kernel_silu 0x115fcd960
ggml_metal_init: loaded kernel_relu 0x115fcfd50
ggml_metal_init: loaded kernel_gelu 0x115fd03c0
ggml_metal_init: loaded kernel_soft_max 0x115fcf640
ggml_metal_init: loaded kernel_diag_mask_inf 0x115fd07f0
ggml_metal_init: loaded kernel_get_rows_f16 0x1147b2450
ggml_metal_init: loaded kernel_get_rows_q4_0 0x11479d1d0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x1147ad1f0
ggml_metal_init: loaded kernel_get_rows_q2_k 0x1147aef50
ggml_metal_init: loaded kernel_get_rows_q3_k 0x1147af1b0
ggml_metal_init: loaded kernel_get_rows_q4_k 0x1147af410
ggml_metal_init: loaded kernel_get_rows_q5_k 0x1147affa0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x1147b0200
ggml_metal_init: loaded kernel_rms_norm 0x1147b0460
ggml_metal_init: loaded kernel_norm 0x1147bfc90
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x1147c0230
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x1147c0490
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1147c06f0
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x1147c0950
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x1147c0bb0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x1147c0e10
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x1147c1070
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x1147c13d0
ggml_metal_init: loaded kernel_rope 0x1147c1a00
ggml_metal_init: loaded kernel_alibi_f32 0x1147c2120
ggml_metal_init: loaded kernel_cpy_f32_f16 0x115fd1690
ggml_metal_init: loaded kernel_cpy_f32_f32 0x115fd1c60
ggml_metal_init: loaded kernel_cpy_f16_f16 0x115fd2d40
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.45 / 21845.34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024.00 MB, ( 8008.45 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 9610.45 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB, (10122.45 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (10634.45 / 21845.34)
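
As a quick sanity check before wiring the model into a chain, you can invoke it directly; the GPT4All wrapper is a standard LangChain LLM, so it is callable on a prompt string (the prompt here is only illustrative and the output will vary):

# Direct invocation of the local model, outside of any chain.
llm("Name three ways an autonomous agent could decompose a complex task.")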

LLMChain

Run an LLMChain (see here) by passing in the retrieved docs and a simple prompt.

It formats the prompt template using the input key values provided and passes the formatted string to GPT4All, Llama-V2, or another specified LLM.

In this case, the list of retrieved docs (docs) above is passed into {context}.

from langchain import PromptTemplate, LLMChain

# Prompt
prompt = PromptTemplate.from_template(
    "Summarize the main themes in these retrieved docs: {docs}"
)

# Chain
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Run
question = "What are the approaches to Task Decomposition?"
docs = vectorstore.similarity_search(question)
result = llm_chain(docs)

# Output
result["text"]
    Llama.generate: prefix-match hit



Based on the retrieved documents, the main themes are:
1. Task decomposition: breaking complex tasks down into smaller subtasks that can be handled by the LLM or other components of the agent system.
2. LLM as the core controller: using a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.
3. The potential of LLMs: LLMs have the potential to be powerful general problem solvers, not only generating well-written copy but also solving complex tasks and achieving human-like intelligence.
4. Challenges in long-term planning: planning over a long history and effectively exploring the solution space remain challenges, and are important limitations of current LLM-based autonomous agent systems.


llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 134.47 ms / 193 runs ( 0.70 ms per token, 1435.25 tokens per second)
llama_print_timings: prompt eval time = 39470.18 ms / 1055 tokens ( 37.41 ms per token, 26.73 tokens per second)
llama_print_timings: eval time = 8090.85 ms / 192 runs ( 42.14 ms per token, 23.73 tokens per second)
llama_print_timings: total time = 47943.12 ms





'\nBased on the retrieved documents, the main themes are:\n1. Task decomposition: breaking complex tasks down into smaller subtasks that can be handled by the LLM or other components of the agent system.\n2. LLM as the core controller: using a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\n3. The potential of LLMs: LLMs have the potential to be powerful general problem solvers, not only generating well-written copy but also solving complex tasks and achieving human-like intelligence.\n4. Challenges in long-term planning: planning over a long history and effectively exploring the solution space remain challenges, and are important limitations of current LLM-based autonomous agent systems.'

QA Chain

We can use a QA chain to handle our question above.

chain_type="stuff" (see here) means that all the docs will be added (stuffed) into the prompt.

from langchain.chains.question_answering import load_qa_chain

# Prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

# Chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT)

# Run
chain({"input_documents": docs, "question": question}, return_only_outputs=True)
    Llama.generate: prefix-match hit


 Hi there! There are three main approaches to task decomposition. One is to use an LLM with simple prompting, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?". Another approach is to use task-specific instructions, such as "Write a story outline" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!


llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 61.21 ms / 85 runs ( 0.72 ms per token, 1388.64 tokens per second)
llama_print_timings: prompt eval time = 8014.11 ms / 267 tokens ( 30.02 ms per token, 33.32 tokens per second)
llama_print_timings: eval time = 2908.17 ms / 84 runs ( 34.62 ms per token, 28.88 tokens per second)
llama_print_timings: total time = 11096.23 ms





{'output_text': ' Hi there! There are three main approaches to task decomposition. One is to use an LLM with simple prompting, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?". Another approach is to use task-specific instructions, such as "Write a story outline" for writing a novel. Finally, task decomposition can also be done with human inputs. Thanks for asking!'}

RetrievalQA

For an even simpler flow, use RetrievalQA.

This will use the default QA prompt (shown here) and will retrieve from the vector DB.

However, a prompt can still be passed in, if desired.

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
qa_chain({"query": question})
    Llama.generate: prefix-match hit



 The three approaches to task decomposition are using LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!


llama_print_timings: load time = 1191.88 ms
llama_print_timings: sample time = 22.78 ms / 31 runs ( 0.73 ms per token, 1360.66 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 1320.23 ms / 31 runs ( 42.59 ms per token, 23.48 tokens per second)
llama_print_timings: total time = 1387.70 ms





{'query': 'What are the approaches to Task Decomposition?',
 'result': ' \nThe three approaches to task decomposition are using LLMs with simple prompting, task-specific instructions, or human inputs. Thanks for asking!'}
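
If the default QA prompt is sufficient, chain_type_kwargs can simply be omitted; a minimal sketch (return_source_documents is optional and just returns the retrieved chunks alongside the answer):

# RetrievalQA with the built-in default prompt.
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)
result = qa_chain({"query": question})
result["result"]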