Split by character

This is the simplest method. It splits the text on a separator (by default "\n\n") and measures chunk length by the number of characters.

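  1. How the text is split: on a single separator (a character or character sequence)
  2. How the chunk size is measured: by number of characters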
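
The two points above can be illustrated without LangChain. The following is a minimal sketch under simplifying assumptions, not the library's implementation: `split_by_character` is a hypothetical helper, and it ignores chunk overlap and regex separators. It splits on the separator, then greedily merges neighbouring pieces while the merged length, measured by `length_function`, stays within `chunk_size`.

```python
def split_by_character(text, separator="\n\n", chunk_size=1000, length_function=len):
    """Hypothetical helper: split on `separator`, then greedily merge pieces."""
    # Step 1: split the text on the separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks, current = [], ""
    for piece in pieces:
        # Step 2: try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if not current or length_function(candidate) <= chunk_size:
            # The piece fits, or it starts a new chunk. A single piece longer
            # than chunk_size is kept whole: this sketch has no finer separator.
            current = candidate
        else:
            # The piece does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_by_character(sample, chunk_size=10))
```

Because the text is only ever cut at the separator, a paragraph longer than `chunk_size` comes out as one oversized chunk in this sketch; the size limit bounds how many pieces are merged, not the length of an individual piece.
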
# This is a long document that we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import CharacterTextSplitter
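text_splitter = CharacterTextSplitter(
    separator="\n\n",           # split on blank lines (paragraph breaks)
    chunk_size=1000,            # maximum chunk length, as measured by length_function
    chunk_overlap=200,          # target overlap between consecutive chunks
    length_function=len,        # measure length in characters
    is_separator_regex=False,   # treat the separator as a literal string, not a regex
)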

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

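Here is an example of passing metadata along with the documents. Note that the metadata is split along with the documents: each chunk carries the metadata of the document it came from.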

metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])
# split_text returns plain strings rather than Document objects.
print(text_splitter.split_text(state_of_the_union)[0])
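
To make the metadata behaviour concrete, here is a small self-contained sketch. It rests on assumptions: `Chunk` is a hypothetical stand-in for LangChain's `Document`, and `create_chunks` is an illustrative helper that only splits on the separator, without merging or overlap. It shows the idea that every chunk receives its own copy of the metadata belonging to its source text.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical stand-in for a document object: text plus metadata.
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas=None, separator="\n\n"):
    """Illustrative helper: split each text and attach its metadata to every chunk."""
    if metadatas is None:
        metadatas = [{} for _ in texts]
    chunks = []
    for text, metadata in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece:
                # Copy the metadata so that editing one chunk's metadata
                # does not affect its siblings.
                chunks.append(Chunk(piece, copy.deepcopy(metadata)))
    return chunks


docs = create_chunks(
    ["first paragraph\n\nsecond paragraph", "another text"],
    metadatas=[{"document": 1}, {"document": 2}],
)
for doc in docs:
    print(doc.metadata, repr(doc.page_content))
```

The first two chunks come from the first text and so both carry `{"document": 1}`; the third carries `{"document": 2}`.
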