Split by Token
Language models have a token limit, and you should not exceed it. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers; when counting the tokens in your text, use the same tokenizer that the language model uses.
tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
We can use it to estimate the number of tokens used. It will probably be more accurate for OpenAI models.
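For example, here is a minimal sketch of counting tokens directly with tiktoken before settling on a chunk size (the model name below is only an illustrative assumption):
import tiktoken

# Pick the encoding that matches the model you plan to call.
# "gpt-3.5-turbo" is only an example here.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # Encode the text and count the resulting tokens.
    return len(encoding.encode(text))

print(count_tokens("Madam Speaker, Madam Vice President, our First Lady and Second Gentleman."))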
- How the text is split: by character passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
#!pip install tiktoken
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
API Reference:
- CharacterTextSplitter from langchain.text_splitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
We can also load a tiktoken splitter directly.
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
API Reference:
- TokenTextSplitter from langchain.text_splitter
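As a quick sanity check (a sketch, not part of the original example), the chunks produced above should stay within the requested 10-token budget when measured with the splitter's encoding; this assumes TokenTextSplitter's default gpt2 encoding:
import tiktoken

# TokenTextSplitter uses the "gpt2" encoding unless configured otherwise.
encoding = tiktoken.get_encoding("gpt2")
# The largest chunk, measured in tokens, should not exceed chunk_size (10).
print(max(len(encoding.encode(chunk)) for chunk in texts))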
spaCy
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
Another alternative to using NLTK is to use the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
#!pip install spacy
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
API Reference:
- SpacyTextSplitter from langchain.text_splitter
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model that you would like to use.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
API Reference:
- SentenceTransformersTokenTextSplitter from langchain.text_splitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
lorem
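To target a specific sentence-transformer model, or to cap chunks below the model's window, the splitter can be configured explicitly. This is only a hedged sketch: it assumes the model_name and tokens_per_chunk constructor parameters, reuses text_to_split from above, and the model named here is merely an example.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Assumption: model_name and tokens_per_chunk are accepted by the constructor.
custom_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=128,
    chunk_overlap=0,
)

chunks = custom_splitter.split_text(text=text_to_split)
# Every chunk (plus start/stop tokens) should fit the configured token window.
print(max(custom_splitter.count_tokens(text=chunk) for chunk in chunks))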
NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# pip install nltk
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
API Reference:
- NLTKTextSplitter from langchain.text_splitter
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies.
Hugging Face tokenizer
Hugging Face has many tokenizers.
We use the Hugging Face tokenizer, GPT2TokenizerFast, to count the text length in tokens.
- How the text is split: by character passed in.
- How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
API Reference:
- CharacterTextSplitter from langchain.text_splitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
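Besides split_text, these splitters can also return Document objects, which is usually what downstream LangChain components expect. A brief sketch reusing the Hugging Face-backed splitter configured above (the metadata value is only an illustrative assumption):
# create_documents wraps each chunk in a Document and can attach metadata.
docs = text_splitter.create_documents(
    [state_of_the_union], metadatas=[{"source": "state_of_the_union.txt"}]
)
print(docs[0].metadata)
print(docs[0].page_content[:100])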