
Split by tokens

Language models have a token limit. You should not exceed that limit, so when you split your text into chunks it is a good idea to count the number of tokens. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the language model uses.
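
For instance, a minimal sketch of counting tokens with tiktoken before splitting (the model name below is only an illustration, not part of the original example):

import tiktoken

# Pick the encoding that matches the model you will actually call.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
num_tokens = len(encoding.encode("Tokens are not the same as characters."))
print(num_tokens)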

tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI.

We can use it to estimate the number of tokens used. It will probably be more accurate for OpenAI models.

  1. How the text is split: by character passed in
  2. How the chunk size is measured: by the tiktoken tokenizer
#!pip install tiktoken

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. 

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

We can also load a tiktoken splitter directly.

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
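
Note that TokenTextSplitter splits on raw token boundaries, so a chunk can start or end in the middle of a word. A hedged sketch of pointing the splitter at a specific tiktoken encoding (the encoding name is an assumption, not part of the original example):

text_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # assumed: pass the encoding that matches your model
    chunk_size=10,
    chunk_overlap=0,
)
texts = text_splitter.split_text(state_of_the_union)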

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

Another alternative to using NLTK is to use the spaCy tokenizer.

  1. How the text is split: by the spaCy tokenizer
  2. How the chunk size is measured: by number of characters
#!pip install spacy

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. 

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.



Last year COVID-19 kept us apart.

This year we are finally together again.



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.



With a duty to one another to the American people to the Constitution.



And with an unwavering resolve that freedom will always triumph over tyranny.



Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.



He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
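
Under the hood, SpacyTextSplitter loads a spaCy pipeline (en_core_web_sm by default), so that model needs to be installed first. A minimal sketch, assuming the pipeline keyword argument:

# Assumed setup: download the default spaCy model once, e.g.
#   python -m spacy download en_core_web_sm
text_splitter = SpacyTextSplitter(chunk_size=1000, pipeline="en_core_web_sm")
texts = text_splitter.split_text(state_of_the_union)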

SentenceTransformers

The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. Its default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model you would like to use.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

# the sentence-transformer tokenizer adds special start and stop tokens around every sequence
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])
lorem
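
If you embed with a different sentence-transformer, the splitter can be pointed at that model. A hedged sketch, assuming the model_name and tokens_per_chunk parameters (the model name below is only an example):

splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed example model
    tokens_per_chunk=256,
    chunk_overlap=0,
)
chunks = splitter.split_text(text=text_to_split)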

NLTK

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.

Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.

  1. How the text is split: by the NLTK tokenizer.
  2. How the chunk size is measured: by number of characters
# pip install nltk
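
NLTK's sentence tokenizer relies on the punkt data package, which usually has to be downloaded once; a hedged note in case your environment does not already ship it:

import nltk

# Download the punkt sentence tokenizer data if it is not already installed.
nltk.download("punkt")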

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. 

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies.

Hugging Face tokenizer

Hugging Face has many tokenizers.

We use the Hugging Face tokenizer, the GPT2TokenizerFast, to count the text length in tokens.

  1. How the text is split: by character passed in
  2. How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. 

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
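
Any tokenizer from the transformers library can be passed in the same way; a minimal sketch using AutoTokenizer (the model name below is only an illustration):

from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter

# "bert-base-uncased" is an assumed example; use the tokenizer of your target model.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    bert_tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)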