# Hugging Face Datasets

The Hugging Face Hub is home to over 5,000 datasets in more than 100 languages, covering a broad range of tasks in natural language processing, computer vision, and audio. These datasets can be used for tasks such as translation, automatic speech recognition, and image classification.

This notebook shows how to load Hugging Face Hub datasets into LangChain.

```python
from langchain.document_loaders import HuggingFaceDatasetLoader

dataset_name = "imdb"
page_content_column = "text"

loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
data = loader.load()
data[:15]
```

    [Document(page_content='I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is 
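Roughly speaking, the loader turns each dataset row into a document: the `page_content_column` becomes the text, and the remaining columns become metadata. A minimal sketch of that mapping, using plain dicts as stand-ins for `Document` objects and made-up sample rows (not actual `imdb` data):

```python
# Sketch: map dataset rows to document-like dicts, the way a dataset
# loader conceptually does. Rows and field names are illustrative.
def rows_to_documents(rows, page_content_column):
    docs = []
    for row in rows:
        # Everything except the content column is kept as metadata.
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append({"page_content": str(row[page_content_column]), "metadata": metadata})
    return docs

rows = [
    {"text": "A gripping film.", "label": 1},
    {"text": "Dull and slow.", "label": 0},
]
docs = rows_to_documents(rows, "text")
print(docs[0])
# → {'page_content': 'A gripping film.', 'metadata': {'label': 1}}
```

This is why, in the `imdb` example above, the review text ends up in `page_content` while fields like the label travel along as metadata.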

### Example

In this example, we use data from the dataset to answer a question.

```python
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders.hugging_face_dataset import HuggingFaceDatasetLoader

dataset_name = "tweet_eval"
page_content_column = "text"
name = "stance_climate"

loader = HuggingFaceDatasetLoader(dataset_name, page_content_column, name)
index = VectorstoreIndexCreator().from_loaders([loader])
```

    Found cached dataset tweet_eval

    0%|          | 0/3 [00:00<?, ?it/s]

    Using embedded DuckDB without persistence: data will be transient

```python
query = "What are the most used hashtags?"
result = index.query(query)
result
```

    'The most used hashtags in this context are #UKClimate2015, #Sustainability, #TakeDownTheFlag, #LoveWins, #CSOTA, #ClimateSummitoftheAmericas, #SM, and #SocialMedia.'