非结构化文件 (Unstructured File)

This notebook covers how to use Unstructured package to load files of many types. Unstructured currently supports loading of text files, powerpoints, html, pdfs, images, and more.

# # 安装包
pip install "unstructured[all-docs]"

# # 安装其他依赖项
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # 如果解析 xml / html 文档：
# !brew install libxml2
# !brew install libxslt

# import nltk
# nltk.download('punkt')

from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt")

docs = loader.load()

docs[0].page_content[:400]

    'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\n\nLast year COVID-19 kept us apart. This year we are finally together again.\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\n\nWith a duty to one another to the American people to the Constit'

保留元素 (Retain Elements)

在底层，Unstructured 为不同的文本块创建不同的 "元素"。默认情况下，我们将它们合并在一起，但您可以通过指定 mode="elements" 来轻松保持它们的分离。

loader = UnstructuredFileLoader(
    "./example_data/state_of_the_union.txt", mode="elements"
)

docs = loader.load()

docs[:5]

    [Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
     Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
     Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
     Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
     Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]

定义分区策略 (Define a Partitioning Strategy)

Unstructured document loader允许用户传入一个strategy参数，用于告诉unstructured如何对文档进行分区。目前支持的策略有"hi_res"（默认）和"fast"。高分辨率（hi res）的分区策略更准确，但处理时间较长。快速（fast）的分区策略可以更快地对文档进行分区，但会牺牲准确性。并非所有文档类型都有单独的高分辨率和快速分区策略。对于这些文档类型，strategy参数将被忽略。在某些情况下，如果缺少依赖项（例如文档分区的模型），高分辨率策略将回退到快速策略。您可以在下面看到如何将策略应用于UnstructuredFileLoader。

from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "layout-parser-paper-fast.pdf", strategy="fast", mode="elements"
)

docs = loader.load()

docs[:5]

    [Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
     Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
     Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
     Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
     Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]

PDF示例

处理PDF文档的方式与处理方式完全相同。Unstructured检测文件类型并提取相同类型的元素。操作模式有：

single：将所有元素的文本合并为一个（默认）
elements：保留各个元素的文本
paged：仅合并每个页面的文本

wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P "../../"

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf", mode="elements"
)

docs = loader.load()

docs[:5]

    [Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
     Document(page_content='Zejiang Shen 1 ( (ea)\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
     Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
     Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
     Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]

如果您需要在提取后对unstructured元素进行后处理，可以在实例化UnstructuredFileLoader时通过post_processors参数传入一个Element -> Element函数列表。这也适用于其他Unstructured加载器。以下是一个示例。只有在以"elements"模式运行加载器时才会应用后处理器。

from langchain.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()

docs[:5]

    [Document(page_content='LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'}),
     Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
     Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
     Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
     Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'})]

非结构化 API (Unstructured API)

如果您想要快速开始而不需要太多设置，您可以简单地运行 pip install unstructured 并使用 UnstructuredAPIFileLoader 或 UnstructuredAPIFileIOLoader。这将使用托管的非结构化 API 处理您的文档。您可以在这里生成一个免费的非结构化 API 密钥。一旦可用，非结构化文档页面将提供生成 API 密钥的说明。如果您想要自行托管非结构化 API 或在本地运行它，请查看这里的说明。

from langchain.document_loaders import UnstructuredAPIFileLoader

filenames = ["example_data/fake.docx", "example_data/fake-email.eml"]

loader = UnstructuredAPIFileLoader(
    file_path=filenames[0],
    api_key="FAKE_API_KEY",
)

docs = loader.load()
docs[0]

    Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})

您还可以通过非结构化 API 批量处理多个文件，只需使用 UnstructuredAPIFileLoader。

loader = UnstructuredAPIFileLoader(
    file_path=filenames,
    api_key="FAKE_API_KEY",
)

docs = loader.load()
docs[0]

    Document(page_content='Lorem ipsum dolor sit amet.\n\nThis is a test email to use for unit tests.\n\nImportant points:\n\nRoses are red\n\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})

非结构化文件 (Unstructured File)

保留元素 (Retain Elements)​

定义分区策略 (Define a Partitioning Strategy)​

PDF示例​

非结构化 API (Unstructured API)​

保留元素 (Retain Elements)

定义分区策略 (Define a Partitioning Strategy)

PDF示例

非结构化 API (Unstructured API)