Skip to main content

非结构化文件 (Unstructured File)

This notebook covers how to use Unstructured package to load files of many types. Unstructured currently supports loading of text files, powerpoints, html, pdfs, images, and more.

# # 安装包
pip install "unstructured[all-docs]"
# # 安装其他依赖项
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # 如果解析 xml / html 文档:
# !brew install libxml2
# !brew install libxslt
# import nltk
# nltk.download('punkt')
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt")
docs = loader.load()
docs[0].page_content[:400]
    'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\n\nLast year COVID-19 kept us apart. This year we are finally together again.\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\n\nWith a duty to one another to the American people to the Constit'

保留元素 (Retain Elements)

在底层,Unstructured 为不同的文本块创建不同的 "元素"。默认情况下,我们将它们合并在一起,但您可以通过指定 mode="elements" 来轻松保持它们的分离。

loader = UnstructuredFileLoader(
"./example_data/state_of_the_union.txt", mode="elements"
)
docs = loader.load()
docs[:5]
    [Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]

定义分区策略 (Define a Partitioning Strategy)

Unstructured document loader允许用户传入一个strategy参数,用于告诉unstructured如何对文档进行分区。目前支持的策略有"hi_res"(默认)和"fast"。高分辨率(hi res)的分区策略更准确,但处理时间较长。快速(fast)的分区策略可以更快地对文档进行分区,但会牺牲准确性。并非所有文档类型都有单独的高分辨率和快速分区策略。对于这些文档类型,strategy参数将被忽略。在某些情况下,如果缺少依赖项(例如文档分区的模型),高分辨率策略将回退到快速策略。您可以在下面看到如何将策略应用于UnstructuredFileLoader

from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader(
"layout-parser-paper-fast.pdf", strategy="fast", mode="elements"
)
docs = loader.load()
docs[:5]
    [Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]

PDF示例

处理PDF文档的方式与处理方式完全相同。Unstructured检测文件类型并提取相同类型的元素。操作模式有:

  • single:将所有元素的文本合并为一个(默认)
  • elements:保留各个元素的文本
  • paged:仅合并每个页面的文本
wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P "../../"
loader = UnstructuredFileLoader(
"./example_data/layout-parser-paper.pdf", mode="elements"
)
docs = loader.load()
docs[:5]
    [Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Zejiang Shen 1 ( (ea)\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]

如果您需要在提取后对unstructured元素进行后处理,可以在实例化UnstructuredFileLoader时通过post_processors参数传入一个Element -> Element函数列表。这也适用于其他Unstructured加载器。以下是一个示例。只有在以"elements"模式运行加载器时才会应用后处理器。

from langchain.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace
loader = UnstructuredFileLoader(
"./example_data/layout-parser-paper.pdf",
mode="elements",
post_processors=[clean_extra_whitespace],
)
docs = loader.load()
docs[:5]
    [Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'}),
Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'filename': 'layout-parser-paper.pdf', 'file_directory': './example_data', 'filetype': 'application/pdf', 'page_number': 1, 'category': 'Title'})]

非结构化 API (Unstructured API)

如果您想要快速开始而不需要太多设置,您可以简单地运行 pip install unstructured 并使用 UnstructuredAPIFileLoaderUnstructuredAPIFileIOLoader。这将使用托管的非结构化 API 处理您的文档。您可以在这里生成一个免费的非结构化 API 密钥。一旦可用,非结构化文档页面将提供生成 API 密钥的说明。如果您想要自行托管非结构化 API 或在本地运行它,请查看这里的说明。

from langchain.document_loaders import UnstructuredAPIFileLoader
filenames = ["example_data/fake.docx", "example_data/fake-email.eml"]
loader = UnstructuredAPIFileLoader(
file_path=filenames[0],
api_key="FAKE_API_KEY",
)
docs = loader.load()
docs[0]
    Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})

您还可以通过非结构化 API 批量处理多个文件,只需使用 UnstructuredAPIFileLoader

loader = UnstructuredAPIFileLoader(
file_path=filenames,
api_key="FAKE_API_KEY",
)
docs = loader.load()
docs[0]
    Document(page_content='Lorem ipsum dolor sit amet.\n\nThis is a test email to use for unit tests.\n\nImportant points:\n\nRoses are red\n\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})