
Doctran Extract Properties

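We can extract useful features from documents with the Doctran library, which uses OpenAI's function calling feature to extract specific metadata.

Extracting metadata from documents is helpful for a variety of tasks, including:

  • Classification: sorting documents into different categories
  • Data mining: extracting structured data that can be used for data analysis
  • Style transfer: changing the way text is written so that it more closely matches expected user input, which improves vector search results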
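
To make the function calling idea more concrete, here is a minimal sketch of how a list of property definitions could be turned into a function definition of the kind OpenAI's function calling accepts (a name, a description, and JSON Schema parameters). This is not Doctran's actual internal code: the helper `build_function_schema` and the function name `extract_properties` are hypothetical, and the sketch only assumes that each property is a dict with `name`, `type`, `description`, and an optional `required` flag.

```python
from typing import Any, Dict, List


def build_function_schema(properties: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Convert property dicts into a function definition with JSON Schema parameters."""
    schema_props: Dict[str, Any] = {}
    required: List[str] = []
    for prop in properties:
        # "name" becomes the key and "required" moves to a top-level list;
        # everything else (type, description, enum, items) is copied as-is.
        spec = {k: v for k, v in prop.items() if k not in ("name", "required")}
        schema_props[prop["name"]] = spec
        if prop.get("required"):
            required.append(prop["name"])
    return {
        "name": "extract_properties",  # hypothetical function name
        "description": "Extract structured properties from a document.",
        "parameters": {
            "type": "object",
            "properties": schema_props,
            "required": required,
        },
    }


example = build_function_schema(
    [
        {
            "name": "category",
            "description": "What type of email this is.",
            "type": "string",
            "enum": ["update", "other"],
            "required": True,
        },
        {
            "name": "summary",
            "description": "A one-sentence summary.",
            "type": "string",
            "required": False,
        },
    ]
)
print(example["parameters"]["required"])  # ['category']
```

The model is then asked to "call" this function, and its arguments come back as JSON that matches the schema, which is what makes the extracted metadata structured.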
pip install doctran
import json
from langchain.schema import Document
from langchain.document_transformers import DoctranPropertyExtractor
from dotenv import load_dotenv

load_dotenv()
    True

Input

This is the document we'll extract properties from.

sample_text = """[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Employee Benefits
Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Research and Development Projects
In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.

Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.

Thank you for your attention, and let's continue to work together to achieve our goals.

Best regards,

Jason Fan
Cofounder & CEO
Psychic
jason@psychic.dev
"""
print(sample_text)
    [Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Employee Benefits
Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Research and Development Projects
In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.

Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.

Thank you for your attention, and let's continue to work together to achieve our goals.

Best regards,

Jason Fan
Cofounder & CEO
Psychic
jason@psychic.dev

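documents = [Document(page_content=sample_text)]
properties = [
    {
        "name": "category",
        "description": "What type of email this is.",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {
        "name": "mentions",
        "description": "A list of all the people mentioned in this email.",
        "type": "array",
        "items": {
            "name": "full_name",
            "description": "The full name of the person mentioned.",
            "type": "string",
        },
        "required": True,
    },
    {
        "name": "eli5",
        "description": "Explain this email to me like I'm 5 years old.",
        "type": "string",
        "required": True,
    },
]
property_extractor = DoctranPropertyExtractor(properties=properties)

Output

After the properties are extracted from a document, the result is returned as a new document with the extracted properties stored in its metadata.

# Top-level await works in a notebook; in a plain script, run this inside an async function.
extracted_document = await property_extractor.atransform_documents(
    documents, properties=properties
)
print(json.dumps(extracted_document[0].metadata, indent=2))
    {
      "extracted_properties": {
        "category": "update",
        "mentions": [
          "John Doe",
          "Jane Smith",
          "Michael Johnson",
          "Sarah Thompson",
          "David Rodriguez",
          "Jason Fan"
        ],
        "eli5": "This is an email from the CEO, Jason Fan, in which he gives updates about different areas of the company. He talks about new security measures and praises John Doe for his work in enhancing our network security. He also mentions new hires and praises Jane Smith for her outstanding performance in customer service. The CEO reminds everyone about the upcoming open enrollment period for the benefits program. If you have any questions or need help, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com). He also talks about the marketing team's work and praises Sarah Thompson for her exceptional efforts in managing our social media platforms. Sarah successfully increased our follower count by 20% in the past month. Also, please note the upcoming product launch event on July 15th. We encourage all team members to attend and support this important milestone for our company. Finally, he talks about the R&D projects and praises David Rodriguez for his work. At the monthly R&D brainstorming session on July 10th, we hope everyone will share their ideas and suggestions for potential new projects."
      }
    }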
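
Because the values are produced by a language model, it can be worth checking the extracted metadata against the property definitions before relying on it downstream. The following is a minimal sketch of such a check in plain Python; `check_extracted` and `PYTHON_TYPES` are hypothetical helpers, not part of Doctran or LangChain, and the sample dicts are shortened stand-ins for the metadata shown above.

```python
from typing import Any, Dict, List

# JSON Schema type names mapped to the Python types that parsed JSON would contain.
PYTHON_TYPES = {"string": str, "array": list}


def check_extracted(
    properties: List[Dict[str, Any]], extracted: Dict[str, Any]
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for prop in properties:
        name = prop["name"]
        if name not in extracted:
            if prop.get("required"):
                problems.append(f"{name}: missing required property")
            continue
        value = extracted[name]
        expected = PYTHON_TYPES.get(prop.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name}: expected {prop['type']}")
            continue
        if "enum" in prop and value not in prop["enum"]:
            problems.append(f"{name}: {value!r} is not one of {prop['enum']}")
    return problems


sample_properties = [
    {"name": "category", "type": "string", "enum": ["update", "other"], "required": True},
    {"name": "mentions", "type": "array", "required": True},
]

good_problems = check_extracted(
    sample_properties, {"category": "update", "mentions": ["John Doe"]}
)
bad_problems = check_extracted(sample_properties, {"category": "memo"})

print(good_problems)
print(bad_problems)
```

With real output, the second argument would be `extracted_document[0].metadata["extracted_properties"]`.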