
Doctran Interrogate Documents

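Documents used in a vector store knowledge base are typically stored in a narrative or conversational format. However, most user queries are phrased as questions. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents and decrease the likelihood of retrieving irrelevant ones.

We can accomplish this using the Doctran library, which uses OpenAI's function calling feature to "interrogate" documents.

See this notebook for benchmarks of vector similarity scores for various queries, comparing raw documents against interrogated documents.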
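To make the intuition concrete, here is a minimal sketch that compares a question-style query against a narrative sentence and against its Q&A form. It assumes a toy bag-of-words cosine similarity in place of real embeddings, so the scores are only illustrative; the helper names `bag_of_words` and `cosine_similarity` are hypothetical and not part of Doctran or LangChain.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A user query phrased as a question.
query = "When is the product launch event?"

# The same fact in narrative form and in Q&A form.
narrative = (
    "Moreover, please mark your calendars for the upcoming "
    "product launch event on July 15th."
)
interrogated = (
    "When is the upcoming product launch event? "
    "The upcoming product launch event is on July 15th."
)

narrative_score = cosine_similarity(bag_of_words(query), bag_of_words(narrative))
qa_score = cosine_similarity(bag_of_words(query), bag_of_words(interrogated))

print(f"narrative: {narrative_score:.3f}")
print(f"q&a:       {qa_score:.3f}")
```

In this toy setup the Q&A form scores higher because it shares the question's own wording. Real embedding models behave differently from token overlap, so treat this as an illustration of the idea rather than a benchmark.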

pip install doctran

import json

from dotenv import load_dotenv
from langchain.document_transformers import DoctranQATransformer
from langchain.schema import Document

# Load environment variables (such as the OpenAI API key) from a local .env file.
load_dotenv()

Input

This is the document we'll interrogate:

sample_text = """[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Employee Benefits
Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Research and Development Projects
In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.

Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.

Thank you for your attention, and let's continue to work together to achieve our goals.

Best regards,

Jason Fan
Cofounder & CEO
Psychic
jason@psychic.dev
"""
print(sample_text)
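# Wrap the raw text in a LangChain Document.
documents = [Document(page_content=sample_text)]
qa_transformer = DoctranQATransformer()
# Top-level `await` works in a notebook; a plain script needs an event loop.
transformed_document = await qa_transformer.atransform_documents(documents)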
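The top-level `await` above runs as written in a notebook. In a regular Python script, the same call can be wrapped in a coroutine and driven with `asyncio.run`. The sketch below shows that pattern; `StubDocument` and `StubQATransformer` are hypothetical stand-ins introduced here so the snippet runs without an API key, and they only mimic the shape of the `atransform_documents` call.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StubDocument:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class StubQATransformer:
    """Hypothetical stand-in that mimics an async transformer interface."""

    async def atransform_documents(
        self, documents: List[StubDocument]
    ) -> List[StubDocument]:
        await asyncio.sleep(0)  # placeholder for the real network call
        return [
            StubDocument(doc.page_content, {"questions_and_answers": []})
            for doc in documents
        ]


async def main() -> List[StubDocument]:
    documents = [StubDocument(page_content="Subject: Updates and Discussions")]
    transformer = StubQATransformer()
    return await transformer.atransform_documents(documents)


# asyncio.run creates the event loop, runs main(), and closes the loop.
transformed = asyncio.run(main())
print(transformed[0].metadata)
```

With the real classes, the body of `main` would presumably build the `Document` list and call `DoctranQATransformer().atransform_documents(documents)` in the same way.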

Output

After interrogating a document, the result is returned as a new document with the questions and answers provided in its metadata.

print(json.dumps(transformed_document[0].metadata, indent=2))
{
  "questions_and_answers": [
    {
      "question": "What is the purpose of this document?",
      "answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."
    },
    {
      "question": "Who is responsible for enhancing the network security?",
      "answer": "John Doe from the IT department is responsible for enhancing the network security."
    },
    {
      "question": "Where should potential security risks or incidents be reported?",
      "answer": "Potential security risks or incidents should be reported to the dedicated team at security@example.com."
    },
    {
      "question": "Who has been recognized for outstanding performance in customer service?",
      "answer": "Jane Smith has been recognized for her outstanding performance in customer service."
    },
    {
      "question": "When is the open enrollment period for the employee benefits program?",
      "answer": "The document does not specify the exact dates for the open enrollment period for the employee benefits program, but it mentions that it is fast approaching."
    },
    {
      "question": "Who should be contacted for questions or assistance regarding the employee benefits program?",
      "answer": "For questions or assistance regarding the employee benefits program, the HR representative, Michael Johnson, should be contacted."
    },
    {
      "question": "Who has been acknowledged for managing the company's social media platforms?",
      "answer": "Sarah Thompson has been acknowledged for managing the company's social media platforms."
    },
    {
      "question": "When is the upcoming product launch event?",
      "answer": "The upcoming product launch event is on July 15th."
    },
    {
      "question": "Who has been recognized for their contributions to the development of the company's technology?",
      "answer": "David Rodriguez has been recognized for his contributions to the development of the company's technology."
    },
    {
      "question": "When is the monthly R&D brainstorming session?",
      "answer": "The monthly R&D brainstorming session is scheduled for July 10th."
    },
    {
      "question": "Who should be contacted for questions or concerns regarding the topics discussed in the document?",
      "answer": "For questions or concerns regarding the topics discussed in the document, Jason Fan, the Cofounder & CEO, should be contacted."
    }
  ]
}
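To use this output for retrieval, the question-and-answer pairs still need to be turned into text that can be embedded. A minimal sketch of one way to do that follows, assuming each pair becomes its own "Q: ... A: ..." string; the function `qa_pairs_to_texts` is a hypothetical helper, and the metadata here is a two-entry excerpt of the output above rather than a live API response.

```python
from typing import Any, Dict, List


def qa_pairs_to_texts(metadata: Dict[str, Any]) -> List[str]:
    """Turn each question/answer pair into one string ready for embedding."""
    pairs = metadata.get("questions_and_answers", [])
    return [f"Q: {pair['question']}\nA: {pair['answer']}" for pair in pairs]


# Excerpt of the metadata shown above, hard-coded for illustration.
example_metadata = {
    "questions_and_answers": [
        {
            "question": "When is the upcoming product launch event?",
            "answer": "The upcoming product launch event is on July 15th.",
        },
        {
            "question": "When is the monthly R&D brainstorming session?",
            "answer": "The monthly R&D brainstorming session is scheduled for July 10th.",
        },
    ]
}

texts = qa_pairs_to_texts(example_metadata)
for text in texts:
    print(text)
    print("---")
```

Each resulting string could then be embedded and stored alongside a reference to the source document. Whether to index one string per pair or one combined string per document is a design choice that depends on the retrieval setup.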