
URL

This section covers how to load HTML documents from a list of URLs into a document format that we can use downstream.

from langchain.document_loaders import UnstructuredURLLoader
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

Pass ssl_verify=False together with headers=headers to get past SSL verification errors.

loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
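
The note about ssl_verify can be sketched as follows. This is a minimal example, assuming the loader forwards extra keyword arguments to the underlying unstructured HTML partitioning, which may vary between versions. The helper names relaxed_ssl_kwargs and load_with_relaxed_ssl and the User-Agent value are hypothetical and not part of LangChain.

```python
def relaxed_ssl_kwargs(user_agent="Mozilla/5.0"):
    """Build the extra keyword arguments described above.

    The User-Agent value is an illustrative assumption; send whatever
    headers the target site expects.
    """
    headers = {"User-Agent": user_agent}
    return {"ssl_verify": False, "headers": headers}


def load_with_relaxed_ssl(urls):
    """Load URLs with SSL verification disabled (hypothetical helper)."""
    # Imported lazily so the kwargs helper works without langchain installed.
    from langchain.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=urls, **relaxed_ssl_kwargs())
    return loader.load()
```

Calling load_with_relaxed_ssl(urls) needs network access. Disabling verification removes protection against tampered responses, so it is best reserved for sites you trust.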

Selenium URL Loader

This section covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium allows us to load pages that require JavaScript to render.

Setup

To use the SeleniumURLLoader, you need to install the selenium and unstructured packages.

from langchain.document_loaders import SeleniumURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()
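
Pages that depend on JavaScript sometimes come back with little or no text, so it can help to check what was actually loaded. The following is a minimal sketch, assuming each returned document exposes a page_content string, and assuming the installed SeleniumURLLoader accepts browser and headless arguments. split_empty and load_rendered are hypothetical helper names.

```python
def split_empty(docs):
    """Separate documents that contain text from those that are blank."""
    loaded, empty = [], []
    for doc in docs:
        # Whitespace-only content is treated as empty.
        target = loaded if doc.page_content.strip() else empty
        target.append(doc)
    return loaded, empty


def load_rendered(urls, browser="chrome", headless=True):
    """Load JavaScript-rendered pages (hypothetical wrapper)."""
    # Lazy import: selenium and a matching browser driver must be installed.
    from langchain.document_loaders import SeleniumURLLoader

    loader = SeleniumURLLoader(urls=urls, browser=browser, headless=headless)
    return split_empty(loader.load())
```

For example, loaded, empty = load_rendered(urls) returns the usable documents first and the blank ones second, which makes it easy to retry or log the URLs that produced no text.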

Playwright URL Loader

This section covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader.

As in the Selenium case, Playwright allows us to load pages that require JavaScript to render.

Setup

To use the PlaywrightURLLoader, you need to install the playwright and unstructured packages. You also need to install the Playwright Chromium browser:

# Install playwright
pip install "playwright"
pip install "unstructured"
playwright install

from langchain.document_loaders import PlaywrightURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()
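
The remove_selectors argument lists CSS selectors whose matching elements are dropped before the text is extracted. The sketch below shows one way to combine the selectors from the example above with page-specific ones. merge_selectors and load_clean are hypothetical helper names, and any extra selectors you pass are your own assumptions about the target page.

```python
DEFAULT_SELECTORS = ["header", "footer"]


def merge_selectors(extra=()):
    """Combine default and extra selectors, keeping order and dropping repeats."""
    return list(dict.fromkeys([*DEFAULT_SELECTORS, *extra]))


def load_clean(urls, extra_selectors=()):
    """Load pages with common page chrome removed (hypothetical wrapper)."""
    # Lazy import: requires playwright, unstructured, and installed browsers.
    from langchain.document_loaders import PlaywrightURLLoader

    loader = PlaywrightURLLoader(
        urls=urls,
        remove_selectors=merge_selectors(extra_selectors),
    )
    return loader.load()
```

For instance, load_clean(urls, extra_selectors=["nav"]) would also strip navigation bars, assuming the pages mark them up with a nav element.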