Recursive URL Loader

We may want to load all URLs under a root directory.

For example, let's look at the Python 3.9 documentation.

It has many interesting child pages that we may want to read in bulk.

Of course, the WebBaseLoader can load a list of pages.

But the challenge is traversing the tree of child pages and actually assembling that list!

We do this using the RecursiveUrlLoader.

This also gives us the flexibility to exclude some children, customize the extractor, and more.

Parameters

  • url: str, the target URL to crawl.
  • exclude_dirs: Optional[str], webpage directories to exclude.
  • use_async: Optional[bool], whether to use async requests. Async requests are usually faster for large tasks, but they disable the lazy loading feature (the function still works, it is just not lazy). By default, it is set to False.
  • extractor: Optional[Callable[[str], str]], a function to extract the text of the document from the webpage. By default, it returns the page as is; tools like goose3 and beautifulsoup are recommended for extracting the text.
  • max_depth: Optional[int] = None, the maximum depth to crawl. By default, it is set to 2. If you need to crawl the whole website, set it to a number large enough to cover it.
  • timeout: Optional[int] = None, the timeout for each request, in seconds. By default, it is set to 10.
  • prevent_outside: Optional[bool] = None, whether to prevent crawling outside the root URL. By default, it is set to True.
from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader

Let's try a simple example.

from bs4 import BeautifulSoup as Soup

url = "https://docs.python.org/3.9/"
loader = RecursiveUrlLoader(url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text)
docs = loader.load()
docs[0].page_content[:50]
    '\n\n\n\n\nPython Frequently Asked Questions — Python 3.'
docs[-1].metadata
    {'source': 'https://docs.python.org/3.9/library/index.html',
     'title': 'The Python Standard Library — Python 3.9.17 documentation',
     'language': None}
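
The other parameters described above can be combined in the same call. As a rough sketch, assuming we want an async crawl that skips one subtree and bounds each request (the excluded directory and timeout are illustrative choices, and the expected type of exclude_dirs may differ by version, so check the signature of your installed loader):

loader = RecursiveUrlLoader(
    url="https://docs.python.org/3.9/",
    max_depth=2,
    use_async=True,  # faster on large crawls, but loading is no longer lazy
    exclude_dirs=["https://docs.python.org/3.9/whatsnew/"],  # illustrative subtree to skip
    timeout=5,  # per-request timeout in seconds
    prevent_outside=True,  # stay under the root URL
    extractor=lambda x: Soup(x, "html.parser").text,
)
docs = loader.load()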

However, since it is hard to perform a perfect filter, you may still see some irrelevant pages in the results. You can filter the returned documents yourself if needed; most of the time, they are good enough.
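
If you do need to filter, a minimal sketch that keeps only documents whose source URL starts with a given prefix (the prefix here is just an example):

# Keep only documents crawled from the /library/ subtree (example prefix).
library_prefix = "https://docs.python.org/3.9/library/"
library_docs = [doc for doc in docs if doc.metadata["source"].startswith(library_prefix)]
len(library_docs)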

Testing on LangChain docs.

url = "https://js.langchain.com/docs/modules/memory/integrations/"
loader = RecursiveUrlLoader(url=url)
docs = loader.load()
len(docs)
    8
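
To check which pages were actually crawled, you can list each document's source (a quick sanity check, not part of the original example):

# Inspect the URLs that were picked up under the memory integrations section.
sorted(doc.metadata["source"] for doc in docs)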