Arxiv

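arXiv is an open-access archive of 2 million scholarly articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

This notebook shows how to retrieve scientific articles from Arxiv.org and convert them into the Document format used downstream.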

Installation

First, you need to install the arxiv Python package.

#!pip install arxiv

ArxivRetriever has the following parameters:

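  • load_max_docs (optional): default is 100. Limits the number of downloaded documents. Downloading all 100 documents takes time, so use a smaller number for experiments. There is currently a hard limit of 300.
  • load_all_available_meta (optional): default is False. By default only the most important fields are downloaded: Published (the date the document was published or last updated), Title, Authors, and Summary. If set to True, the other fields are downloaded as well.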
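
The effect of load_all_available_meta can be pictured with a small stand-alone sketch. It does not call the arxiv package or LangChain, and it is not how ArxivRetriever is implemented. The function name select_metadata and the two lowercase field names in the sample entry are hypothetical; they only illustrate which fields would survive under each setting.

```python
# Hypothetical illustration only: not the library's actual implementation.
CORE_FIELDS = ("Published", "Title", "Authors", "Summary")


def select_metadata(entry: dict, load_all_available_meta: bool = False) -> dict:
    """Return only the core fields, or every field when the flag is True."""
    if load_all_available_meta:
        return dict(entry)
    return {key: entry[key] for key in CORE_FIELDS if key in entry}


# Sample entry; "entry_id" and "primary_category" are made-up field names.
entry = {
    "Published": "2016-05-26",
    "Title": "Heat-bath random walks with Markov bases",
    "Authors": "Caprice Stanley, Tobias Windisch",
    "Summary": "Graphs on lattice points are studied ...",
    "entry_id": "1605.08386v1",
    "primary_category": "math.CO",
}

print(sorted(select_metadata(entry)))
print(sorted(select_metadata(entry, load_all_available_meta=True)))
```

With the default setting only the four core keys remain; with the flag set to True the entry is passed through unchanged.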

get_relevant_documents() has one argument, query: free text used to find documents on Arxiv.org.
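
The query is free text, but the example in this notebook passes an arXiv identifier (1605.08386) rather than keywords. As a hedged sketch, the helper below checks whether a string looks like a new-style arXiv identifier, which may be useful when labelling or logging queries. The name looks_like_arxiv_id is hypothetical and is not part of LangChain or the arxiv package. It covers only the YYMM.NNNN and YYMM.NNNNN forms with an optional version suffix, not the older archive/YYMMNNN form.

```python
import re

# New-style arXiv identifiers: four digits, a dot, four or five digits,
# and an optional version suffix such as "v1".
_ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def looks_like_arxiv_id(query: str) -> bool:
    """Return True if the whole (stripped) query matches the identifier pattern."""
    return _ARXIV_ID.fullmatch(query.strip()) is not None


print(looks_like_arxiv_id("1605.08386"))
print(looks_like_arxiv_id("What is the ImageBind model?"))
```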

Examples

Running the retriever

from langchain.retrievers import ArxivRetriever
retriever = ArxivRetriever(load_max_docs=2)
docs = retriever.get_relevant_documents(query="1605.08386")
docs[0].metadata  # meta-information of the Document
    {'Published': '2016-05-26',
'Title': 'Heat-bath random walks with Markov bases',
'Authors': 'Caprice Stanley, Tobias Windisch',
'Summary': 'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\ndimension.'}
docs[0].page_content[:400]  # the content of the Document
    'arXiv:1605.08386v1  [math.CO]  26 May 2016\nHEAT-BATH RANDOM WALKS WITH MARKOV BASES\nCAPRICE STANLEY AND TOBIAS WINDISCH\nAbstract. Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on fibers of a\nfixed integer matrix can be bounded from above by a constant. We then study the mixing\nbehaviour of heat-b'
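
Because metadata is a plain dictionary, it can be post-processed with ordinary Python. The following is a minimal sketch that reuses the metadata values printed above (the Summary field is omitted for brevity); the helper name short_reference is hypothetical.

```python
def short_reference(metadata: dict) -> str:
    """Build a one-line reference from the Published, Authors and Title fields."""
    year = metadata["Published"][:4]  # 'YYYY-MM-DD' -> 'YYYY'
    return f'{metadata["Authors"]} ({year}). {metadata["Title"]}.'


metadata = {
    "Published": "2016-05-26",
    "Title": "Heat-bath random walks with Markov bases",
    "Authors": "Caprice Stanley, Tobias Windisch",
}

print(short_reference(metadata))
# Caprice Stanley, Tobias Windisch (2016). Heat-bath random walks with Markov bases.
```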

Question Answering

# get a token: https://platform.openai.com/account/api-keys

from getpass import getpass

OPENAI_API_KEY = getpass()
     ········
import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
questions = [
"What are Heat-bath random walks with Markov base?",
"What is the ImageBind model?",
"How does Compositional Reasoning with Large Language Models works?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
    -> **Question**: What are Heat-bath random walks with Markov base? 

**Answer**: I'm not sure, as I don't have enough context to provide a definitive answer. The term "Heat-bath random walks with Markov base" is not mentioned in the given text. Could you provide more information or context about where you encountered this term?

-> **Question**: What is the ImageBind model?

**Answer**: ImageBind is an approach developed by Facebook AI Research to learn a joint embedding across six different modalities, including images, text, audio, depth, thermal, and IMU data. The approach uses the binding property of images to align each modality's embedding to image embeddings and achieve an emergent alignment across all modalities. This enables novel multimodal capabilities, including cross-modal retrieval, embedding-space arithmetic, and audio-to-image generation, among others. The approach sets a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Additionally, it shows strong few-shot recognition results and serves as a new way to evaluate vision models for visual and non-visual tasks.

-> **Question**: How does Compositional Reasoning with Large Language Models works?

**Answer**: Compositional reasoning with large language models refers to the ability of these models to correctly identify and represent complex concepts by breaking them down into smaller, more basic parts and combining them in a structured way. This involves understanding the syntax and semantics of language and using that understanding to build up more complex meanings from simpler ones.

In the context of the paper "Does CLIP Bind Concepts? Probing Compositionality in Large Image Models", the authors focus specifically on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way. They examine CLIP's ability to compose concepts in a single-object setting, as well as in situations where concept binding is needed.

The authors situate their work within the tradition of research on compositional distributional semantics models (CDSMs), which seek to bridge the gap between distributional models and formal semantics by building architectures which operate over vectors yet still obey traditional theories of linguistic composition. They compare the performance of CLIP with several architectures from research on CDSMs to evaluate its ability to encode and reason about compositional concepts.
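
The loop above threads chat_history through each call as a list of (question, answer) tuples, so that later questions can be interpreted in light of earlier turns. The sketch below reproduces only that bookkeeping. The stand-in callable fake_qa and the helper run_questions are hypothetical; fake_qa replaces the real chain so the example runs without an API key or network access.

```python
def fake_qa(inputs: dict) -> dict:
    """Stand-in for the chain: reports how many earlier turns it was given."""
    turns = len(inputs["chat_history"])
    return {"answer": f"(answer to {inputs['question']!r}, seen {turns} earlier turn(s))"}


def run_questions(qa, questions):
    """Ask each question in order, appending (question, answer) after every call."""
    chat_history = []
    for question in questions:
        result = qa({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


history = run_questions(fake_qa, ["first question", "second question"])
for question, answer in history:
    print(question, "->", answer)
```

The first call sees an empty history and the second call sees one earlier turn, which mirrors how the real chain receives the accumulated conversation.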

questions = [
"What are Heat-bath random walks with Markov base? Include references to answer.",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
    -> **Question**: What are Heat-bath random walks with Markov base? Include references to answer. 

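**Answer**: Heat-bath random walks with Markov base (HB-MB) is a class of stochastic processes that have been studied in the field of statistical mechanics and condensed matter physics. In these processes, a particle moves in a lattice by making a transition to a neighboring site, which is chosen according to a probability distribution that depends on the energy of the particle and the energy of its surroundings.

The HB-MB process was introduced by Bortz, Kalos, and Lebowitz in 1975 as a way to simulate the dynamics of interacting particles in a lattice at thermal equilibrium. The method has been used to study a variety of physical phenomena, including phase transitions, critical behavior, and transport properties.

References: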

Bortz, A. B., Kalos, M. H., & Lebowitz, J. L. (1975). A new algorithm for Monte Carlo simulation of Ising spin systems. Journal of Computational Physics, 17(1), 10-18.

Binder, K., & Heermann, D. W. (2010). Monte Carlo simulation in statistical physics: an introduction. Springer Science & Business Media.