MediaWikiDump

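MediaWiki XML dumps contain the content of a wiki (the wiki pages and all of their revisions) but none of the site-related data. An XML dump is not a full backup of the wiki database: it does not include user accounts, images, edit logs, and so on.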

This page covers how to load a MediaWiki XML dump file into a document format that we can use downstream.

The loader uses mwxml from mediawiki-utilities to read the dump and mwparserfromhell from earwig to parse MediaWiki wikicode.

Dump files can be obtained with dumpBackup.php or from the wiki's Special:Statistics page.
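For orientation, the pages-and-revisions layout described above can be inspected with nothing but the standard library. The following is a minimal sketch, not what the loader does internally: `SAMPLE_DUMP` is a hypothetical, heavily trimmed dump (real dumps declare an XML namespace on the root element and carry many more fields), and `summarize_pages` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed dump: one regular page with two revisions
# and one redirect page. Real dumps declare an XML namespace on the root
# element and include many more fields per page and revision.
SAMPLE_DUMP = """\
<mediawiki>
  <page>
    <title>Album</title>
    <ns>0</ns>
    <revision>
      <id>1</id>
      <text>First draft of the Album page.</text>
    </revision>
    <revision>
      <id>2</id>
      <text>Second draft, linking to [[Artist]].</text>
    </revision>
  </page>
  <page>
    <title>LP</title>
    <ns>0</ns>
    <redirect title="Album" />
    <revision>
      <id>3</id>
      <text>#REDIRECT [[Album]]</text>
    </revision>
  </page>
</mediawiki>
"""


def summarize_pages(xml_text):
    """Return one summary dict per <page> element in the dump text."""
    root = ET.fromstring(xml_text)
    summaries = []
    for page in root.findall("page"):
        revisions = page.findall("revision")
        summaries.append(
            {
                "title": page.findtext("title"),
                "namespace": int(page.findtext("ns")),
                "is_redirect": page.find("redirect") is not None,
                "revision_count": len(revisions),
                # Text of the last <revision> element listed for the page.
                "last_revision_text": revisions[-1].findtext("text") if revisions else None,
            }
        )
    return summaries


for summary in summarize_pages(SAMPLE_DUMP):
    print(summary)
```

Each page carries a title, a namespace number, an optional redirect marker, and one or more revisions holding raw wikicode. Nothing about users, images, or logs appears in the file.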

# mediawiki-utilities supports XML schema 0.11 only in unmerged branches
pip install -qU git+https://github.com/mediawiki-utilities/python-mwtypes@updates_schema_0.11
# mediawiki-utilities mwxml has a bug; a PR with the fix is pending
pip install -qU git+https://github.com/gdedrouas/python-mwxml@xml_format_0.11
pip install -qU mwparserfromhell
from langchain.document_loaders import MWDumpLoader

loader = MWDumpLoader(
    file_path="example_data/testmw_pages_current.xml",
    encoding="utf8",
    # namespaces=[0, 2, 3],  # optional list; load only these namespaces (default: all namespaces)
    skip_redirects=True,  # skip pages that only redirect to another page (set to False to keep them)
    stop_on_error=False,  # skip pages that cause parsing errors (set to True to raise instead)
)
documents = loader.load()
print(f"You have {len(documents)} document(s) in your data")
    You have 177 document(s) in your data
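The three optional arguments in the loader call above can be read as a per-page filter. The sketch below mimics that behavior in plain Python so the interaction between the options is easier to see. It is an illustration under stated assumptions, not the loader's source: `Page`, `select_pages`, and `parse` are hypothetical names, and the default values chosen here are for the sketch only.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a page read from a dump."""

    title: str
    namespace: int
    is_redirect: bool
    text: str


def select_pages(pages, parse, namespaces=None, skip_redirects=False, stop_on_error=True):
    """Apply the namespace, redirect, and error-handling options to each page."""
    results = []
    for page in pages:
        if skip_redirects and page.is_redirect:
            continue  # redirect-only page, dropped on request
        if namespaces and page.namespace not in namespaces:
            continue  # page is outside the requested namespaces
        try:
            results.append(parse(page))
        except Exception:
            if stop_on_error:
                raise  # surface the first parsing failure
            continue  # otherwise drop the page and keep going
    return results


def parse(page):
    """Toy parser (an assumption): fails on an unclosed template."""
    if "{{broken" in page.text:
        raise ValueError(f"cannot parse {page.title}")
    return page.title


pages = [
    Page("Album", 0, False, "An album page."),
    Page("LP", 0, True, "#REDIRECT [[Album]]"),
    Page("User:Example", 2, False, "A user page."),
    Page("Talk:Album", 1, False, "A talk page."),
    Page("Broken", 0, False, "{{broken"),
]

print(select_pages(pages, parse, namespaces=[0, 2], skip_redirects=True, stop_on_error=False))
# prints ['Album', 'User:Example']
```

With `stop_on_error=False` the unparsable page is dropped silently, which is convenient for large dumps but can hide missing pages, so it may be worth comparing the document count against the page count of the dump.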
documents[:5]
    [Document(page_content='\t\n\t\n\tArtist\n\tReleased\n\tRecorded\n\tLength\n\tLabel\n\tProducer', metadata={'source': 'Album'}),
Document(page_content='{| class="article-table plainlinks" style="width:100%;"\n|- style="font-size:18px;"\n! style="padding:0px;" | Template documentation\n|-\n| Note: portions of the template sample may not be visible without values provided.\n|-\n| View or edit this documentation. (About template documentation)\n|-\n| Editors can experiment in this template\'s [ sandbox] and [ test case] pages.\n|}Category:Documentation templates', metadata={'source': 'Documentation'}),
Document(page_content='Description\nThis template is used to insert descriptions on template pages.\n\nSyntax\nAdd <noinclude></noinclude> at the end of the template page.\n\nAdd <noinclude></noinclude> to transclude an alternative page from the /doc subpage.\n\nUsage\n\nOn the Template page\nThis is the normal format when used:\n\nTEMPLATE CODE\n<includeonly>Any categories to be inserted into articles by the template</includeonly>\n<noinclude>{{Documentation}}</noinclude>\n\nIf your template is not a completed div or table, you may need to close the tags just before {{Documentation}} is inserted (within the noinclude tags).\n\nA line break right before {{Documentation}} can also be useful as it helps prevent the documentation template "running into" previous code.\n\nOn the documentation page\nThe documentation page is usually located on the /doc subpage for a template, but a different page can be specified with the first parameter of the template (see Syntax).\n\nNormally, you will want to write something like the following on the documentation page:\n\n==Description==\nThis template is used to do something.\n\n==Syntax==\nType <code>{{t|templatename}}</code> somewhere.\n\n==Samples==\n<code><nowiki>{{templatename|input}}</nowiki></code> \n\nresults in...\n\n{{templatename|input}}\n\n<includeonly>Any categories for the template itself</includeonly>\n<noinclude>[[Category:Template documentation]]</noinclude>\n\nUse any or all of the above description/syntax/sample output sections. You may also want to add "see also" or other sections.\n\nNote that the above example also uses the Template:T template.\n\nCategory:Documentation templatesCategory:Template documentation', metadata={'source': 'Documentation/doc'}),
Document(page_content='Description\nA template link with a variable number of parameters (0-20).\n\nSyntax\n \n\nSource\nImproved version not needing t/piece subtemplate developed on Templates wiki see the list of authors. Copied here via CC-By-SA 3.0 license.\n\nExample\n\nCategory:General wiki templates\nCategory:Template documentation', metadata={'source': 'T/doc'}),
Document(page_content='\t\n\t\t \n\t\n\t\t Aliases\n\t Relatives\n\t Affiliation\n Occupation\n \n Biographical information\n Marital status\n \tDate of birth\n Place of birth\n Date of death\n Place of death\n \n Physical description\n Species\n Gender\n Height\n Weight\n Eye color\n\t\n Appearances\n Portrayed by\n Appears in\n Debut\n ', metadata={'source': 'Character'})]
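Several of the documents shown above come from infobox-style templates and contain little besides whitespace and field labels. One possible downstream step is to drop such sparse documents before further processing. This is a sketch only: `drop_sparse_documents` and its `min_chars` threshold are hypothetical, and `SimpleNamespace` stands in for any object with `page_content` and `metadata` attributes.

```python
from types import SimpleNamespace


def drop_sparse_documents(docs, min_chars=50):
    """Keep documents with at least min_chars non-whitespace characters."""
    kept = []
    for doc in docs:
        # Remove all whitespace before measuring how much text is left.
        visible = "".join(doc.page_content.split())
        if len(visible) >= min_chars:
            kept.append(doc)
    return kept


# Stand-ins (an assumption) shaped like two of the documents listed above.
sample = [
    SimpleNamespace(
        page_content="\t\n\t\n\tArtist\n\tReleased\n\tRecorded\n\tLength\n\tLabel\n\tProducer",
        metadata={"source": "Album"},
    ),
    SimpleNamespace(
        page_content="Description\nA template link with a variable number of parameters (0-20).",
        metadata={"source": "T/doc"},
    ),
]

kept = drop_sparse_documents(sample)
print([doc.metadata["source"] for doc in kept])
# prints ['T/doc']
```

The threshold is arbitrary and should be tuned against the actual wiki, since short but meaningful pages would also be removed.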