Python and Java libraries to parse the Wikipedia dump dataset

Alternative parsers

From MediaWiki.org
 

This page is a compilation of links, descriptions, and status reports of the various alternative MediaWiki parsers—that is, programs and projects, other than MediaWiki itself, which are able or intended to translate MediaWiki's text markup syntax into something else. Some of these have quite narrow purposes, others are possible contenders for replacing the somewhat labyrinthine code that currently drives MediaWiki itself.

Many of the things linked here are likely to be out of date and under-maintained, even abandoned. But in the interest of not duplicating the same work over and over, it seemed sensible to collect together what was "out there".
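To make "translating MediaWiki markup into something else" concrete, here is a deliberately tiny sketch that converts a few constructs (bold, italic, internal links, headings) to HTML using only regular expressions. It is not taken from any project listed on this page; the function name `wikitext_to_html`, the `/wiki/` link prefix, and the choice of supported constructs are assumptions made for illustration. Real parsers also have to deal with templates, tables, nesting and many edge cases that this ignores.

```python
import html
import re


def _link(match: "re.Match[str]") -> str:
    """Render [[Target]] or [[Target|label]] as an anchor (hypothetical /wiki/ prefix)."""
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    return '<a href="/wiki/{}">{}</a>'.format(target.replace(" ", "_"), label)


def _heading(match: "re.Match[str]") -> str:
    """Render == Heading == as <hN>, where N is the number of '=' signs."""
    level = len(match.group(1))
    return "<h{0}>{1}</h{0}>".format(level, match.group(2))


def wikitext_to_html(text: str) -> str:
    """Convert a very small subset of MediaWiki markup to HTML."""
    # Escape &, < and > first; quote=False leaves apostrophes alone,
    # which the bold/italic rules below rely on.
    out = html.escape(text, quote=False)
    # '''bold''' has to be handled before ''italic''.
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)
    out = re.sub(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]", _link, out)
    out = re.sub(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", _heading, out, flags=re.M)
    return out
```

For example, `wikitext_to_html("'''Bold''' and [[Main Page|home]]")` yields `<b>Bold</b> and <a href="/wiki/Main_Page">home</a>`. Because each rule is applied independently, overlapping or nested markup can produce malformed HTML, which is one reason many of the projects below build a parse tree instead.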

Known implementations

Name and link | Principal author(s) | Language | Input | Output | Comments / other info | License
--- | --- | --- | --- | --- | --- | ---
WikiPops.com | Max Freedom | .NET | Wiki title | HTML | A website that converts Wiki markup to HTML. Allows the user to browse for a Wiki title and return the full HTML or an abstract. |
Wiky.php | Toni Lähdekorpi | PHP, Regular Expressions | Markup | HTML | A tiny PHP library that uses only regular expressions to convert Wiki markup to HTML. | Apache License/GPL/LGPL/MPL/CC
sanskritnlp | Vishvas Vasuki | Scala | MediaWiki text | MediaWiki text and section tree | Only parses MediaWiki sections, nothing else. One can parse a wiki page with multiple sections, get a section tree, and add, access and delete sections. | Creative Commons
Wiky | Tanin Na Nakorn | Ruby | Markup | HTML | A simple Ruby library to convert Wiki markup to HTML. | Apache License
Wiky.js | Tanin Na Nakorn | JavaScript | Markup | HTML | A simple JavaScript library to convert Wiki markup to HTML (limited subset). | Apache License
txtwiki.js | Joao Sa | JavaScript | Markup | Text | A JavaScript library to convert MediaWiki markup to plain text. | MIT License
wikipedia-js | kenshiro_o | Node.js | Markup | HTML | A simple client that lets you query Wikipedia articles in English. The results are formatted in basic HTML. You can retrieve either a summary of an article (i.e. the part before the table of contents) or the full article. | MIT
WikiExtractor | Giuseppe Attardi, Antonio Fuschetto | Python | XML dumps | Text | Simple and fast tool for extracting plain text from Wikipedia dumps. It performs template expansion and handles parser functions (core and extended). | GPL
mw2html | Connelly Barnes | Python | Wiki URL | HTML | Minimal setup; gets the basic job of creating a static copy of the wiki done. | Public Domain
mwlib | PediaPress.com | Python with C library | Markup and other | Parse tree, HTML, PDF, XML, OpenDocument | Part of a cooperation between the Wikimedia Foundation and PediaPress. | BSD
Mediawiki2HTML Machine | Johannes Buchner | PHP | Markup | HTML | Project for parsing without the MediaWiki engine. | AGPL3 + any later version
PHP5 WP | Dan Goldsmith | PHP | Markup | HTML | Parser with a plugin framework for adding additional syntax. Configurable for alternative markup, e.g. PmWiki. | MPL 2.0
Mylyn WikiText | David Green | Java | Local files | HTML, DocBook, Eclipse Help, DITA, extensible | Integration with Ant and the Eclipse runtime. |
Java API (Bliki engine) | axelclk | Java | Markup fragment | HTML, PDF | Java Wikipedia API (supports ParserFunctions, Lua/Scribunto...). |
FlexBisonParse | Timwi | flex, bison and C | Markup fragment | Custom XML | Intended as an eventual replacement for the parsing code inside MediaWiki itself. |
JAMWiki | Ryan | Java | JAMWiki front-end | HTML | Java wiki engine that supports MediaWiki syntax. The roadmap also calls for XML import and export compatible with MediaWiki. |
InstaView | Pilaf | JavaScript | Markup fragment | HTML | Provides instant preview while editing a page (without reloading). |
InstaView | C. Scott Ananian | JavaScript | Markup fragment | HTML | Port of Pilaf's code to node.js, volo, and the browser. |
Perl Wikipedia Toolkit | Michal Jurosz | Perl | XML dump, SQL dump | Own parse tree, WikiMedia markup | Perl Wikipedia Toolkit developed for computer-assisted Wikipedia translation. (Limited functionality.) |
Text_Wiki_Mediawiki | Multiple | PHP | Markup | HTML, LaTeX, plain text | Part of the Text_Wiki library. |
TomeRaider export | Erik Zachte | Perl | XML dump | TomeRaider database | See en:Wikipedia:TomeRaider database for more details. |
Waikiki | Magnus Manske | C++ | SQL dump (via SQLite) | HTML | Abandoned in favour of "flexbisonparse", but has been used inside some experimental "front ends". |
Wikiwyg | Jim Higson | JavaScript | A live installation of MediaWiki | HTML (via XML) | More than just a parser; attempts to create a fully functional client-side interface. |
wik2dict | Guaka | Python | SQL dump | DICT | |
wiki2pdf | Stephan Walter | Python (and PHP) | Markup fragment or set of online articles | LaTeX, PDF | Project is incomplete and dormant. |
wb2pdf | Dirk Hünniger | Haskell | Online article | LaTeX, PDF, parse tree | Recursive descent based on monadic parser combinators. Allows for non-context-free input, especially the malformed HTML often found on Wikipedia. | GPL
WikiPDF | Felipe Sanches | Python (and PHP) | One selected article | LaTeX based on templates, PDF | MediaWiki extension that uses Stephan Walter's wiki2pdf as backend. |
Wiki2XML | Magnus Manske | C++ | Markup fragment (?) | Custom XML | Another aborted project on the way to 'flexbisonparse'. |
HTML2FPDF | Renato A. C. | PHP | A PHP class that transforms HTML into a feed for FPDF resulting in a PDF file | HTML -> HTML2FPDF -> FPDF -> PDF | Not specifically for MediaWiki, but easy to install using an updated version of this tool: updated html2fpdf.php. See HTML2FPDF and Mediawiki for more instructions. |
WikiOnCD | Andrew Rodland | Perl | SQL dump or markup | HTML, parse tree (eventually?) | Started out as an offline wiki browser, but grew a parser when Wiki2static turned out to be too limiting. No web presence yet; code is in the SVN. | GPL
WikiTaxi | Ralf Junker | Delphi / Pascal | MediaWiki markup, page or fragment | Node tree, HTML, potentially others | Hand-crafted parser with template expansion, parser functions (core and extended), tag extensions (<ref>, <source>), and wiki text parsing. Used for the WikiTaxi offline reader. | No sources available
Wikifilter | ? | C++ (VS) | XML dumps | HTML | A Windows program that uses Apache/IIS to serve the pages. Abandoned in 2006, before ParserFunctions were available. |
Wikipedia Dump Reader | Benjamin Thyreau | Python | XML dumps | On screen | Cross-platform viewer. | GPLv2/~BSD license
Marker | Ryan Blue | Ruby | Markup (subset) | HTML or formatted text | Marker is a Ruby implementation of a subset of the MediaWiki markup language, intended to bring MediaWiki's markup language to non-wiki applications with multiple output formats. | GPL
WikiCloth | nricciar | Ruby | Markup | HTML | Ruby implementation of the MediaWiki markup language, including a fair number of the parser functions. | MIT
XWiki | XWiki dev team | Java | Various wiki markups | Well-formed sequence of events, HTML/XHTML, other wiki markups | XWiki can be used as a full-fledged wiki supporting several wiki markups (including MediaWiki's). It also offers a standalone rendering engine that can be used as a Java library for parsing and rendering wiki markups. It cannot output MediaWiki format as of 2016/03, though. | LGPL
Kiwi | Thomas Luce, Karl Matthias, AboutUs.org | C, Ruby, PEG | Markup | HTML | Kiwi is a PEG-based C implementation with Ruby bindings and a command-line parser. It is very fast and supports most of the MediaWiki syntax. Actively developed. | BSD
YaCy | YaCy dev team | Java | XML dump | XML with Dublin Core metadata | YaCy is a search engine, and a MediaWiki parser is included as one of its import modules. MediaWiki XML dumps are first converted to Dublin Core XML as an intermediate format and then inserted into the search index using the built-in Dublin Core importer. | GPL
MessageParser | Neil Kandalgaonkar | JavaScript | Markup | Abstract syntax tree, jQuery object, HTML | Designed for use with message strings, to allow an enhanced interface in the browser, such as pluralizing internationalized messages or attaching jQuery behavior to links within a message. | GPL
Sweble Wikitext Parser | Hannes Dohrn | Java | Markup | Abstract syntax tree, XML, HTML | Claims to be very thorough. | Apache License 2.0
JWPL api | Torsten Zesch, Richard Eckart de Castilho, Oliver Ferschke, Elisabeth Niemann | Java | XML dump | API to access pages, outlinks, inlinks and more | "JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia." "JWPL is for you: If you need structured access to Wikipedia in Java." The older parser is no longer maintained; JWPL uses Sweble now. | LGPL
libmwparser | Saitmoh | C | XML dumps, markup | XML, XHTML, expanded wikitext | Primarily a Wikimedia offline reader with interwiki support. Libmwparser is a source-independent library which supports most of the MediaWiki syntax and some extensions such as math or gallery. | GPL
mediawiki-parser | Peter Potrowl, Erik Rose | Python | Markup | XHTML, raw text, AST | GSoC 2011 project; the use of a PEG parser makes it easy to improve. Parser functions are not supported yet. | GPL
Parsoid | Gabriel Wicke and the Parsoid / Visual editor team | PEG / JavaScript / Node.js | Markup, XML dumps, test cases | Tokens, HTML5 DOM with RDFa and round-trip data | Fully featured round-tripping parser/runtime that powers the Visual editor on Wikipedia. Work is ongoing to provide an HTML-only read/edit interface, and later to become the default parser for MediaWiki. See roadmap. Used to make this edit. | GPL
mwparserfromhell | The Earwig | Python | Markup | AST | A Python library to convert Wiki markup to a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, with no dependencies. | MIT License
Saya.Parser.Wiki | Nana Sakisaka | C++ | Markup | Abstract syntax tree | Pure C++11 parser implemented with Boost.Spirit.Qi. | Boost Software License 1.0
smc.mw | Marcus Brinkmann | Python | Markup | AST, HTML | Stateful PEG parser based on Grako, with a very clean separation of parsing stages, grammars and semantic transformations. | BSD
Pandoc | John MacFarlane | Haskell | Markup | Many | Can convert a subset of MediaWiki markup to ~35 different formats (5 of which are flavors of Markdown). | GPLv2
Wikiforia | Marcus Klang | Java | XML dumps, markup | Text | Uses the AST output from the Sweble Wikitext Parser internally to produce raw text. Can decompress and parse compressed multistream XML dumps in parallel. | GPLv2
wtf_wikipedia | Spencer Kelly | JavaScript | Markup | JSON | Supports recursive links and templates, parses infoboxes and links, resolves special templates, parses images and categories. Runs server-side and in the browser. |
Wiki-infobox-parser | Zhipeng Jiang | JavaScript | Markup | JSON | A light Wikipedia infobox parser written in JavaScript. | MIT
wikitextparser | 5j9 | Python | Markup | AST | Provides several accessor methods in an object tree to navigate to structural elements such as sections, tables and links. Supports extracting table data as a list of lists. Available via pip, supports Python 3. | GPL
PHP-Wikipedia-Syntax-Parser | Don Wilson | PHP | Markup | Associative array | Given the raw contents and title of a Wikipedia article, this outputs highly useful information in an organized fashion. |

A non-parser dumper

One of the common uses of alternative parsers is to dump wiki content into static form, such as HTML or PDF. Tim Starling has written a script which isn't a parser, but uses the MediaWiki internal code to dump an entire wiki to HTML from the command line. See Extension:DumpHTML. This was used (years ago) to create the static dumps at https://dumps.wikimedia.org


 

[Source] https://www.mediawiki.org/wiki/Alternative_parsers

 
