Hosted by SourceForge.net (project page)
Wikiprep is a Perl script that parses MediaWiki data dumps in XML format and extracts useful information from them. It implements a subset of MediaWiki syntax (such as template inclusion with parameters, internal and external links, headings, redirects, etc.). Output is in the form of several files: some of them in simple, line-oriented format and some of them in XML. One of the files also contains processed Wikipedia pages in a simple HTML-like syntax.
The goal of Wikiprep is to convert Wikipedia data dumps into a format that can be easily processed with other tools. These tools then do not need to have the full knowledge of all quirks and odd corners of MediaWiki syntax.
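To make the idea concrete, here is a minimal Python sketch of this kind of extraction. It is not Wikiprep's code and does not reproduce its output files. The sample dump, the function names `extract_pages` and `to_lines`, and the tab-separated output layout are all assumptions made for illustration. It only shows how internal links and redirects could be pulled from `<page>` elements of a dump and written out one page per line.

```python
import io
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki XML dump. Real dumps carry an XML namespace
# and many more fields; both are omitted here to keep the sketch short.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Perl</title>
    <id>1</id>
    <revision><text>Perl is a [[programming language]] by [[Larry Wall|Wall]].</text></revision>
  </page>
  <page>
    <title>Perl language</title>
    <id>2</id>
    <revision><text>#REDIRECT [[Perl]]</text></revision>
  </page>
</mediawiki>"""

# [[Target]], [[Target|label]] and [[Target#section]]: capture only the target.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def extract_pages(stream):
    """Yield (title, kind, targets) for each <page>, streaming the XML."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title")
        text = elem.findtext("revision/text") or ""
        redirect = REDIRECT_RE.match(text)
        if redirect:
            yield title, "redirect", [redirect.group(1).strip()]
        else:
            yield title, "article", [t.strip() for t in LINK_RE.findall(text)]
        elem.clear()  # release the page's subtree once it has been handled


def to_lines(pages):
    """Format pages as tab-separated lines (a hypothetical layout)."""
    return ["\t".join([title, kind] + targets) for title, kind, targets in pages]


if __name__ == "__main__":
    dump = io.BytesIO(SAMPLE_DUMP.encode("utf-8"))
    for line in to_lines(extract_pages(dump)):
        print(line)
```

A downstream tool can then split each line on tabs without knowing anything about MediaWiki syntax. Template inclusion, headings and external links, which Wikiprep also handles, are left out of this sketch.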
Wikiprep was initially developed by Evgeniy Gabrilovich.
We have a mailing list for announcements and general discussion. This is the best place to ask questions or send bug reports and feature requests.
The CVS Head branch contains the original Wikiprep code with minor modifications. It is currently maintained by Chris Jordan. This is the version you get if you do a CVS checkout without any special options.
You can get more information about this version from the original Wikiprep page.
This is the version of Wikiprep that Zemanta uses for extracting semantic information from Wikipedia. It is based on the original Wikiprep, but it is heavily modified and extracts different information from dumps than the Head branch does. This branch is currently maintained by Tomaž Šolc.
You can get this version of Wikiprep from a git repository at the following URL (the old TOMAZ_1 branch at SourceForge CVS is no longer updated):
http://code.zemanta.com/tsolc/git/wikiprep
The simplest way is to use a command like this:
$ git clone http://code.zemanta.com/tsolc/git/wikiprep
Refer to the README file for further instructions.
Note that this version of Wikiprep is actively developed and its features can change significantly over time. As of July 2008 it requires significantly less CPU time and memory than the original. If you plan to use it for anything serious, it is best to subscribe to the mailing list and get in contact with the maintainer.
The information currently extracted: