Main Page

Anonymous Tomaz Solc

What is Wikiprep

Wikiprep is short for Wikipedia preprocessor and information extractor.

It is a Perl script that parses MediaWiki data dumps in XML format and extracts useful information from them. It implements a subset of MediaWiki syntax (template inclusion with parameters, internal and external links, headings, redirects, etc.). The output consists of several files: some in a simple, line-oriented format and some in XML. One of the files also contains the processed Wikipedia pages in a simple HTML-like syntax.

The goal of Wikiprep is to convert Wikipedia data dumps into a format that can be easily processed with other tools. These tools then do not need full knowledge of all the quirks and odd corners of MediaWiki syntax.
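
To give a feel for the kind of work involved, here is a minimal sketch in Python (not Wikiprep's actual Perl code, and far less complete). It streams a MediaWiki XML dump page by page and pulls out internal link targets and redirects. The names `iter_pages`, `LINK_RE` and `SAMPLE` are made up for this illustration, the sample dump is a hand-written fragment, and the regular expression only covers plain `[[Target]]`, `[[Target|label]]` and `[[Target#Section]]` links; templates, nested links and other corner cases are ignored.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified pattern: captures the target of [[Target]],
# [[Target|label]] and [[Target#Section|label]] style internal links.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def local(tag):
    """Strip the XML namespace prefix ('{uri}name' -> 'name')."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, redirect_target_or_None, link_targets) for each <page>."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            links = [target.strip() for target in LINK_RE.findall(text)]
            redirect = None
            # A redirect page starts with #REDIRECT followed by one link.
            if text.lstrip().upper().startswith("#REDIRECT") and links:
                redirect = links[0]
            yield title, redirect, links
            title, text = None, ""
            elem.clear()  # drop the page's children to keep memory use low


# Hand-written stand-in for a real dump, for illustration only.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Wikiprep</title>
    <revision><text>A [[Perl]] script for [[MediaWiki|the wiki engine]] dumps.</text></revision>
  </page>
  <page>
    <title>WikiPrep</title>
    <revision><text>#REDIRECT [[Wikiprep]]</text></revision>
  </page>
</mediawiki>"""

for title, redirect, links in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(title, redirect, links)
```

Even this toy version has to make decisions about namespaces, redirects and link syntax, which is the sort of detail Wikiprep is meant to hide from downstream tools.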

Wikiprep was initially developed by Evgeniy Gabrilovich.

Available versions

There are two distinct versions available:

Chris Jordan's version

As of March 2009, this version cannot process the latest Wikipedia dumps provided by Wikimedia.

This is the original Wikiprep code with some minor modifications.

Chris maintains the CVS repository on SourceForge.net. You can get his code by following these instructions.

Zemanta's Wikiprep

This is a version of Wikiprep that is used by Zemanta for extracting semantic information from Wikipedia. It is based on the original Wikiprep, but is heavily modified and extracts different information from dumps than Chris's version.

This version is currently maintained by Tomaž Šolc. It is kept up to date to support the latest Wikipedia dumps. Problems are usually resolved within a week after a new English Wikipedia dump becomes available.

You can get stable releases of this branch from SourceForge (look for 3.xx releases):

http://sourceforge.net/projects/wikiprep/files

This version of Wikiprep is also available from a public git repository:

$ git clone http://www.tablix.org/~avian/git/wikiprep.git

Refer to the README file for further instructions.

Documentation

Mailing list

We have a mailing list for announcements and general discussion. This is the best place to ask questions or send bug reports and feature requests.


Related

Wiki: File formats
Wiki: Frequently asked questions
Wiki: Getting started