Wikipedia as text

KOPI - The best choice for searching translated plagiarism

KOPI is a unique service that is able to identify quotations - and also their translations - taken from the English Wikipedia. KOPI can run searches in Hungarian, English and German, and is being continuously improved to support other languages as well.

When seeking information on the web, Wikipedia is an essential source. The English version features nearly 4 million articles. Studies show that it is also the number one source of plagiarism, so when we created our new translational plagiarism checker, we looked for a way to add this vast source of information to our database. We found that it is impossible to download the whole database in an easy-to-handle format (such as HTML or plain text) and that all the available MediaWiki converters had some flaws. So we wrote a converter from MediaWiki XML dumps to plain text. We run it every time a new database dump appears on the site and publish the text version for everybody to use.
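
To give a feel for what such a conversion involves, here is a minimal sketch in Python. It is not the converter used to produce these files. It assumes a standard MediaWiki XML export with page, title and text elements, and the helper names (local_name, strip_markup, iter_pages) as well as the tiny SAMPLE document are made up for illustration. The cleanup handles only non-nested templates, simple internal links and bold/italic quotes, so real articles need far more rules.

```python
import io
import re
import xml.etree.ElementTree as ET

# Tiny made-up dump used only for illustration; real dumps are many gigabytes.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[test page|page]] about [[examples]].{{stub}}</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the XML namespace prefix that the export puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def strip_markup(wikitext):
    """Very rough wikitext cleanup; covers only a few common constructs."""
    # Remove non-nested templates such as {{stub}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    # Turn [[target|label]] into label and [[target]] into target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # Remove the quote runs that mark bold and italic text.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def iter_pages(stream):
    """Yield (title, plain_text) pairs while streaming through the dump."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, strip_markup(text)
            elem.clear()  # release the finished page's subtree


if __name__ == "__main__":
    for page_title, plain in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, "->", plain)
```

Streaming with iterparse rather than loading the whole file matters here because a full dump does not fit comfortably in memory. For a real dump, the same generator could be fed an opened (and, if needed, decompressed) file object instead of the in-memory sample.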

Please see the KOPI portal for more information.

Licence

The files in these torrents are derived from Wikipedia content. As such, they are distributed under the following license:
Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0). The full version of the license can be accessed from here.

We would appreciate it if you could refer to this article when using our corpus:
Pataki, M., Vajna, M. and Marosi, A. Wikipedia as Text. ERCIM News - Special theme: Big Data. 2012, Vol. 89, pp. 48-49.

Download Wikipedia Text Dumps

The text version is created according to these principles:

Here you can download Wikipedia as text. We use torrents to spare our resources; if you can, please seed after downloading.



Wikipedia as Text by Máté PATAKI, Miklós VAJNA, Attila Csaba MAROSI (c) MTA SZTAKI Department of Distributed Systems