According to an agreement with the European Commission's Publications Office (OPOCE), the AC corpus can be used and distributed for research purposes, provided that the following usage conditions are adhered to:
The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union and related COM and SEC series as well as charters and treaties and ECJ case-law to be in the public domain. Prior written permission is thus not required for their reproduction/translation, and they may be reproduced/translated freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgment is given to the European Communities and to the source, and provided that the additional guidelines set out below are respected.
(1) Whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union, a prominently positioned disclaimer should read:
'Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.'
(2) For the reasons stated in the disclaimer above, it is advisable to ensure that translations are made from the printed, authentic version of the Official Journal. This precaution, while minimizing the risk of error, does not confer any legal status whatsoever to the translated text. The following notice shall accompany the translated text, printed below the acknowledgment:
'Originally published in the official languages of the European Union in the Official Journal of the European Union by the Office for Official Publications of the European Communities. Responsibility for the translation into [specify language] from the original [specify language] edition lies entirely with [name of translation copyright holder].'
Moreover, please note that we do not consider a "further commercial dissemination" the inclusion, as reference material for consultation purposes, of small amounts of relevant legislative texts in articles/thesis/studies/reports/books issued by third-party authors or publishers, whatever the means, and disseminated subject to payment.
When using the JRC-Acquis, please acknowledge this in your publications by citing the following paper: Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, Genoa, Italy, 2006. Please check http://langtech.jrc.it//index.html#Publications for more recent publications on the subject.
Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.
The CELEX Web site was queried for all CELEX descriptors in all the languages; note that not every document with a given CELEX number is available in every language. For a given CELEX code and a given language, the query URL for most documents was: http://europa.eu.int/smartapi/cgi/sga_doc?smartapi%21celexplus%21prod%21CELEXnumdoc&numdoc=CELEXCODE&lg=LG where the two parameters CELEXCODE and LG are to be replaced by their respective values.
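The substitution of the two parameters can be sketched as follows. This is only an illustration of the URL pattern quoted above, not the script actually used for the download; the function name, the sample CELEX code, and the form of the language value are assumptions, and the europa.eu.int address may no longer resolve.

```python
# URL pattern as quoted in the text; CELEXCODE and LG become format fields.
QUERY_TEMPLATE = (
    "http://europa.eu.int/smartapi/cgi/sga_doc"
    "?smartapi%21celexplus%21prod%21CELEXnumdoc"
    "&numdoc={celex}&lg={lang}"
)


def build_celex_url(celex_code: str, lang: str) -> str:
    """Return the query URL for one CELEX code and one language.

    Hypothetical helper: both values are inserted unchanged, since the
    text does not say whether any escaping or case folding was applied.
    """
    return QUERY_TEMPLATE.format(celex=celex_code, lang=lang)


if __name__ == "__main__":
    # "32003R1234" is a made-up code used only to show the substitution.
    print(build_celex_url("32003R1234", "en"))
```

Looping such a helper over every CELEX code and every language code would produce the full list of query URLs; documents that are missing for a given language would then have to be detected from the response.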
The downloaded documents were then automatically checked for the language, because not all documents were actually in the expected language.
Only documents available in ten or more languages, at least three of which belong to the group Czech, Estonian, Hungarian, Lithuanian, Latvian, Maltese, Polish, Slovak, Slovene, were included in the final corpus.
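The selection rule above can be expressed as a small predicate. This is a minimal sketch under the assumption that, for each CELEX code, the set of languages in which the document was actually obtained is known as ISO 639 two-letter codes; the names REQUIRED_GROUP and is_selected are illustrative and do not come from the corpus tools.

```python
# ISO 639 two-letter codes for Czech, Estonian, Hungarian, Lithuanian,
# Latvian, Maltese, Polish, Slovak and Slovene.
REQUIRED_GROUP = frozenset({"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"})


def is_selected(languages, min_languages=10, min_from_group=3):
    """Return True if a document meets the inclusion rule.

    languages: iterable of two-letter codes in which the document exists.
    The thresholds default to the values stated in the text.
    """
    available = set(languages)
    from_group = available & REQUIRED_GROUP
    return len(available) >= min_languages and len(from_group) >= min_from_group


if __name__ == "__main__":
    # Ten languages, three of them from the required group.
    print(is_selected(["en", "fr", "de", "it", "es", "pt", "nl", "cs", "et", "hu"]))
```

A document available in ten languages but in only two from the listed group would be rejected, as would one available in all nine group languages and no others.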
No correction was performed on the text.
After the HTML documents were downloaded, the texts were converted to XML. A specific character encoding conversion tried to keep Greek characters; no other character conversions were performed. The title and body of each text were isolated, and the paragraph breaks (<P> HTML tags) were kept.
A list of manually written rules detected the beginning of the annexes and signatures of texts. These rules were used to distinguish between the text body, the signature of the text and the annexes. They were written by developers who do not speak all of the 22 languages, so it is likely that some signatures and annexes were missed and that others were recognized wrongly.
Barring errors, no end-of-line hyphenation is present in the texts.
The two-letter language codes follow ISO 639 and are defined in the language usage element.
All documents have a numerical identifier called the CELEX code (see http://europa.eu.int/celex/). This code identifies the same text in the various languages.
Further information on the corpus compilation steps can be found in: Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, Genoa, Italy, 2006. Available at http://langtech.jrc.it//index.html#Publications.