The Romanian side of the multilingual JRC-Acquis corpus was XML-encoded at RACAI from the MS-Word documents downloaded from the CCVista Translation Database at http://ccvista.taiex.be/Fulcrum/CCVista/RO/. Of the 19,286 files available at CCVista, 19,211 were successfully converted (the rest contained format errors).

The texts were normalised for diacritical signs. All the documents before 2006 used, for the two diacritical characters ş and ţ, the old codes corresponding to the s-cedilla and t-cedilla SGML entities. The MS-Word documents after 2006 used for the same characters the new codes corresponding to the &scomma; and &tcomma; entities. The normalised files use the UTF codes corresponding to the s-cedilla and t-cedilla SGML entities (ş and ţ). While in Vista both codes are visible, under other operating systems the new codes are illegible. For further details please see the discussion at http://en.wikipedia.org/wiki/Romanian_alphabet#ISO_8859.

The translators' comments, page numbers, headers, footers, end notes and footnotes were removed (as well as the superscript references to them), but not the annexes and signatures.

The 19,211 documents contain 182,631,277 characters and 30,832,212 words. Of the 19,211 Romanian documents, 11,469 are shared with the English documents (they have the same CELEX code).

For this new version of the Romanian part of the JRC-Acquis, Vanilla or HunAlign alignment information towards JRC-Acquis version 3 is not available.

Dan Tufis, 17 February 2009
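
The diacritic normalisation described in this note can be illustrated with a minimal Python sketch. It is not the conversion code used at RACAI. The function name normalise_diacritics and the mapping table are hypothetical, and the direction of the mapping (comma-below letters to cedilla letters) is inferred from the statement that the normalised files use the s-cedilla and t-cedilla codes. Including the capital letters is a further assumption, since the note only mentions the lowercase pair.

```python
# Hypothetical illustration only; not the RACAI conversion tool.
# Maps the "new" comma-below letters to the "old" cedilla letters,
# the form the note says the normalised files use.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # capital S with comma below -> capital S with cedilla
    "\u0219": "\u015F",  # small s with comma below   -> small s with cedilla
    "\u021A": "\u0162",  # capital T with comma below -> capital T with cedilla
    "\u021B": "\u0163",  # small t with comma below   -> small t with cedilla
})


def normalise_diacritics(text: str) -> str:
    """Return text with comma-below s/t replaced by their cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)


if __name__ == "__main__":
    sample = "\u0219i \u021Bar\u0103"  # written with comma-below letters
    print(normalise_diacritics(sample))
```

Text that already uses the cedilla forms passes through unchanged, so the function can safely be applied to a mix of pre-2006 and post-2006 documents.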