You can find the main page for the DCEP resource at:

https://ec.europa.eu/jrc/en/language-technologies

You will also find further useful linguistic resources there.

DCEP-2013: document-aligned corpora for 24 languages

Example 1: How to download Swedish (SV) source documents in XML/SGML?

wget http://optima.jrc.it/Resources/DCEP-2013/source/DCEP-source-SV-pub.tar.bz2
tar jxf DCEP-source-SV-pub.tar.bz2

The source documents are now in the ./DCEP/source/{xml,sgml}/SV/ subdirectories.
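
The corpus is organised into document-type subdirectories (such as TA, used in Example 5) below these paths. A quick, hedged way to see which types were extracted, using only standard tools:

# List the document-type subdirectories; the exact names depend on the release
ls DCEP/source/xml/SV/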

Example 2: How to download document-level alignment information?

wget http://optima.jrc.it/Resources/DCEP-2013/cross-lingual-index.txt.bz2

The alignment information contains correspondences between documents in different languages: each line lists a set of parallel documents. By changing the file suffixes, the cross-lingual index can be applied to the source documents, to the stripped documents with the markup removed, and to the sentence-segmented documents.
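
Before using the index, you can inspect its format with standard tools; a minimal sketch:

# Peek at the first few entries of the document-level index
bzcat cross-lingual-index.txt.bz2 | head -n 3
# Count how many groups of parallel documents it contains
bzcat cross-lingual-index.txt.bz2 | wc -l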

Example 3: How to download Danish (DA) and Lithuanian (LT) documents with markup removed?

wget http://optima.jrc.it/Resources/DCEP-2013/strip/DCEP-strip-DA-pub.tar.bz2
wget http://optima.jrc.it/Resources/DCEP-2013/strip/DCEP-strip-LT-pub.tar.bz2
tar jxf DCEP-strip-DA-pub.tar.bz2
tar jxf DCEP-strip-LT-pub.tar.bz2

The documents with markup removed are now in the ./DCEP/strip/{xml,sgml}/{DA,LT}/ subdirectories.

Example 4: How to download Danish (DA) and Lithuanian (LT) sentence-segmented documents with markup removed?

wget http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-DA-pub.tar.bz2
wget http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-LT-pub.tar.bz2
tar jxf DCEP-sentence-DA-pub.tar.bz2
tar jxf DCEP-sentence-LT-pub.tar.bz2

The sentence-segmented documents are now in the ./DCEP/sentence/{xml,sgml}/{DA,LT}/ subdirectories.

Example 5: How is the markup removed from the source documents?

The documents with the markup removed are already available for download. This example only demonstrates the process with some of the Swedish documents.

Prerequisites: Execute the commands in Example 1.

wget http://optima.jrc.it/Resources/DCEP-2013/DCEP-strip-scripts.tar.bz2
tar jxf DCEP-strip-scripts.tar.bz2

You need to have Perl and a command-line XSLT processor installed to run the scripts. Remove the markup and insert newlines for all Swedish XML documents of type TA:

# Generate all document-type specific scenarios and XSLT
./strip-scripts/xslt-gen-nogz
# Process TA documents using the specific scenario
find DCEP/source/xml/SV/TA -type f -print0 | xargs -0 -n1 -P 8 ./strip-scripts/gen-scripts/xml/TA/scenario

The documents with markup removed are now located under the directories ./DCEP/strip/{xml,sgml}/SV/. This simple example processed only the documents for scenario TA. N.B. The scripts insert line breaks, for example, to separate the elements of a list. See the file strip-scripts/README for more information.
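
The same pattern extends to the other document types. A hedged sketch, assuming that every subdirectory of DCEP/source/xml/SV/ has a matching generated scenario under strip-scripts/gen-scripts/xml/ (as TA does above):

# Process every Swedish XML document type, not only TA
for type_dir in DCEP/source/xml/SV/*/; do
    doc_type=$(basename "$type_dir")
    # Skip document types for which no scenario was generated
    [ -x "strip-scripts/gen-scripts/xml/$doc_type/scenario" ] || continue
    find "$type_dir" -type f -print0 | \
        xargs -0 -n1 -P 8 "strip-scripts/gen-scripts/xml/$doc_type/scenario"
done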

Example 6: How are the documents sentence-segmented?

The sentence-segmented documents are already available for download. This example only demonstrates the process with some of the Swedish documents. If you wish to use your own sentence-segmentation script, please keep in mind that the line breaks generated by the markup removal signify sentence breaks, but there can be multiple sentences per line.

Prerequisites: Execute the commands in Examples 1 and 5.

wget http://optima.jrc.it/Resources/DCEP-2013/DCEP-sentence-scripts.tar.bz2
tar jxf DCEP-sentence-scripts.tar.bz2

You need to have Perl installed to run the scripts. Apply the sentence-segmentation script to each file:

for infile in $(find DCEP/strip/ -type f -regex "^DCEP/strip/\(xml\|sgml\)/SV/.*"); do
    # Compute the output path
    outfile=$(echo "$infile" | sed 's,^DCEP/strip/,DCEP/sentence/,')
    # Create the necessary directories
    mkdir -p "$(dirname "$outfile")"
    # Do the sentence segmentation
    perl sentence-scripts/split-sentences-DCEP.pl -l sv <"$infile" >"$outfile" 2>/dev/null
done

The sentence-segmented files are now under the directories ./DCEP/sentence/{xml,sgml}/SV/. This simple example can be slow because each file is processed individually. The script sentence-scripts/convert_all.sh uses a faster method that starts only one sentence-segmentation process per language.
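
If convert_all.sh is not convenient, the loop above can at least be parallelised in the spirit of Example 5. A hedged sketch that still starts one process per file, but runs several of them in parallel:

find DCEP/strip/ -type f -regex "^DCEP/strip/\(xml\|sgml\)/SV/.*" -print0 | \
    xargs -0 -P 8 -I{} bash -c '
        # Mirror the strip/ path under sentence/ and segment the file
        outfile="DCEP/sentence/${1#DCEP/strip/}"
        mkdir -p "$(dirname "$outfile")"
        perl sentence-scripts/split-sentences-DCEP.pl -l sv <"$1" >"$outfile" 2>/dev/null
    ' _ {}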

Example 7: How to extract unaligned bilingual parallel documents?

Prerequisites: Execute the commands in Examples 2 and 4.

wget http://optima.jrc.it/Resources/DCEP-2013/DCEP-index-scripts.tar.bz2
tar jxf DCEP-index-scripts.tar.bz2

You need to have Perl installed to run the script. Extract the parallel documents from the sentence-segmented documents using the multilingual document index:

bzcat cross-lingual-index.txt.bz2 | perl index-scripts/DCEPtobicorpus.pl DA LT DCEP/sentence unaligned-sentences

The output is two UTF-8 text files. The information about the source documents is lost in this output format. N.B. The sentences are not aligned.
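
As a quick sanity check you can compare the sizes of the two files. The file names below are hypothetical; the actual names (or output directory) follow from the output prefix passed to DCEPtobicorpus.pl:

# The line counts will normally differ because the sentences are not aligned yet
wc -l unaligned-sentences*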

Example 8: How to generate a sentence-aligned bilingual corpus?

The downloadable sentence alignments were produced with HunAlign. The details are not available at this time.
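
For orientation only, a minimal hedged sketch of what a HunAlign run over the Example 7 output could look like, assuming hunalign is installed and that an empty dictionary (i.e. purely length-based alignment) is acceptable; the input file names are hypothetical, and the settings behind the official alignments are not known:

# Hypothetical input names; substitute the files actually produced in Example 7
touch empty.dic
hunalign empty.dic unaligned-sentences.da unaligned-sentences.lt > alignment-ladder.txt
# The default output is hunalign's "ladder": sentence index pairs with confidence scores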