You will also find further useful linguistic resources at https://ec.europa.eu/jrc/en/language-technologies.
wget http://optima.jrc.it/Resources/DCEP-2013/source/DCEP-source-SV-pub.tar.bz2
tar jxf DCEP-source-SV-pub.tar.bz2
The source documents are now in the ./DCEP/source/{xml,sgml}/SV/
subdirectories.
wget http://optima.jrc.it/Resources/DCEP-2013/cross-lingual-index.txt.bz2
The alignment information contains correspondences between documents
in different languages. Each line lists parallel documents in the
different languages. By changing the file suffixes, the index in
cross-lingual-index.txt.bz2 can be applied to the source documents,
to the stripped documents with the markup removed, and to the
documents with sentence segmentation.
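Because each line of the index describes one group of parallel documents, counting the lines gives a quick overview of its size. The following is a minimal sketch; the function name count_index_entries is hypothetical, and the sketch assumes nothing about the field layout within a line, which is not described here.

```shell
# Count the non-empty lines arriving on standard input.
# For the index, this is the number of parallel-document groups.
count_index_entries() {
    grep -c .
}

# Example (not run here; needs the downloaded index and bzcat):
# bzcat cross-lingual-index.txt.bz2 | count_index_entries
# bzcat cross-lingual-index.txt.bz2 | head -n 3    # preview the first entries
```

Previewing a few lines with head before writing any processing script shows which file suffixes the index actually uses.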
wget http://optima.jrc.it/Resources/DCEP-2013/strip/DCEP-strip-DA-pub.tar.bz2
wget http://optima.jrc.it/Resources/DCEP-2013/strip/DCEP-strip-LT-pub.tar.bz2
tar jxf DCEP-strip-DA-pub.tar.bz2
tar jxf DCEP-strip-LT-pub.tar.bz2
The documents with markup removed are now in the ./DCEP/strip/{xml,sgml}/{DA,LT}/
subdirectories.
wget http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-DA-pub.tar.bz2
wget http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-LT-pub.tar.bz2
tar jxf DCEP-sentence-DA-pub.tar.bz2
tar jxf DCEP-sentence-LT-pub.tar.bz2
The sentence-segmented documents are now in the ./DCEP/sentence/{xml,sgml}/{DA,LT}/
subdirectories.
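The download commands above follow a regular naming pattern, so they can be wrapped in a small helper when several languages are needed. This is a sketch only: the function names dcep_url and dcep_fetch are hypothetical, and it assumes that archives for other language codes follow the same pattern as the SV, DA and LT examples shown here.

```shell
# Build the download URL for one DCEP archive.
# kind: source | strip | sentence ; lang: language code such as SV, DA, LT
dcep_url() {
    kind=$1
    lang=$2
    base=http://optima.jrc.it/Resources/DCEP-2013
    case $kind in
        sentence) dir=sentences ;;   # the server directory is plural
        *)        dir=$kind ;;
    esac
    printf '%s/%s/DCEP-%s-%s-pub.tar.bz2\n' "$base" "$dir" "$kind" "$lang"
}

# Download one archive and unpack it into the current directory.
dcep_fetch() {
    url=$(dcep_url "$1" "$2")
    wget "$url" && tar jxf "${url##*/}"
}

# Example (not run here): fetch stripped and sentence-segmented DA and LT files.
# for lang in DA LT; do
#     dcep_fetch strip "$lang"
#     dcep_fetch sentence "$lang"
# done
```

Keeping the URL construction separate from the download makes it possible to print and check the URLs before fetching anything.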
The documents with the markup removed are already available for download. This example only demonstrates the process with some of the Swedish documents.
Prerequisites: Execute the commands in Example 1.
wget http://optima.jrc.it/Resources/DCEP-2013/DCEP-strip-scripts.tar.bz2
tar jxf DCEP-strip-scripts.tar.bz2
You need to have Perl and a command-line XSLT processor installed to run the scripts. Remove the markup and insert newlines in all Swedish XML documents for scenario TA:
# Generate all document-type specific scenarios and XSLT
./strip-scripts/xslt-gen-nogz
# Process TA documents using the specific scenario
find DCEP/source/xml/SV/TA -type f -print0 | xargs -0 -n1 -P 8 ./strip-scripts/gen-scripts/xml/TA/scenario
The documents with markup removed are now located under the directories
./DCEP/strip/{xml,sgml}/SV/. This simple example processed only
documents with the scenario TA. N.B. The scripts insert line breaks,
for example, to separate the elements of a list. See the file
strip-scripts/README for more information.
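The prerequisites for this example (Perl and a command-line XSLT processor) can be checked before the scripts are run. The sketch below uses a hypothetical helper, require_cmds. The processor name xsltproc in the usage comment is an assumption; consult strip-scripts/README for the processor the scripts actually expect.

```shell
# Report any required commands that are not on PATH.
# Returns 0 when all are present, 1 otherwise.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here). xsltproc is an assumed processor name.
# require_cmds perl xsltproc || exit 1
```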
The sentence-segmented documents are already available for download. This example only demonstrates the process with some of the Swedish documents. If you wish to use your own sentence segmentation script, please keep in mind that the line breaks generated by the markup removal signify sentence breaks, but there can still be multiple sentences per line.
Prerequisites: Execute the commands in Examples 1 and 5.
wget http://optima.jrc.it/Resources/DCEP-2013/DCEP-sentence-scripts.tar.bz2
tar jxf DCEP-sentence-scripts.tar.bz2
You need to have Perl installed to run the scripts. Apply the sentence-segmentation script to each file:
find DCEP/strip/ -type f -regex "^DCEP/strip/\(xml\|sgml\)/SV/.*" | while IFS= read -r infile; do
    # Compute output path
    outfile=$(echo "$infile" | sed 's,^DCEP/strip/,DCEP/sentence/,')
    # Create necessary directories
    mkdir -p "$(dirname "$outfile")"
    # Do sentence segmentation (stdin comes from the file, not from the pipe)
    perl sentence-scripts/split-sentences-DCEP.pl -l sv <"$infile" >"$outfile" 2>/dev/null
done
The sentence-segmented files are now under the directories
./DCEP/sentence/{xml,sgml}/SV/. This simple example can be slow
because each file is processed individually. The script
sentence-scripts/convert_all.sh uses a faster method that runs only
one process per language for the sentence segmentation script.
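Since the loop above discards error messages, it can be worth checking afterwards that every stripped file received a non-empty sentence-segmented counterpart. The following sketch uses a hypothetical helper, missing_sentence_files. It assumes that the output tree mirrors the input tree, as in the loop above, and that both roots are given without a trailing slash. A stripped file that is itself empty would also be reported.

```shell
# List stripped files whose sentence-segmented counterpart is missing or empty.
# $1: root of the stripped tree, e.g. DCEP/strip
# $2: root of the sentence tree, e.g. DCEP/sentence
missing_sentence_files() {
    strip_root=$1
    sentence_root=$2
    find "$strip_root" -type f | while IFS= read -r infile; do
        # Swap the root directory and keep the rest of the path.
        outfile="$sentence_root/${infile#"$strip_root"/}"
        [ -s "$outfile" ] || printf '%s\n' "$infile"
    done
}

# Example (not run here):
# missing_sentence_files DCEP/strip DCEP/sentence
```

An empty result means that every stripped file has a non-empty output file; it does not verify the quality of the segmentation.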
Prerequisites: Execute the commands in Examples 2 and 4.
wget http://optima.jrc.it/Resources/DCEP-2013/DCEP-index-scripts.tar.bz2
tar jxf DCEP-index-scripts.tar.bz2
You need to have Perl installed to run the script. Extract parallel documents using the multilingual document index from the sentence-segmented documents:
bzcat cross-lingual-index.txt.bz2 | perl index-scripts/DCEPtobicorpus.pl DA LT DCEP/sentence unaligned-sentences
The output is two UTF-8 text files. The information about the source documents is lost in this output format. N.B. The sentences are not aligned.
The downloadable sentence-alignments were produced with HunAlign. The details are not available at this time.