Copyright © 2013 Alexander Kobzar (sashakobzar@gmail.com)
GETALP
Grenoble Informatics Laboratory
France

INTRODUCTION
============

This manual shows how to set up a human evaluation experiment on the translation quality of phrasal verbs (PVs) in split word order (English-French). The scripts described below can also be used to prepare experiments for any other type of multi-word expression (MWE). Each script folder contains the test data used in the examples below and a README file describing the script in detail. If you are only going to employ automatic metrics (BLEU, METEOR, etc.) for your evaluation, you can use this manual just to prepare a corpus for building and testing a machine translation (MT) system.

In order to set up an experiment, you will need three scripts: csv2text.exe (optional), splitcorpus.exe, and mwe2blast.exe. There are also two additional scripts, filterblast.exe and joinblast.exe, which can improve your evaluation experiment.

You are expected to use mwetoolkit (http://mwetoolkit.sourceforge.net) to identify MWEs, Moses (www.statmt.org/moses) to build an MT system, and Blast (www.ida.liu.se/~sarst/blast or http://cameleon.imag.fr/xwiki/bin/view/Main/Phrasal_verbs_annotation) to conduct your evaluation experiment.

PREREQUISITES
=============

All scripts are written in C# and are compatible with .NET Framework 3.5 and later (csv2text with 2.0 and later).

Windows
-------

The scripts can be run on Windows XP and later. Version 3.5 of .NET Framework is included with all versions of Windows starting from Windows 7. If you have an older version of the framework, you can download a newer one at www.microsoft.com/net/downloads

Linux and Mac OS X
------------------

Mono is an implementation of .NET Framework for Linux and Mac OS. On Ubuntu/Debian, you can install it with the package manager.
If you use another Linux distribution or Mac OS, or if there is a version mismatch, you should download the latest Mono version for your distribution at http://www.go-mono.com/mono-downloads/download.html. Make sure you use the latest version available for your operating system.

1. TOKENIZATION AND EXTRACTION OF CANDIDATES
============================================

To extract PVs with mwetoolkit, you should preprocess your source text with either TreeTagger or Rasp. They tokenize text differently from the Moses tokenizer. In order to preserve the correspondence between the positions of the MWEs identified by mwetoolkit and the tokenized text, you should build the MT system from the same tokenized source text that mwetoolkit processed. Therefore, you should not tokenize the source text with the Moses tokenizer. The target text, however, should be tokenized with Moses.

Then you have to generate an XML corpus file from the parser's output. The whole process is described on the mwetoolkit website.

In order to extract plain tokenized text from an XML corpus file, you should first obtain a CSV file with xml2csv.py, located in the bin folder of mwetoolkit. The tokenized text is in the second column. On Linux, you can extract it with the following command:

    awk -F"\\t" '{print $2}' corpus.en.csv > corpus.en

You can also use the csv2text script to do it:

    [mono] csv2text.exe corpus.en.csv corpus.en

corpus.en then contains the tokenized corpus in plain text.

Then, you should extract PVs with mwetoolkit. You will use candidates.py to generate an XML file containing the information about the PVs in the source text. You have to use the -S option so that the file includes, for each PV, its sentence ID and the positions of its individual words.

2. SPLITTING THE CORPUS
=======================

The next step is to extract and translate the sentences containing PVs. To build a corresponding MT system, you need to divide your corpus into training and test sets.
The former will contain the sentences without PVs in split word order, and the latter the sentences with them. You can do this with the splitcorpus script. When you translate your test set with Moses, you should use the -alignment-output-file option of the decoder, which writes the word-to-word alignment information to a file. This information makes it possible to find the corresponding translations in the target text.

In order to train your MT system better, you have to extract a part of the test corpus (usually half) and add it to the training corpus, since the training corpus otherwise contains no sentences with PVs in split word order. However, you have to do this manually.

To set up an evaluation experiment, you should use the -blast option. If you want to extract a development set from your corpus, you should remove the same number of sentences (lines) from the beginning of both the source and the target text and use the -dev option to specify the number of lines you deleted.

    [mono] splitcorpus.exe corpus.en corpus.fr candidates.xml -moses -blast -dev 2000

This will create the files needed to train an MT system, the test corpus, and the reference translation: train.corpus.en, train.corpus.fr, test.corpus.en, test.corpus.fr. It will also create mwe2blast.txt, which mwe2blast needs to create a Blast file.

3. SETTING UP THE EXPERIMENT
============================

Once your test set is translated, you can compile a Blast annotation file for your evaluators. To do so, you need the source text file, its Moses-generated translation, and the corresponding alignment information file generated by Moses. You should also specify the mwe2blast.txt file generated by the previous script, as well as a category file containing the error typology that your evaluators will use to make their judgements.

    [mono] mwe2blast.exe source.en translated.fr alignment.out mwe2blast.txt Adequacy-category.blast test

This will create test.blast, containing the sentences from the source file accompanied by their translations.
The script aligns the words that constitute an MWE in a source sentence with their translation in the target sentence, based on the information in the mwe2blast.txt file and the alignment information generated by Moses.

3.1 FILTERING THE BLAST FILE
============================

Before starting the experiment, you may want to inspect the resulting Blast file with Blast in order to find sentences to exclude, for example sentences that were erroneously identified by mwetoolkit or that are too long. We wrote a script, filterblast.exe, which allows you to filter out sentences based on a number of criteria.

It is advisable to create a special item in the category file for marking irrelevant sentences. If you want to mark some sentences for deletion, you should annotate them with either that item or any item of the same level. Use the -exclude option to specify a level, an item, or both, in order to filter out the sentences carrying the specified annotation. Use the -maxlength option to exclude sentences based on their word count.

Since our experiment dealt with phrasal verbs (PVs) in split word order, the script is specifically tailored to sentences extracted because of PV misidentification. This is done by means of patterns that describe the most common cases of misidentification. The list of patterns is given in the README. If you do not work with this type of phrasal verb, you can disable the patterns with the -diswordfilt option.

The following command filters out all sentences that have more than 50 words, all sentences annotated with adeq-no, and all sentences that match one of the patterns of PV misidentification.

    [mono] filterblast.exe test.blast -maxlength 50 -exclude adeq-no

This will create filtered_test.blast, containing only the sentences that have passed all the filters, and will also report the number of sentences excluded by each criterion.
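The length and annotation filters described in section 3.1 can be illustrated with a minimal Python sketch. It is not the filterblast.exe implementation and does not read the Blast file format: each sentence is assumed to be a dictionary with a hypothetical "source" string and a hypothetical "annotations" list, and the PV misidentification patterns are not modelled.

```python
from collections import Counter


def filter_sentences(sentences, max_length=None, exclude=()):
    """Keep sentences that pass every filter and count exclusions per criterion.

    `sentences` is a list of dicts with the hypothetical keys "source"
    (tokenized text) and "annotations" (labels given in Blast).
    """
    kept = []
    excluded = Counter()
    exclude = set(exclude)
    for sent in sentences:
        n_words = len(sent["source"].split())
        if max_length is not None and n_words > max_length:
            excluded["maxlength"] += 1   # too many words
        elif exclude & set(sent["annotations"]):
            excluded["exclude"] += 1     # carries an excluded annotation
        else:
            kept.append(sent)
    return kept, excluded


if __name__ == "__main__":
    data = [
        {"source": "he looked the word up", "annotations": []},
        {"source": "she turned it down", "annotations": ["adeq-no"]},
    ]
    kept, counts = filter_sentences(data, max_length=50, exclude=["adeq-no"])
    print(len(kept), dict(counts))
```

A sentence is dropped as soon as one criterion applies, which matches the idea that only sentences passing all filters are kept.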
3.2 JOINING TWO BLAST FILES
===========================

Sometimes you may want to translate your source text using different MT approaches, e.g. phrase-based and hierarchical, and have the translations judged in order to find out to what extent one is better than the other. However, if your Blast file is quite large, the experiment may take a lot of time. Moreover, the translations of an MWE produced by two MT systems sometimes do not differ, i.e. the aligned words are the same. If we are not interested in the other parts of the translated sentence, we can judge only one translation and disregard the other. Translations with minor differences can usually be disregarded as well. Therefore, in order to reduce annotation time, one can merge two Blast files so that the same translation, or very similar ones, need not be annotated twice. Translations that differ significantly, however, must both be included in the new file.

The joinblast script allows you not only to merge two Blast files in this way but also to specify the degree of dissimilarity used to distinguish translations and to set limits on the number of equal and different translations included. The script uses the longest common substring algorithm to determine whether two translations are sufficiently different. Please consult the README for more details on how two translations are compared.

    [mono] joinblast.exe phrase.blast syntax.blast final -equallimit 50 -diflimit 75 -dissimdegree 2.5 -skiptransl either

This will create final.blast, containing 50 source sentences whose translations do not differ (25 translations from each file, chosen randomly) and another 150 sentences to judge, corresponding to 75 source sentences. The two different translations of a source sentence are placed consecutively and in random order. Translations that are not completely equal but do not satisfy the dissimilarity criterion are excluded.
An XML file is also created, recording the origin of each sentence, whether its translation differs from the one in the other file, and its index in both the original and the merged Blast files.
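The longest common substring comparison mentioned in section 3.2 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the comparison works on whitespace-separated tokens, and the decision rule in are_different (length of the longer translation divided by the length of the common run, compared with a degree value) is a hypothetical criterion, not necessarily the one joinblast.exe applies. The README describes the actual comparison.

```python
def longest_common_substring(a, b):
    """Length of the longest run of consecutive tokens shared by lists a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the common run that ended at the previous tokens.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def are_different(t1, t2, degree):
    """Hypothetical rule: different if longer length / common run >= degree."""
    a, b = t1.split(), t2.split()
    if a == b:
        return False
    lcs = longest_common_substring(a, b)
    if lcs == 0:
        return True  # nothing in common at all
    return max(len(a), len(b)) / lcs >= degree


if __name__ == "__main__":
    x = "il a cherché le mot"
    y = "il a regardé le mot"
    # The longest shared run is two tokens long ("il a" or "le mot").
    print(longest_common_substring(x.split(), y.split()))
    print(are_different(x, y, 2.5))
```

Working on tokens rather than characters keeps the comparison aligned with the tokenized text used in the rest of the pipeline; a character-based variant would only require passing strings instead of token lists to longest_common_substring.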