13 It is of course possible that a topic in the machine-translated model is paired with several different topics in the gold standard model. For example, while the stem "agriculture" yields the pairing 12-45, the stem "farmer" might load most heavily on topic 12 in the machine-translated model but on topic 33 in the gold standard model, resulting in two different pairings for topic 12 of the machine-translated model (i.e., 12-45 and 12-33). In these cases, we keep the pairing with the largest number of topic pairs and ignore the other. This yields topic pairs that always consist of the two topics that share the largest number of their highest-loading words.

Another issue relevant to this study concerns the influence of specific languages and language groups on the quality of machine translation. For example, machine-translated texts may be of better quality when translated from French to English than from Polish to English. There are two reasons for this. First, some language pairs are simply easier to translate than others (Koehn and Monz 2006). Second, for some language pairs, larger parallel corpora are available for training machine translation models than for others (e.g., more parallel data are available for French and English than for Polish and English). To investigate this possibility, we include languages from different language groups in our analysis: French and Spanish (belonging to the Italic language group), German and Danish (belonging to the Germanic language group), and Polish (belonging to the Balto-Slavic language group). Footnote 5
This paper evaluates the usefulness of machine translation for bag-of-words models. Footnote 1 We identify and evaluate four reasons why the meaning of a text may be lost in translation. First, a general problem arises when words or stems in the machine-translated documents are translated differently than in the gold standard documents, resulting in different term-document matrices (TDMs). Footnote 2 We assess this problem by comparing the overlap between the gold standard and machine-translated TDMs.

The other translation problems relate specifically to LDA topic modeling, a popular bag-of-words model that identifies the topics in a corpus and assigns documents and words to those topics. Here, translation problems may arise because (1) topics in the machine-translated corpus may be distributed differently than in the gold standard corpus, (2) machine-translated documents may be assigned to different topics than their gold standard counterparts, and (3) a topic in the machine-translated corpus may consist of different words than the same topic in the gold standard corpus. We evaluate each problem by systematically comparing topic models estimated on machine-translated documents with those estimated on human-translated (gold standard) documents.
11 Recent research shows that seemingly innocuous preprocessing steps can influence the results of (unsupervised) automated text analysis (Denny and Spirling 2018; Greene et al. 2016). However, our comparison is between gold standard texts and machine-translated texts, to which we applied identical preprocessing steps. While we cannot be sure, we do not expect these preprocessing steps to have affected the two corpora in systematically different ways. We should note that removing stop words generally affects the distribution of words and topics in a topic model, and this also applies to the model results presented here. Since stop words, by definition, carry no topical content, we expect their removal to have had minimal substantive impact.

We first compare, at the document level, the machine-translated and gold standard bags of words using the similarity function built into the quanteda R package (Benoit and Nulty 2013). Figure 3 shows the distribution of cosine similarity values for each language.
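As a rough illustration of this document-level comparison, the sketch below computes the cosine similarity between the term-frequency vectors of two documents in plain Python. It is not the quanteda implementation used in the analysis; the function name `cosine_similarity` and the toy stemmed tokens are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


# Hypothetical, already preprocessed tokens for one document pair.
gold = "farmer support agricultur polici fund farmer".split()
machine = "farmer support agricultur polici money farmer".split()

similarity = cosine_similarity(gold, machine)
print(round(similarity, 3))
```

Applying such a function to every pair of gold standard and machine-translated documents would give a distribution of similarity values like the one summarized per language.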
Most notably, the average similarity between the gold standard documents and their machine-translated counterparts is very high ($M = 0.92$, $SD = 0.07$). In addition, more than 92% of all document pairs have a cosine similarity of 0.80 or higher. These results show that the TDMs of the machine-translated and gold standard documents are very similar: in most cases, stems occur in machine-translated and gold standard documents with (approximately) the same frequency.

Authors' note: Replication code and data are available in the Political Analysis Dataverse (De Vries, Schoonvelde, and Schumacher 2018), while supplementary materials for this article are available on the Political Analysis website. The authors would like to thank James Cross, Aki Matsuo, Christian Rauh, Damian Trilling, Mariken van der Velden and Barbara Vis for their helpful comments and suggestions. GS and MS acknowledge funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 649281, EUENGAGE. EdV acknowledges funding for a research assistant position from the Access Europe Research Centre (since 2018: UVAccess Europe) at the University of Amsterdam.
Figure 4. Unique TDM features for the gold standard and machine-translated corpora. Reading example: for French, the number of overlapping features is about 28,000, while the total number of features is about 33,000 for the machine-translated documents and about 38,000 for the gold standard documents.

When using bag-of-words models, it is common to preprocess the data to reduce noise. In our case, we removed punctuation, numbers, and general stop words; all remaining words were lowercased and stemmed. The preprocessing steps for the gold standard text and the machine-translated text are identical, and were applied to the translated texts. Footnote 11 To carry out these preprocessing steps, we used both Python and R libraries. For stemming, lowercasing, and the removal of stop words, numbers, and punctuation, we used regular expressions in Python and the NLTK package (Bird, Klein, and Loper 2009).
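The preprocessing steps and the feature counts in the Figure 4 reading example can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the replication code: the short stop word list, the function names `preprocess` and `feature_overlap`, and the example sentences are all hypothetical, and stemming is left as a pluggable function that defaults to doing nothing.

```python
import re

# Tiny illustrative stop word list (assumption); a real pipeline would use a
# fuller list such as the one shipped with NLTK.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "are", "for"})


def preprocess(text, stem=lambda token: token, stop_words=STOP_WORDS):
    """Lowercase, strip numbers and punctuation, drop stop words, then stem."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    tokens = [t for t in text.split() if t not in stop_words]
    return [stem(t) for t in tokens]


def feature_overlap(docs_a, docs_b):
    """Return (unique features in A, unique features in B, shared features)."""
    vocab_a = {token for doc in docs_a for token in doc}
    vocab_b = {token for doc in docs_b for token in doc}
    return len(vocab_a), len(vocab_b), len(vocab_a & vocab_b)


# Hypothetical one-document corpora standing in for the two translations.
gold_docs = [preprocess("The farmers received 1,000 euros in support.")]
machine_docs = [preprocess("Farmers got 1000 euros of support!")]

n_gold, n_machine, n_shared = feature_overlap(gold_docs, machine_docs)
print(n_gold, n_machine, n_shared)
```

To add stemming, one could pass a stemmer through the `stem` argument, for example `nltk.stem.PorterStemmer().stem` if NLTK is installed. The choice of the Porter stemmer is an assumption here; the text only states that NLTK was used.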
To create the TDMs, we switched to R and the quanteda package (Benoit and Nulty 2013). Footnote 12 We compare the TDMs of the machine-translated and gold standard documents and also use them as input for the topic models described below. Readers who are primarily interested in our TDM comparison can skip the next section, which provides more technical detail on the specification of our topic models.

Our next challenge is to match the topics generated by the gold standard and machine-translated models. This is necessary because the ordering of topics can differ between the two models (that is, topic 1 in the machine-translated model may best match, for example, topic 2 in the gold standard model). Our matching procedure is as follows: for each stem, we find the topic on which it loads most highly in the machine-translated topic model and in the gold standard topic model.
Take, for example, the stem "agriculture". This stem loads most highly on topic 12 in the machine-translated model and on topic 45 in the gold standard model. This results in the topic pairing 12-45 for this particular stem. Next, we count the topic pairings across all shared stems. We match topics based on the largest number of topic pairings. For example, we match topic 12 of the machine-translated model with topic 45 of the gold standard model because they share the largest number of highest-loading stems, such as "agriculture" (see the supplementary appendix for a numerical example of our topic matching procedure). Footnote 13 With this procedure, we matched 90 topics for the German corpus and 89 topics for all other languages. Footnote 14, Footnote 15

To be clear, we do not mean to imply that machine translation is unhelpful for analyzing texts in multiple languages. As Lotz and Van Rensburg (2014) show, machine translation systems are developing rapidly, and their quality improves significantly over time. Balahur and Turchi (2014) provide a comprehensive overview of the use of machine-translated text for automated analysis in the context of sentiment analysis, and Courtney et al. (2017) find that machine-translated newspaper articles can be reliably classified by human coders. But while these articles are highly relevant, they do not evaluate the implications of machine translation for bag-of-words methods in general.
The same goes for Lucas et al. (2015), who write extensively about the possible pitfalls of analyzing machine-translated texts, but do not empirically evaluate their quality.
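The stem-based topic matching procedure described earlier can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the input format (a dictionary mapping each stem to its loadings per topic), the function names, the toy loadings, and in particular the greedy one-to-one assignment used to resolve conflicting pairings are all assumptions.

```python
from collections import Counter


def top_topic(loadings):
    """Return the topic id on which a stem loads most highly."""
    return max(loadings, key=loadings.get)


def match_topics(mt_loadings, gold_loadings):
    """Match machine-translated topics to gold standard topics.

    Each shared stem votes for one (machine-translated topic, gold topic)
    pairing; pairings are then accepted from most to least frequent, skipping
    any pairing whose topics are already matched (greedy, an assumption).
    """
    shared_stems = mt_loadings.keys() & gold_loadings.keys()
    pair_counts = Counter(
        (top_topic(mt_loadings[stem]), top_topic(gold_loadings[stem]))
        for stem in shared_stems
    )
    # Sort by descending count, then by pairing, so ties resolve deterministically.
    ranked = sorted(pair_counts.items(), key=lambda item: (-item[1], item[0]))
    matches, used_gold = {}, set()
    for (mt_topic, gold_topic), _count in ranked:
        if mt_topic not in matches and gold_topic not in used_gold:
            matches[mt_topic] = gold_topic
            used_gold.add(gold_topic)
    return matches


# Hypothetical loadings: stem -> {topic id: loading}.
mt = {
    "agricultur": {12: 0.9, 3: 0.1},
    "crop": {12: 0.7, 3: 0.2},
    "farmer": {12: 0.8, 3: 0.1},
    "tax": {3: 0.6, 12: 0.1},
}
gold = {
    "agricultur": {45: 0.8, 33: 0.1},
    "crop": {45: 0.6, 33: 0.3},
    "farmer": {33: 0.5, 45: 0.4},
    "tax": {7: 0.9, 45: 0.05},
}

matches = match_topics(mt, gold)
print(matches)
```

In this toy example, topic 12 receives two votes for the pairing 12-45 and one for 12-33, so the less frequent pairing is ignored, mirroring the case discussed in footnote 13.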