Parallel corpora and WordSpace models: using a third language as an interlingua to enrich multilingual resources

Published Online: pp 299-313 https://doi.org/10.1504/IJICT.2011.043626

In this paper we describe an automatic method for enriching multilingual resources using WordSpace models and parallel corpora. We show that translation resources for a pair of languages can be enhanced without a direct parallel corpus, by using instead a third language for which parallel corpora with each of the two target languages are available. First, we demonstrate the effectiveness of the proposed bilingual model; we then show how this model can be used, through an interlingua model, to overcome the lack of resources such as parallel dictionaries and corpora for less widespread languages. Our experiments show that the precision of word-pair translation through the interlingua model reaches ~85%, compared to ~90% for the direct bilingual model.
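
The pivoting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy English-French and French-German corpora, the function names, and the choice of plain segment-count vectors with cosine similarity are all assumptions made for illustration, and the random projection step named in the keywords is omitted. Each word is represented by the aligned segments it occurs in, so words from two languages of the same parallel corpus share one vector space.

```python
from collections import defaultdict
from math import sqrt


def segment_vectors(segments):
    """Map each word to a sparse vector of counts over aligned-segment indices."""
    vectors = defaultdict(lambda: defaultdict(int))
    for idx, segment in enumerate(segments):
        for word in segment.split():
            vectors[word][idx] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(count * v.get(idx, 0) for idx, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_translation(word, source_vectors, target_vectors):
    """Return the target word whose segment vector is closest to the source word's."""
    if word not in source_vectors:
        return None
    query = source_vectors[word]
    return max(target_vectors, key=lambda cand: cosine(query, target_vectors[cand]))


def pivot_translation(word, src, pivot_a, pivot_b, tgt):
    """Translate source -> pivot with one corpus, then pivot -> target with another."""
    pivot_word = best_translation(word, src, pivot_a)
    if pivot_word is None or pivot_word not in pivot_b:
        return None
    return best_translation(pivot_word, pivot_b, tgt)


# Hypothetical toy corpora: segment i of one list is aligned with segment i of its pair.
en_vecs = segment_vectors(["cat sleeps", "dog sleeps", "cat eats"])
fr_vecs_a = segment_vectors(["chat dort", "chien dort", "chat mange"])
fr_vecs_b = segment_vectors(["chat noir", "chien noir", "chat blanc"])
de_vecs = segment_vectors(["katze schwarz", "hund schwarz", "katze weiss"])

result = pivot_translation("cat", en_vecs, fr_vecs_a, fr_vecs_b, de_vecs)
print(result)
```

Note that no English-German corpus is used: the two steps rely only on the English-French and French-German pairs, which is the situation the interlingua model is meant to address. Errors in the first step carry over to the second, which is consistent with the lower precision reported for the interlingua model.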

Keywords

WordSpace model, parallel corpora, random projection, multilingual ontology, synonymy
