Cross-Lingual Alignment of Word & Sentence Embeddings
[Thesis]
Aldarmaki, Hanan
Diab, Mona T.
The George Washington University
2019
113
Ph.D.
The George Washington University
2019
One of the notable developments in current natural language processing is the practical efficacy of probabilistic word representations, where words are embedded in high-dimensional continuous vector spaces that are optimized to reflect their distributional relationships. For sequences of words, such as phrases and sentences, distributional representations can be estimated by combining word embeddings using arithmetic operations like vector averaging or by estimating composition parameters from data using various objective functions. The quality of these compositional representations is typically estimated by their performance as features in extrinsic supervised classification benchmarks. Word and compositional embeddings for a single language can be induced without supervision using a large training corpus of raw text. To handle multiple languages and dialects, bilingual dictionaries and parallel corpora are often used for learning cross-lingual embeddings directly or to align pre-trained monolingual embeddings. In this work, we explore and develop various cross-lingual alignment techniques, compare the performance of the resulting cross-lingual embeddings, and study their characteristics. We pay particular attention to the bilingual data requirements of each approach since lower requirements facilitate wider language expansion. To begin with, we analyze various monolingual general-purpose sentence embedding models to better understand their qualities. By comparing their performance on extrinsic evaluation benchmarks and unsupervised clustering, we infer the characteristics of the most dominant features in their respective vector spaces. We then look into various cross-lingual alignment frameworks with different degrees of supervision. We begin with unsupervised word alignment, for which we propose an approach for inducing cross-lingual word mappings with no prior bilingual resources. We rely on assumptions about the consistency and structural similarities between the monolingual vector spaces of different languages. Using comparable monolingual news corpora, our approach resulted in highly accurate word mappings for two language pairs: French to English, and Arabic to English. With various refinement heuristics, the performance of the unsupervised alignment methods approached the performance of supervised dictionary mapping. Finally, we develop and evaluate different alignment approaches based on parallel text. We show that incorporating context in the alignment process often leads to significant improvements in performance. At the word level, we explore the alignment of contextualized word embeddings that are dynamically generated for each sentence. At the sentence level, we develop and investigate three alignment frameworks: joint modeling, representation transfer, and sentence mapping, applied to different sentence embedding models. We experiment with a matrix factorization model based on word-sentence co-occurrence statistics, and two general-purpose neural sentence embedding models. We report the performance of the various cross-lingual models with different sizes of parallel corpora to assess the minimal degree of supervision required by each alignment framework.