Research

Research Interests

  • NLP methodologies for low-resource scenarios, including low-resource languages and domain-specific settings
  • Nuanced evaluation and analysis of AI in multilingual and non-English settings, integrating linguistic and computational expertise
  • Systematic study of linguistic and cultural bias in AI systems
  • Computational terminology and translation technology in domain-specific contexts

View All Publications on Google Scholar

Research Resources

ACTER Dataset

The Annotated Corpora for Term Extraction Research (ACTER) is a manually annotated dataset for automatic term extraction. It consists of 12 specialised corpora across 4 domains (corruption, dressage, heart failure, wind energy) and 3 languages (English, French, Dutch), with over 100,000 manual annotations. ACTER has become a reference dataset for term extraction research in the NLP community and was the basis for the TermEval shared task I organised in 2020.

D-Terminer Demo

An online demonstration of automatic term extraction research, showcasing both monolingual and bilingual term extraction.

Current Projects

FRQ-IVADO Research Chair (2026-2031)

At the Crossroads of Languages and AI: Towards a Synergy Between Language Expertise and Computational Innovation

A five-year research program exploring the intersection of linguistic expertise and computational innovation in AI systems, with a focus on multilingual NLP and responsible AI development.

Translation and Contamination (2025-2027)

Evaluation of the Translation Capabilities and English Bias in Large Language Models

Funded by the Fonds de Recherche du Québec - Nature et Technologies. Investigating how English bias manifests in large language models and how it affects their translation capabilities, particularly for French and other non-English languages.

PoS Tagging and Lemmatisation for Chuj (2025-2026)

In collaboration with Prof. Justin Royer. Funded by IVADO Regroupement 3 (NLP). Developing NLP tools for Chuj, an understudied Mayan language, addressing critical challenges in low-resource language processing.

Past Projects

EXTRACT: Extracting Terminology from Comparable Texts (2017-2021)

FWO PhD Fellowship. Created the ACTER dataset and developed supervised machine learning methodologies for automatic term extraction, including the HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology) system.

SCATE: Smart Computer-Aided Translation Environment (2015-2017)

FWO-SBO project. Work package on automatic term extraction, focusing on data creation for monolingual and multilingual terminology extraction to enhance translation tools.

Medical Terminology and Translation (2018-2020)

Collaboration with medical organisations ebpracticenet and Iscientia. Investigated optimal translation and post-editing procedures for medical guidelines, comparing different translator profiles (language specialists vs. domain specialists). Developed and validated multilingual automatic term extraction methods to improve search functionality on medical websites. Research also explored corpus design for medical terminology extraction, comparing large web corpora with smaller focused corpora.