ARTFEED — Contemporary Art Intelligence

NorBERTo: New Portuguese Encoder Model Trained on 331B Tokens

ai-technology · 2026-05-04

Researchers have unveiled NorBERTo, a state-of-the-art encoder-only language model for Portuguese built on the ModernBERT architecture, which supports long contexts and uses efficient attention mechanisms. The model was trained on Aurora-PT, a newly assembled Brazilian Portuguese corpus of 331 billion GPT-2 tokens drawn from web sources and existing multilingual datasets. In evaluations on the PLUE benchmark, NorBERTo-large led encoder models on two of its tasks, scoring 0.9191 F1 on MRPC and 0.7689 accuracy on RTE; it also achieved the highest entailment F1 on ASSIN 2, at roughly 0.904. NorBERTo builds on earlier Portuguese encoders such as BERTimbau and Albertina PT-BR.
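The headline numbers are standard classification metrics. As a minimal illustration (not the paper's actual evaluation code), binary F1, as reported for MRPC and the ASSIN 2 entailment class, is the harmonic mean of precision and recall over the positive class:

```python
def f1_score(golds, preds, positive=1):
    """Binary F1 for the positive class: 2PR / (P + R)."""
    tp = sum(1 for g, p in zip(golds, preds) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(golds, preds) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(golds, preds) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy MRPC-style labels (1 = paraphrase, 0 = not); values are illustrative only.
golds = [1, 1, 0, 1, 0, 1]
preds = [1, 1, 0, 0, 0, 1]
print(round(f1_score(golds, preds), 4))  # → 0.8571
```

Accuracy, as used for RTE, is simply the fraction of examples labeled correctly, which is why the two tasks report different metrics: MRPC's classes are imbalanced, so F1 is the more informative score there.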

Key facts

  • NorBERTo is based on ModernBERT architecture.
  • Trained on Aurora-PT corpus with 331 billion GPT-2 tokens.
  • Aurora-PT is a Brazilian Portuguese corpus from web sources and multilingual datasets.
  • NorBERTo-large achieves 0.9191 F1 on MRPC (PLUE).
  • NorBERTo-large achieves 0.7689 accuracy on RTE (PLUE).
  • NorBERTo-large achieves ~0.904 entailment F1 on ASSIN 2.
  • Model builds on BERTimbau and Albertina PT-BR.
  • Supports long contexts via efficient attention mechanisms.
