KOMBO: A New Korean Language Model Based on Hangeul's Subcharacter Rules
Researchers have developed KOMBO, a new framework for Korean pre-trained language models (PLMs) that draws on the invention principles of Hangeul as documented in King Sejong's 1446 work, Hunminjeongeum. Whereas existing Korean PLMs disregard these principles, KOMBO builds each character's representation by combining its subcharacters according to Hangeul's combination rules. The approach outperforms the state-of-the-art Korean PLM by an average of 2.11% across five Korean natural language understanding (NLU) tasks. Extensive experiments also indicate that KOMBO is effective for compression. The paper is available on arXiv under ID 2604.23948.
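To make the idea of subcharacters concrete, the sketch below splits a precomposed Hangeul syllable into its initial consonant, vowel, and optional final consonant using the standard Unicode syllable arithmetic, then recombines them. It is not KOMBO's method and says nothing about how the model represents or merges subcharacters; the function names `decompose` and `compose` are hypothetical and chosen only for illustration.

```python
# Illustration only: standard Unicode Hangul syllable arithmetic,
# NOT the KOMBO model. Function names are hypothetical.

SYLLABLE_BASE = 0xAC00  # code point of the first precomposed syllable
N_MEDIAL = 21           # number of vowels (jungseong)
N_FINAL = 28            # 27 final consonants (jongseong) plus "no final"

# Conjoining jamo tables taken from the Unicode Hangul Jamo block.
INITIALS = [chr(0x1100 + i) for i in range(19)]
MEDIALS = [chr(0x1161 + i) for i in range(N_MEDIAL)]
FINALS = [""] + [chr(0x11A8 + i) for i in range(N_FINAL - 1)]


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed syllable into (initial, medial, final)."""
    offset = ord(syllable) - SYLLABLE_BASE
    if not 0 <= offset < len(INITIALS) * N_MEDIAL * N_FINAL:
        raise ValueError(f"not a precomposed Hangeul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


def compose(initial: str, medial: str, final: str = "") -> str:
    """Combine subcharacters back into one precomposed syllable."""
    i = INITIALS.index(initial)
    m = MEDIALS.index(medial)
    f = FINALS.index(final)
    return chr(SYLLABLE_BASE + (i * N_MEDIAL + m) * N_FINAL + f)


if __name__ == "__main__":
    parts = decompose("한")
    print([f"U+{ord(p):04X}" for p in parts if p])
    print(compose(*parts))
```

Every one of the 11,172 precomposed syllables follows this fixed initial-vowel-final layout, which is the kind of regularity a subcharacter-based model can exploit.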
Key facts
- KOMBO is a novel framework for Korean pre-trained language models.
- It incorporates the invention principles of Hangeul from Hunminjeongeum (1446).
- Hangeul was devised by King Sejong.
- Existing Korean PLMs overlook these principles.
- KOMBO represents characters by combining subcharacters.
- It outperforms the state-of-the-art Korean PLM by 2.11% on average.
- Performance was measured across five Korean NLU tasks.
- The paper is on arXiv: 2604.23948.
Entities
Institutions
- arXiv