ARTFEED — Contemporary Art Intelligence

OceanPile: Large-Scale Multimodal Corpus for Ocean AI

other · 2026-05-06

Researchers have introduced OceanPile, a large-scale multimodal corpus designed to overcome data fragmentation in ocean science. Ocean data are scattered across sources, noisy, and weakly labeled, which hinders AI applications. OceanPile includes OceanCorpus, which integrates sonar data, underwater imagery, and marine science visualizations under a unified schema. The dataset is intended to let Multimodal Large Language Models (MLLMs) tackle ocean-related tasks such as climate modeling and biodiversity monitoring. By consolidating these sources, the work targets data availability and quality, a bottleneck in applying AI to marine environments.
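To make the idea of a unified schema concrete, here is a minimal Python sketch of how records from different modalities might be normalized into one shape. It is an illustration only and is not taken from the OceanPile paper: the `OceanRecord` class, its field names, the `Modality` values, and the input dictionary keys are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    # The three modalities named in the summary; the enum itself is hypothetical.
    SONAR = "sonar"
    UNDERWATER_IMAGE = "underwater_image"
    SCIENCE_VISUALIZATION = "science_visualization"


@dataclass
class OceanRecord:
    """Hypothetical unified record; field names are illustrative only."""
    record_id: str
    modality: Modality
    source: str        # originating survey or dataset
    payload_uri: str   # location of the raw file
    caption: str = ""  # weak labels may be missing, so default to empty
    metadata: dict = field(default_factory=dict)


def from_sonar(raw: dict) -> OceanRecord:
    # Map an assumed sonar source layout onto the shared record shape.
    return OceanRecord(
        record_id=f"sonar-{raw['ping_id']}",
        modality=Modality.SONAR,
        source=raw["survey"],
        payload_uri=raw["file"],
        metadata={"frequency_khz": raw.get("freq_khz")},
    )


def from_image(raw: dict) -> OceanRecord:
    # Map an assumed image source layout onto the shared record shape.
    return OceanRecord(
        record_id=f"img-{raw['image_id']}",
        modality=Modality.UNDERWATER_IMAGE,
        source=raw["dataset"],
        payload_uri=raw["path"],
        caption=raw.get("label", ""),
        metadata={"depth_m": raw.get("depth_m")},
    )


# Made-up inputs, used only to show that both sources end up in one list.
records = [
    from_sonar({"ping_id": 7, "survey": "demo-survey", "file": "pings/7.bin", "freq_khz": 200}),
    from_image({"image_id": "a1", "dataset": "demo-reef", "path": "img/a1.jpg", "label": "coral"}),
]
```

Once every source is mapped to the same record type, downstream code can filter or batch by `modality` without knowing each source's original layout, which is the practical benefit a unified schema is meant to provide.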

Key facts

  • OceanPile is a large-scale multimodal corpus for ocean foundation models.
  • Ocean data are fragmented, multimodal, noisy, and weakly labeled.
  • OceanPile includes OceanCorpus, a unified collection of sonar data, underwater imagery, and marine science visualizations.
  • The dataset aims to equip MLLMs for ocean-science tasks such as climate modeling and biodiversity monitoring.
  • The work is published on arXiv under identifier 2605.00877v1.

Entities

Institutions

  • arXiv
