TailedTS: A Benchmark for Heavy-Tailed Time Series Prediction
TailedTS is a new benchmark dataset built from hourly Wikipedia page view data covering all of 2024. It is designed to evaluate time series forecasting models under challenging conditions: heavy tails, zero inflation, and non-Gaussian distributions. The dataset contains approximately 24.69 billion data points, covers roughly 3 million distinct Wikipedia pages each month, and is stored in Apache Parquet format. Wikipedia traffic follows a strong power-law distribution, with about 5% of pages generating over 70% of total views. This offers a rigorous setting for testing model robustness to extreme fluctuations, in contrast to existing datasets such as M4, M5, and UCI electricity. TailedTS also supports other research tasks, including a framework for periodicity quantification based on sparse autoregression with sparsity and non-negativity constraints.
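The concentration figure quoted above (about 5% of pages generating over 70% of views) and the zero-inflation property can both be measured directly from per-page counts. The following is a minimal sketch in Python with NumPy, not the benchmark's own tooling: the function names `top_share` and `zero_fraction` are hypothetical, and the inputs are synthetic Pareto and Poisson draws rather than real TailedTS records.

```python
import numpy as np

def top_share(total_views, top_fraction=0.05):
    """Share of all views captured by the most-viewed `top_fraction` of pages."""
    views = np.sort(np.asarray(total_views, dtype=float))[::-1]  # descending
    k = max(1, int(round(top_fraction * views.size)))            # pages in the head
    total = views.sum()
    return views[:k].sum() / total if total > 0 else 0.0

def zero_fraction(series):
    """Fraction of time steps with zero views (a simple zero-inflation measure)."""
    return float(np.mean(np.asarray(series) == 0))

# Synthetic stand-ins (assumptions, not TailedTS data):
rng = np.random.default_rng(0)
page_totals = rng.pareto(1.2, size=100_000) + 1.0  # heavy-tailed yearly totals
hourly = rng.poisson(0.3, size=24 * 7)             # one week of a low-traffic page

print(f"share of views in top 5% of pages: {top_share(page_totals):.2%}")
print(f"fraction of zero-view hours:       {zero_fraction(hourly):.2%}")
```

On real data, `page_totals` would be the per-page sum of hourly views and `hourly` a single page's series. A top-5% share far above 5% indicates the kind of concentration described above.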
Key facts
- TailedTS is a benchmark dataset for heavy-tailed time series prediction.
- Derived from hourly Wikipedia page views throughout 2024.
- Contains approximately 24.69 billion data points.
- Spans roughly 3 million unique Wikipedia pages per month.
- Stored in Apache Parquet format.
- Wikipedia traffic follows a power-law distribution: about 5% of pages account for over 70% of views.
- Designed to test models under heavy-tailed, zero-inflated, non-Gaussian conditions.
- Includes a periodicity quantification framework using sparse autoregression.
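The periodicity quantification idea can be illustrated with a small sketch. The summary only states that the framework uses sparse autoregression with sparsity and non-negativity constraints, so everything concrete here is an assumption: an L1 penalty stands in for the sparsity constraint, the solver is plain projected proximal gradient descent, and the names `lagged_design` and `sparse_nonneg_ar` are hypothetical. The idea is that a large non-negative weight on lag k suggests a period of k steps.

```python
import numpy as np

def lagged_design(x, order):
    """Build (X, y) where row t of X is [x[t-1], ..., x[t-order]] and y[t] = x[t]."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[order - k : len(x) - k] for k in range(1, order + 1)])
    return X, x[order:]

def sparse_nonneg_ar(x, order, l1=0.1, n_iter=2000):
    """Minimize (1/2n)||y - Xw||^2 + l1 * sum(w) subject to w >= 0.

    The L1 penalty is an assumed surrogate for a sparsity constraint.
    """
    X, y = lagged_design(x, order)
    n = len(y)
    # Step size 1/L, with L the Lipschitz constant of the least-squares gradient.
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(order)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        # Gradient step, L1 shrinkage, then projection onto w >= 0.
        w = np.maximum(w - step * (grad + l1), 0.0)
    return w

# Toy series (an assumption, not TailedTS data): a spike every 24 steps,
# mimicking a daily cycle in hourly counts.
series = np.zeros(24 * 20)
series[::24] = 10.0

weights = sparse_nonneg_ar(series, order=30)
dominant_lag = int(np.argmax(weights)) + 1  # lags are 1-indexed
print("dominant lag:", dominant_lag)
```

Non-negativity keeps the weights interpretable as contributions from past values, and the penalty drives most of them to zero so that only a few lags stand out. The paper's actual formulation may differ, for example by using an explicit limit on the number of non-zero weights.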
Entities
Institutions
- arXiv