Speech Emotion Recognition via MFCC and LSTM
A new study on arXiv (2604.25938) presents a speech emotion recognition (SER) system built on Mel-Frequency Cepstral Coefficient (MFCC) features and a Long Short-Term Memory (LSTM) neural network. Working with the Toronto Emotional Speech Set (TESS), the system extracts MFCC features frame by frame to capture the spectral envelope of the speech signal, then feeds the resulting feature sequences into an LSTM model capable of learning long-term sequential patterns. The work highlights SER's growing importance in natural human-computer interaction, since emotions alter speech characteristics such as pitch, energy, and timing. Key challenges include speaker variability, differing recording conditions, and acoustic similarity between emotions. By combining MFCC feature extraction with the LSTM's sequential learning capability, the proposed method aims to improve detection accuracy.
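The paper's own code is not reproduced here; as an illustrative sketch of the standard MFCC pipeline the summary refers to (pre-emphasis, framing, windowing, mel filterbank, DCT), using NumPy and hypothetical parameter values (16 kHz audio, 13 coefficients), not the authors' exact configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Compute MFCC features: one row of n_mfcc coefficients per frame."""
    # Pre-emphasis boosts high frequencies before spectral analysis.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - n_fft) // hop
    frames = np.stack([emphasized[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies into cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (n_frames, n_mfcc)
```

Each utterance thus becomes a sequence of 13-dimensional feature vectors, which is the input format the LSTM consumes.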
Key facts
- arXiv paper 2604.25938 introduces an SER system using MFCC and LSTM.
- The system uses the Toronto Emotional Speech Set (TESS) dataset.
- MFCC features are extracted from pre-processed speech signals.
- An LSTM model learns long-term dependencies in the sequential audio features.
- SER detects human emotional states from speech for human-computer interaction.
- Emotions modify pitch, energy, and timing of speech.
- Challenges include speaker variability, differing recording conditions, and acoustic similarity between emotions.
- The work is published on arXiv with Announce Type: cross.
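To make the sequence-modeling step above concrete, here is a minimal NumPy forward pass of a single LSTM cell over a sequence of MFCC-like frames. The shapes (13 MFCCs per frame, 32 hidden units, 7 emotion classes as in TESS) are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def lstm_forward(x, W, U, b):
    """Run one LSTM layer over x of shape (T, d_in); return the final hidden state.

    W: (4h, d_in), U: (4h, h), b: (4h,) hold the stacked gate parameters
    in [input, forget, cell, output] order.
    """
    hidden = W.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x_t in x:
        z = W @ x_t + U @ h + b
        i = sigmoid(z[:hidden])                # input gate
        f = sigmoid(z[hidden:2 * hidden])      # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden])  # candidate cell state
        o = sigmoid(z[3 * hidden:])            # output gate
        c = f * c + i * g  # cell state carries long-range context across frames
        h = o * np.tanh(c)
    return h

# Hypothetical dimensions: 61 frames of 13 MFCCs, 32 hidden units, 7 emotions.
rng = np.random.default_rng(0)
d_in, hidden, T = 13, 32, 61
W = 0.1 * rng.standard_normal((4 * hidden, d_in))
U = 0.1 * rng.standard_normal((4 * hidden, hidden))
b = np.zeros(4 * hidden)
h_T = lstm_forward(rng.standard_normal((T, d_in)), W, U, b)
logits = rng.standard_normal((7, hidden)) @ h_T  # a softmax head would classify emotion
```

The forget gate's multiplicative update of the cell state is what lets the model retain information across many frames, which is why LSTMs suit emotion cues spread over a whole utterance.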
Entities
Institutions
- arXiv