Mood for Music: Emotion Recognition on Acoustic Features
Our emotional response to a music fragment depends on a large set of external factors, such as gender, age, culture, and context. However, these external variables set aside, humans consistently categorise songs as being happy, sad, enthusiastic or relaxed. We’ve developed an algorithm that predicts how a song will be emotionally perceived.
Music recommenders such as those used by Spotify usually rely only on a user’s past preferences to predict which songs would be of interest to that user. Such an approach is called collaborative filtering and often tries to discover relations between users and songs by means of matrix factorization. Past preferences of a large number of users are used to discover a latent set of variables that describe these relations.
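To make the idea of matrix factorization concrete, here is a minimal sketch in plain Python. It is not the method used by any particular recommender: the function names, the toy ratings and the hyperparameters (two latent factors, learning rate, regularisation) are all assumptions chosen for illustration. Each user and each song gets a small latent vector, and stochastic gradient descent adjusts these vectors so that their dot product approximates the observed preference.

```python
import random

def predict(users, songs, u, s):
    """Predicted preference: dot product of the user and song latent vectors."""
    return sum(a * b for a, b in zip(users[u], songs[s]))

def mean_squared_error(users, songs, ratings):
    """Average squared error over the observed (user, song, rating) triples."""
    return sum((r - predict(users, songs, u, s)) ** 2 for u, s, r in ratings) / len(ratings)

def factorize(ratings, n_users, n_songs, n_factors=2, lr=0.02, reg=0.02,
              epochs=1000, seed=0):
    """Learn latent vectors with plain SGD (all hyperparameters are illustrative)."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    songs = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_songs)]
    for _ in range(epochs):
        for u, s, r in ratings:
            err = r - predict(users, songs, u, s)
            for k in range(n_factors):
                pu, qs = users[u][k], songs[s][k]
                # Gradient step on squared error with L2 regularisation.
                users[u][k] += lr * (err * qs - reg * pu)
                songs[s][k] += lr * (err * pu - reg * qs)
    return users, songs

# Hypothetical toy data: (user index, song index, rating on a 1-5 scale).
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]

untrained = factorize(ratings, n_users=3, n_songs=3, epochs=0)
trained = factorize(ratings, n_users=3, n_songs=3)
```

Calling `predict(*trained, 0, 2)` would give an estimate for a user-song pair that was never observed. A song that appears in no rating triple keeps its random initial vector, so its predictions carry no information.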
A major disadvantage of collaborative filtering is that it relies heavily on the preference history of a large user base. As a result, popular songs have a much higher probability of being recommended to a new user than less popular songs. Furthermore, newly released songs, for which no purchasing history is available, cannot be recommended at all. To solve this, one could try to model a song’s characteristics directly, by calculating a set of acoustic features that describe the raw waveform of the music. These features can then be used to cluster songs based on their similarity. Such recommender approaches are termed content filtering, a well-known application of which is the Music Genome Project.
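As a rough illustration of what an acoustic feature computed from the raw waveform can look like, the sketch below calculates two common descriptors, RMS energy (loudness) and zero-crossing rate (a crude proxy for brightness or noisiness), and compares songs by the distance between their feature vectors. The choice of these two features, the function names and the synthetic signals are assumptions made for this example; the article does not state which features are actually used.

```python
import math

def rms_energy(samples):
    """Root-mean-square amplitude of the waveform."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def feature_vector(samples):
    """Describe a waveform by the pair (RMS energy, zero-crossing rate)."""
    return (rms_energy(samples), zero_crossing_rate(samples))

def distance(features_a, features_b):
    """Euclidean distance between two feature vectors; smaller means more similar."""
    return math.dist(features_a, features_b)

def sine(cycles, amplitude, n=100):
    """Synthetic stand-in for a music fragment: a sine wave of n samples."""
    return [amplitude * math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

quiet_low = feature_vector(sine(cycles=2, amplitude=0.2))
loud_high = feature_vector(sine(cycles=20, amplitude=0.9))
```

In practice the features would be computed per short frame and on very different scales, so they would normally be standardised before any clustering step.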
A major hurdle in content filtering is the quest for informative features. How can we describe a song’s characteristics and how can we measure song similarity? In this article, we discuss one aspect of music recommendation based on content filtering, namely emotions. Research in the field of music psychology has shown that music induces a clear emotional response in its listeners. Furthermore, musical preferences have been shown to be highly correlated with personality traits and mood. The question then is: can we automatically detect the mood of a song?
Obviously, a user’s affective response to a music fragment depends on a large set of external factors, such as gender, age, culture, and context (e.g. time of day or location). However, these external variables set aside, humans are able to consistently categorise songs as being happy, sad, enthusiastic or relaxed. At Argus Labs we recently developed an algorithm that is able to do exactly that.
An important question, however, is how emotions themselves should be defined and described. A well-known method in affective neuroscience is the circumplex model of affect. This model describes the two most important dimensions of emotion, obtained through factor analysis of large manually labelled datasets. The two dimensions used in the circumplex model are valence and arousal. Valence represents a user’s well-being, ranging from sad to happy, whereas arousal represents the user’s activation level, ranging from relaxed to enthusiastic. A large set of emotions can easily be defined by combining both dimensions, as shown in figure 1.
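The mapping from a (valence, arousal) pair to a coarse mood word can be sketched as follows. This is only an illustration under stated assumptions: both dimensions are taken to be centred on zero and to lie roughly in [-1, 1], the label is decided by whichever dimension is more pronounced, and the `neutral_radius` threshold is a made-up parameter. It is not the labelling used by our models.

```python
def mood_label(valence, arousal, neutral_radius=0.1):
    """Map a point in the valence-arousal plane to one coarse mood word.

    Assumes both values are centred on 0 (negative = sad or relaxed,
    positive = happy or enthusiastic). neutral_radius is a hypothetical
    threshold below which no mood is reported.
    """
    if max(abs(valence), abs(arousal)) < neutral_radius:
        return "neutral"
    if abs(valence) >= abs(arousal):
        # Valence dominates: the sad-to-happy axis.
        return "happy" if valence > 0 else "sad"
    # Arousal dominates: the relaxed-to-enthusiastic axis.
    return "enthusiastic" if arousal > 0 else "relaxed"

labels = [mood_label(v, a) for v, a in [(0.8, 0.1), (-0.7, 0.2), (0.1, 0.9), (0.2, -0.6)]]
```

A finer vocabulary could be obtained by dividing the plane by angle rather than by dominant axis, which is closer to how the circumplex model arranges emotions around a circle.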
Since valence and arousal are represented by real numbers, we developed a regression model that is able to recognise emotions exhibited by music in real time. In the current implementation, emotion estimates are reported three times per second. To encourage the machine learning community to explore the importance of emotion and mood in music, we decided to release our models as a freely available API, which will be announced on our blog next week.
Genres and emotions
A first interesting observation is the close relation between genres and emotions, which encourages further research into the importance of human emotions in music recommendation. Figure 2 shows trajectories of several hundred songs in the circumplex model, grouped by genre. Each plotted item corresponds to a one-second fragment of a song. The snake-like structures that appear in these plots are due to the temporal consistency of subsequent one-second labels in each song.
Although genre labeling was done quickly and coarsely, it is easy to see how specific genres often correspond to specific subspaces of the valence-arousal grid. For example, rap is shown to exhibit high arousal (i.e. energetic, enthusiastic) with low valence (i.e. unhappy). Country music corresponds to low arousal (quiet music), and soul is perceived as both happy and enthusiastic music.
Since our training data is based on a large set of popular music, most songs are labeled as pop, rock or pop-rock. Now let us explore the regression results for some of these songs in more detail.
Energetic Rock: Down with the Sickness
Songs in the top-left corner of the valence-arousal grids of figure 2 are highly energetic, yet are associated with a negative valence. One such song from the rock segment is Down with the Sickness by the metal band Disturbed. The following video shows how the valence and arousal fluctuate throughout the song:
While the arousal values are relatively stable, indicating a very energetic song, the valence level fluctuates locally and clearly shows a downward trend throughout the whole song. Furthermore, a nice example of how valence and arousal represent very different aspects of human emotion can be observed after about 240 seconds, where the valence suddenly increases whereas the song’s arousal goes down.
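A trend such as the one described here can be quantified by fitting a straight line to the per-second predictions and reading off its slope. The sketch below does this with ordinary least squares in plain Python. The function name and the example series are hypothetical; the numbers are invented and are not taken from the song.

```python
def linear_trend(values):
    """Least-squares slope of a series sampled at equal intervals (units per step)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance

# Invented example: valence drifts down with local wiggles, arousal stays flat.
valence = [0.10, 0.05, 0.08, -0.02, 0.00, -0.10, -0.07, -0.15]
arousal = [0.80, 0.82, 0.79, 0.81, 0.80, 0.82, 0.79, 0.81]

valence_slope = linear_trend(valence)
arousal_slope = linear_trend(arousal)
```

A clearly negative `valence_slope` next to an `arousal_slope` near zero would correspond to the pattern described above: stable energy with steadily decreasing valence.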
Poppy Enthusiasm: Opposites Attract
Another interesting example is the song Opposites Attract by Paula Abdul. This song is located in the top-right corner of the valence-arousal plots for both electronic and pop, shown in figure 2. The song exhibits a high valence, indicating happiness, and a relatively high arousal, indicating enthusiasm:
While the valence values are relatively stable, indicating a happy song, the arousal levels clearly follow a pattern, defined by the interplay of different voices and intonations.
Lovely, Gentle Love Songs: Have I Told You Lately
Let us now have a look at a love song, showing a slightly positive valence, i.e. mild happiness, and a negative arousal, i.e. low energy. One such song is Have I Told You Lately by Rod Stewart. This song is located towards the bottom-right corner of the valence-arousal plot for rock and pop music in figure 2.
This song is a nice example of how emotions often build up throughout a song, showing an upward trend for both valence and arousal during the first half of the song.
A Tad Sad: Perfect Blue Buildings
Finally, an example of a song with low valence and low arousal on average is Perfect Blue Buildings by Counting Crows. This song is located in the bottom-left corner of the pop and rock plots of figure 2. The song shows some interesting fluctuations in both valence and arousal during the chorus, around second 150 and around second 240:
In this article we discussed how music elicits emotional reactions from its listeners, and we showed how emotions can fluctuate throughout a song. The above demonstrations are plots of the valence and arousal predictions obtained by our SDK, which will be released to the public next week.
At Argus Labs, we are working towards the age of empathic devices by recognising emotion from context, behaviour and data.