ADVANCEMENTS in computational models have opened up the possibility of exploring hypotheses about aspects of human cognition in a controlled and reproducible way. Coupled with cognitive theories such as predictive coding (Clark, 2013; Friston, 2003), the power of (computational) statistical models can be leveraged to design systems that replicate human behavior, and thus allow us to study hypotheses about the human experience. Computational models of musical understanding can allow us to test specific hypotheses about the underlying processes involved in the way humans compose, perform and/or listen to (or, more broadly, experience) music.
In their paper, Mihelač et al. (2021) propose a method for classifying folk melodies according to their regularity using the Information Dynamics of Music (IDyOM) model (Pearce, 2005; Wiggins et al., 2012), a well-known statistical model of musical expectation. Mihelač et al. show that such a model can be used to discover complexity in music that is considered to be "simple," such as European children's folk music. In this commentary, I would like to take a step back and reflect on the concepts of regularity and irregularity (or, as Mihelač et al. write, "(ir)regularity") in musical structure, and provide a perspective on the use of data-driven statistical models to analyze musical structure.
The contributions of this paper are twofold: 1) I present a critical view of the concept of regularity in musical structure, particularly in connection with data-driven statistical models; and 2) I present a personal vision of the aspects/components necessary for comprehensive cognitively-plausible data-driven models of musical experience.
Before starting a more in-depth discussion, and for the sake of transparency, I would like to state that I am a computer scientist working primarily in the field of music information retrieval, and thus this commentary comes from the point of view of someone developing models that capture aspects of the (human) musical experience, rather than directly studying the cognitive processes involved in music perception. Furthermore, I am a supporter of the predictive coding framework, and thus I believe that information-theoretic methods are good candidates for modeling cognitive processes. While the question of whether a truly complete computational phenomenology capturing all aspects of the human experience is possible (or even desirable) goes far beyond the scope of this paper (see Harlan (1984) for an old-fashioned and optimistic account, or Ramstead et al. (2022) for a more recent discussion), we should ponder the question of what the limitations of current computational and statistical approaches are when modeling aspects of our experience of music.
The rest of this commentary is structured as follows: I first discuss potential issues with the concept of regularity and with algorithmic bias in statistical models of music. Afterwards, I present an integrated perspective on cognitively-plausible computational listeners. The final section concludes this commentary.
"All models are wrong, but some are useful", often attributed to statistician George Box, 3 is a famous aphorism in statistics that recognizes the limitations of scientific models of describing complex (real) phenomena, while still acknowledging their usefulness. But when do the limitations of a model get in the way of its usefulness? Mihelač et al. argue that in computational models can offer "a more objective analysis of music" compared to traditional empirical approaches involving listeners, due to the subjective nature of musical experience. While I partially agree with this sentiment, I think it is important to be cautious with insights derived from statistical models of music. In this section, I want to focus on two aspects: 1) the concept of regularity in the context of statistical models of music and 2) the issue of representing/encoding music for computational models. In the rest of this section, I will focus on data-driven (statistical) models of music, since these models are the basis of many current cognitively-plausible models, but most of the discussion also applies to other kind of models.
While the concepts of regularity and irregularity are useful in many contexts, we should be careful with their application to the characterization of cultural objects like music, because of the baggage that they carry. In particular, the concept of irregularity as "deviation" from established (typically Eurocentric) norms has many colonial implications. Examples of this issue can be seen throughout scholarly music traditions of the 19th and 20th centuries, from Schenkerian analysis to Theodor Adorno's views on jazz (Adorno & Daniel, 1989), or European descriptions of non-Western traditions (e.g., 19th-century descriptions of Afro-Brazilian music and dance; see Fryer, 2000, cited by Naveda, 2011). For a more in-depth discussion, see Ewell (2020) and references therein. Mihelač et al. avoid some of these issues by defining regularity in terms of syntactic structure, i.e., periodic dominant musical patterns and the relationships between these patterns. Under this definition, structures that seem syntactically regular in one piece might be perceived as irregular in another.
Still, issues with the concepts of regularity and irregularity become more complicated in the context of computational models when we consider algorithmic bias: systematic algorithmic errors that produce unintended and "unfair" outcomes. Mitchell et al. (2021) identify two kinds of algorithmic bias: statistical and societal. Statistical bias refers to the mismatch between the distribution of the sample used to train the model and the distribution of the real world. A typical example of this kind of bias occurs in facial recognition systems, where white males are disproportionately overrepresented in training datasets, resulting in models that are worse at recognizing people of color (Introna & Wood, 2004; Van Noorden, 2020). In models of music experience, this kind of bias occurs when training models with datasets that overrepresent Western music (and, in the context of Western classical music, Austro-German music in particular). For example, in the paper by Mihelač et al., the authors aimed for a good distribution of European folk music and included examples from 22 different countries. However, 124 of the 736 examples (ca. 17%) were from Germany, reflecting the widespread difficulty that researchers in the field have in achieving adequate representation, due in part to the limited availability of repositories containing repertoire from many parts of the world.
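To make such imbalance explicit, one could compare the empirical per-country distribution of a corpus against a uniform reference, for instance with the Kullback-Leibler divergence. The sketch below illustrates this; note that it is only one of many possible imbalance measures, and that all counts except the one for Germany (reported by Mihelač et al.) are hypothetical.

    import math

    # Hypothetical per-country counts of training examples; only the
    # count for Germany (124 of 736) is reported by Mihelač et al. (2021).
    counts = {
        "Germany": 124,
        "France": 40,   # hypothetical
        "Hungary": 25,  # hypothetical
        # ... remaining countries, up to 736 examples in total
    }

    def kl_from_uniform(counts):
        """Kullback-Leibler divergence D(P || U), in bits, between the
        empirical distribution P and a uniform distribution U over the
        same set of countries. D = 0 means a perfectly balanced corpus."""
        total = sum(counts.values())
        n = len(counts)
        return sum((c / total) * math.log2((c / total) * n)
                   for c in counts.values() if c > 0)

    print(f"D(P || U) = {kl_from_uniform(counts):.3f} bits")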
Societal bias, on the other hand, occurs when the model implicitly learns social biases (e.g., gender gaps, racism, etc.), even if the training dataset does reflect the real-world distribution of the data. Examples of this kind of bias have been shown, e.g., in deep learning models for natural language processing, where some state-of-the-art models exhibit strong associations between negative stereotypes, such as linking Muslims and terrorism (Solaiman et al., 2019). In the case of models of music, this kind of bias has been observed in music recommendation systems, where the performance of the models differs between groups of users depending on their characteristics (e.g., gender, race, ethnicity, age, etc.) (Melchiorre et al., 2021).
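The basic shape of an audit for this kind of bias is to compute a performance metric per user group and report the gap, as in the sketch below. The scores and group labels are hypothetical stand-ins, not the evaluation protocol of Melchiorre et al.

    from statistics import mean

    # Hypothetical per-user performance scores (e.g., recommendation
    # accuracy), keyed by a demographic attribute of the users.
    scores_by_group = {
        "group_a": [0.71, 0.68, 0.74, 0.70],
        "group_b": [0.58, 0.61, 0.55, 0.60],
    }

    group_means = {group: mean(scores)
                   for group, scores in scores_by_group.items()}
    gap = max(group_means.values()) - min(group_means.values())
    print(group_means)
    print(f"performance gap between groups: {gap:.3f}")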
A partial solution to some of these problems would be to strictly delimit the scope/applicability of the models, i.e., to assume that models trained on a particular kind of music will not generalize well enough (i.e., be applicable) to other kinds of music to make confident predictions. While this is a somewhat unsatisfactory solution, there is also a drive to develop interpretable machine learning models, an effort that is present in music research communities as well (Praher et al., 2021).
Computers do not "perceive" or "process" music in the same way that humans do. Typically, computers read/parse music in formats that encode only part of the information that constitutes the full human experience of music. Babbitt (1965) proposed that music is a cultural construct that exists in three representational domains: the graphemic domain (the score), the acoustic domain (the performance, or physical realization of the score) and the auditory domain (the perception of the performance). These representational domains can be linked to the different roles in Kendall & Carterette's (1990) communication model of musical expression: we can consider musical expression as a communication process between a composer (graphemic), whose ideas are transformed into an acoustic signal by a performer (acoustic), and finally perceived and ultimately interpreted as a musical idea by the listener (auditory). More recent conceptualizations describe this process as dynamic and multi-directional, with the three domains in constant interaction (e.g., Maes et al., 2014). As argued by Wiggins et al. (2010), music as a whole cannot be effectively studied from the standpoint of pure audio analysis (i.e., the acoustic domain), nor from that of pure music theory (i.e., the graphemic domain).
However, handling all representational domains adds many layers of difficulty to developing models, and researchers need to trade off the ecological validity of the musical representation against the interpretability (and ultimate usefulness) of the models. Research in computational musicology tends to rely on symbolic representations of music (i.e., machine-readable versions of the musical content, such as MIDI, MusicXML or MEI), and these representations have limitations. For example, the way musical pitch is represented in the MIDI standard was designed for Western equal-tempered music, and might not be the most appropriate way to represent microtonal music traditions like Turkish or Greek folk music, as illustrated in the sketch below. Furthermore, there is the general issue of quantization in music representations (not only in pitch, but also in time, etc.), where complex aspects of music are discretized into "simpler" categories/scales, which usually tend to fit concepts of Western music traditions (e.g., equal temperament, isochronous beat grids, etc.) (Lenchitz, 2021). Still, there are efforts addressing some of these issues, like the work of the Music Encoding community, and work on multi-modal modeling of music (Simonetta et al., 2019).
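To make the pitch quantization issue concrete, the following sketch maps a frequency to the nearest MIDI note number and reports the residual deviation in cents that the plain note number cannot encode. The helper function is illustrative, not part of any MIDI library.

    import math

    def freq_to_midi(freq_hz):
        """Map a frequency to the nearest 12-tone equal-tempered MIDI
        note number, returning also the residual deviation in cents
        that the quantization discards (A4 = 440 Hz = MIDI note 69)."""
        exact = 69 + 12 * math.log2(freq_hz / 440.0)
        nearest = round(exact)
        cents_off = 100 * (exact - nearest)
        return nearest, cents_off

    # A neutral second roughly 150 cents above A4, an interval found in,
    # e.g., Turkish makam music, falls exactly between two MIDI semitones:
    note, cents = freq_to_midi(440.0 * 2 ** (150 / 1200))
    print(note, round(cents, 1))  # 50 cents are lost in the note number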
According to the predictive coding paradigm (Clark, 2013; Friston, 2003; Friston & Kiebel, 2009), the human brain is essentially a prediction machine, aiming to minimize the discrepancy between the organism's expectancies and imminent events. In this light, probabilistic and information-theoretic models of musical expectation have been shown to be adequate frameworks for developing cognitively-plausible models of music cognition (Huron, 2006; Temperley, 2007, 2019; Wiggins et al., 2010); a minimal sketch of such a model is given below. We should, however, not discard non-cognitively-plausible approaches wholesale. I am sure that many listeners have had their experience of music enriched by learning facts about its structure that would not necessarily be evident from listening alone (i.e., without any prior musical training). As a personal example, I collaborated with Olivier Lartillot in preparing a live visualization of Contrapunctus XIV, part of J. S. Bach's The Art of the Fugue, BWV 1080, for a concert by the Danish String Quartet. Working on this visualization helped me better understand the structure of the music and allowed me to connect on an emotional level to what I used to consider a very cerebral and cold piece. Models like David Meredith's approach to music analysis using point-set compression (Meredith, 2016), or deep learning models of music classification, have proven to be useful, and in some cases are the state of the art for many MIR tasks. In the rest of this section, I discuss three aspects which I believe are important for designing comprehensive cognitively-plausible data-driven models of musical experience.
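As an illustration of this family of models, the sketch below computes the information content (surprisal) of each note in a melody under a first-order Markov model of pitch transitions. This is a drastic simplification in the spirit of IDyOM (Pearce, 2005), which combines multiple viewpoints and variable-order contexts; the corpus here is a hypothetical toy example.

    import math
    from collections import Counter, defaultdict

    def train_bigram(melodies):
        """Count pitch-to-pitch transitions across a corpus of melodies
        (each melody is a list of MIDI pitch numbers)."""
        transitions = defaultdict(Counter)
        for melody in melodies:
            for prev, nxt in zip(melody, melody[1:]):
                transitions[prev][nxt] += 1
        return transitions

    def surprisal(melody, transitions, alpha=1.0, vocab_size=128):
        """Information content -log2 P(note | previous note) of each
        note, with additive smoothing so unseen transitions get finite
        surprisal. High values mark unexpected, less "regular" events."""
        ics = []
        for prev, nxt in zip(melody, melody[1:]):
            counts = transitions.get(prev, Counter())
            p = (counts[nxt] + alpha) / (sum(counts.values()) + alpha * vocab_size)
            ics.append(-math.log2(p))
        return ics

    # Toy corpus of MIDI pitch sequences (hypothetical):
    corpus = [[60, 62, 64, 65, 67], [60, 62, 64, 62, 60]]
    model = train_bigram(corpus)
    # The unseen continuation 64 -> 66 yields a noticeably higher surprisal:
    print([round(ic, 2) for ic in surprisal([60, 62, 64, 66], model)])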
In this commentary, I have presented some perspectives on, and potential issues with, computational models of musical experience. New computational models and greater data availability have enabled the development of cognitively-plausible models of musical experience. We should, however, be mindful of the scope of these models whenever we draw insights from them.
As a final (meta-)commentary, I think it is very important to emphasize the need for more interdisciplinary research involving music cognition, musicology/music theory and computer science, and not only in the development of computational models of musical experience. Only in this way will we be able to move past the limitations of each individual field.
This work received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 101019375 (Whither Music?).