A core tenet of empirical research is that a robust sample offers a better approximation of tendencies within the broader target population. Despite this, datasets developed for music research do not always reflect real-world population diversity regarding an artist or composer's racial, ethnic, and/or gender identity. Corpora of Western European art music (i.e., "classical" music) demonstrate this with an overt demographic bias toward a small collection of composers who are white and male (e.g., Devaney et al., 2015; Neuwirth et al., 2018). Popular-music corpora the Rolling Stone 200 (RS200; de Clercq & Temperley, 2011) and McGill Billboard Hot 100 (MBB; Burgoyne, Wild, & Fujinaga, 2011) meanwhile appear to sidestep this issue by featuring songs by a comparatively diverse contingent of artists. However, data from this study challenge this assumption: applying the demographic model to artist lists from the RS200, MBB, and a more robust independent sample from ultimate-guitar.com (UG; Shea, 2020) demonstrates a continued need for researcher intervention to avoid amplifying real-world biases against non-white and non-male popular-music artists in corpus studies.
Demographic biases can manifest in various ways in music. For example, Kinney (2018) reports that urban secondary schools in the United States are less likely to attract students to music elective offerings than suburban schools. Urban districts historically enroll a greater proportion of non-white students and are also underfunded due to lower support from property taxes (Reschovsky, 2016). As such, urban schools, and consequently non-white students, are less likely to receive access to quality music education in the United States. In music research and pedagogy, meanwhile, resources such as corpora or textbooks can marginalize certain demographic groups through canonization. That is, when a resource includes one work over another, it implicitly signals that the included work is more important. Palfy and Gilson (2018) and Ewell (2020) address this issue of canonization explicitly in their surveys of music theory textbooks. These authors argue that, even though some textbooks do include a handful of works by non-white composers, the disproportionate underrepresentation of Black and other non-white composers signals that works by white composers are still regarded as those most worth studying.
This data report presents a demographic model and an accompanying database to address canonization biases in future popular-music corpus work. It specifically draws on sampling strategies in machine-learning research and interdisciplinary encoding procedures to foster more inclusive corpus building along the parameters of race, ethnicity, and gender.
This report's dataset consists of demographic information for popular-music artists (n = 1,438) featured in the RS200, MBB, and UG song databases. RS200 artists (n = 121) derive from the song list featured on Trevor de Clercq's personal website. 2 MBB artist names (n = 417) were gathered from the dataset index on The McGill Billboard Project website. 3 Artist names from the UG sample (n = 1,132) were parsed from the top-rated "pop" and "rock" songs encoded in the GuitarPro file format (n = 5,393 songs) featured in Shea (2019). 4 Demographic data were gathered from a variety of public sources, including Wikipedia.org, nndb.com, artist websites, and published artist interviews in print and online magazines.
Data are presented in a CSV format. A searchable and downloadable version is available on Google Sheets. 5 No licenses are required to access the data.
Demographic data are generated by following interdisciplinary encoding practices, including those in health sciences (Hauck et al., 2011) and survey reports conducted by the USC Annenberg Inclusion Initiative (Smith et al., 2020a, 2020b) and the Institute for Composer Diversity ("Composer Diversity Database," 2021). The following section summarizes the encoding procedure as it corresponds to evaluating the parameters of gender, race, ethnicity, and primary status.
Broadly, the encoding procedure involves searching online resources for demographic information about popular-music artists and encoding this information as a series of dichotomous variables. 6 These variables are strictly operationalized to avoid making problematic assumptions about an artist's identity. For example, encoders are not permitted to encode variables based on visual evidence alone. Encoders also cannot encode certain variables, such as ethnicity, unless they are made explicit by the found source. To mitigate risk of artist mis-categorization, encoders are provided with a set of guidelines and a training sample before beginning data entry. The encoders then meet with the database developer to discuss their findings, clarify ambiguities, and cross-validate their results.
The author encoded demographic data for the RS200 and MBB corpora. A team of four undergraduate and graduate students encoded the UG sample. To ensure accuracy, the author then hired an independent reviewer to verify all demographic data included in this report.
Gender, race, and ethnicity are encoded under the condition that at least one member of the ensemble meets the described demographic criteria. The primary status condition, established by Shea (2019, p. 94), considers a minoritized artist's agency and identity in proportion with other ensemble members to avoid tokenism. That is, primary status and its related gender and BIPOC (i.e., "Black, Indigenous, and people of color," Garcia, 2020) conditions consider the demographic makeup of the entire ensemble.
Artist gender is defined via two parameters, represented in columns. The non-male column parameter indicates the identity of artists whose gender corresponds to those historically assigned at birth as "male" or "female." 7 These labels are also implemented for the gender identity of transgender artists. A "1" is assigned to the non-male column if any artist within the ensemble identifies as non-male, while a "0" is assigned for artists who identify as male. The non-cis variable is meanwhile reserved for any artist within the ensemble whose gender identity does not align with those that have historically been assigned at birth. 8 A "1" in the non-cis column indicates an artist identifies as non-cisgender (e.g., transgender, non-binary, etc.) and a "0" indicates the reverse. If an artist's gender identity under the non-cis variable cannot be ascertained, the encoder enters a "0" for this column and defers to the pronouns used in the online source to categorize the artist accordingly under the non-male column. 9
Race and ethnicity are encoded as separate column parameters. Conditions for these parameters frequently overlap, but their distinction in the dataset reflects the broad ways in which non-white persons have been subject to marginalization. Artist race is treated as the socially determined distinction between human groups based on "perceived common physical characteristics" that are inherent from birth (Cornell & Hartman, 1998, p. 24). Race is primarily externally imposed on marginalized persons, such as by white Europeans as they enslaved African peoples (Cornell & Hartman, 1998, p. 24). Artist ethnicity is meanwhile defined as "a sense of common ancestry based on cultural attachments, past linguistic heritage, religious affiliations, claimed kinship, or some physical trait" (p. 19). 10 Hispanic or Latino persons are treated by the United States Census as members of an ethnic category, while white, Black/African American, American Indian or Alaska Native, Asian, and Native Hawaiian or other Pacific Islander are all racial categories. 11
When categorizing artists, encoders adhere to the following guidelines: encoders do not 1) encode race or ethnicity based on visual evidence alone, 2) encode race or ethnicity unless it is made explicit in the online resource, and 3) distinguish artists who are multiracial or multiethnic (e.g., Beyoncé) from those who are not. A "1" in the race or ethnicity column indicates an artist is non-white via the outlined parameters.
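The "at least one member" condition for the non-male, non-cis, race, and ethnicity columns can be illustrated with a minimal Python sketch. It is not the procedure the encoders used: the `Member` record, its field names, and the `encode_ensemble` helper are hypothetical, and the sketch assumes that any characteristic a source does not state explicitly is stored as `None` and therefore encoded as "0".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Hypothetical per-member record; None means the source was not explicit."""
    name: str
    non_male: Optional[bool] = None
    non_cis: Optional[bool] = None
    non_white_race: Optional[bool] = None
    minoritized_ethnicity: Optional[bool] = None


def any_member(members, attribute):
    """Return 1 if at least one member is explicitly coded True, else 0."""
    return int(any(getattr(m, attribute) is True for m in members))


def encode_ensemble(members):
    """Collapse member records into the four 'at least one member' columns."""
    return {
        "non_male": any_member(members, "non_male"),
        "non_cis": any_member(members, "non_cis"),
        "race": any_member(members, "non_white_race"),
        "ethnicity": any_member(members, "minoritized_ethnicity"),
    }


# Illustrative input: only the detail stated in Table 1 is filled in;
# the remaining members are placeholders, not researched records.
cranberries = [
    Member("Dolores O'Riordan", non_male=True),
    Member("placeholder member 2", non_male=False),
    Member("placeholder member 3", non_male=False),
]
flags = encode_ensemble(cranberries)
print(flags)
```

Treating an unstated characteristic as `None` rather than `False` keeps "not explicit in the source" distinct from "explicitly not the case", even though both yield a "0" in the column.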
The last column parameter, primary status, aims to disrupt demographic tokenism within the dataset. This parameter considers the demographic makeup of the entire ensemble. As an example, D.H. Peligro is the drummer for Dead Kennedys and is the only Black member of an otherwise all-white group. Guitarist and singer-songwriter Tracy Chapman is also a Black musician who leads backing bands whose demographic makeup varies depending on the performance, but whose members are often white. While Peligro and Chapman's identities as Black artists are underrepresented, both within their ensembles and across the sampled popular-music corpora, Chapman arguably holds increased agency as the title artist, lead singer, and primary songwriter of her group. She is the public face of the group and has a large degree of creative control compared to Peligro. Treating the Dead Kennedys as demographically analogous to Tracy Chapman therefore runs the risk of tokenizing Peligro's Blackness as currency for a somewhat flimsy measure of diversification. In other words, it is inappropriate to categorize the Dead Kennedys as a primarily diverse group just because one member is Black.
A group is considered primarily diverse under the primary status column if 1) more than half of its members are minoritized by race, ethnicity, or gender, or 2) the public-facing or title member of the ensemble is of minoritized demographic status. The former measures primacy by proportion, while the latter does so by artist agency. However, ambiguous cases where proportion and agency are at odds inevitably arise. Encoders use their own discretion to make judgements as supported by the evidence available to them and offer notes in the comments column for clarification.
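The two primary status conditions can be written as a short rule. The Python sketch below is illustrative only: the function name and its inputs are hypothetical, the four-member Dead Kennedys lineup is an assumption made for the example, and the rule cannot reproduce the encoder discretion applied to ambiguous cases.

```python
def primary_status(member_minoritized, lead_minoritized):
    """Return 1 if an ensemble is primarily diverse under the two stated conditions.

    member_minoritized: list of bools, one per member (minoritized by race,
        ethnicity, or gender).
    lead_minoritized: bool, whether the public-facing or title member is
        of minoritized demographic status.
    """
    if not member_minoritized:
        raise ValueError("an ensemble needs at least one member")
    # Condition 1: primacy by proportion (strictly more than half).
    by_proportion = sum(member_minoritized) > len(member_minoritized) / 2
    # Condition 2: primacy by agency.
    by_agency = bool(lead_minoritized)
    return int(by_proportion or by_agency)


# One minoritized member out of an assumed four, not the title member.
dead_kennedys = primary_status([False, False, False, True], lead_minoritized=False)
# A title artist of minoritized status, regardless of the backing band.
tracy_chapman = primary_status([True, False, False], lead_minoritized=True)
print(dead_kennedys, tracy_chapman)
```

Under this rule an evenly split ensemble without a minoritized title member is coded "0", because the proportion condition requires more than half.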
A final consideration regarding primary status is selectivity during sampling. In a study on the field of music theory's white racial frame, Ewell (2020) criticizes the Society for Music Theory's Committee on Race and Ethnicity for splitting its focus amongst many types of diversity, which he argues marginalizes efforts to foster racial diversity (para. 6.3). Similarly, sampling artists via the primary status parameter as it stands runs the risk of over-fitting the data on categories such as gender at the sacrifice of race or ethnicity. Put another way, the primary status category acts as a broader measure of different types of diversity, but currently does not allow researchers to parse for primary-status artists along the lines of race or gender in isolation. In response to this potentially problematic lack of nuance, two additional parameters are implemented. Following Strmic-Pawl et al. (2018) and Smith et al. (2020a), race and ethnicity are synthesized in coordination with primary status to generate the BIPOC column. The gender column is similarly generated from the non-male, non-cis, and primary status parameters. 12 The BIPOC and gender columns therefore indicate primary-status artists who are so by measures of race and ethnicity or gender, respectively. Table 1 describes each sampling condition summarized in this section.
Column | Marker | Description | Example artist/ensemble |
---|---|---|---|
Non-male | 0 | All-male ensemble | Led Zeppelin |
Non-male | 1 | Any other distribution of member identity | The Cranberries; Dolores O'Riordan (female) |
Non-cis | 0 | All cisgender members | The Beatles |
Non-cis | 1 | Any other distribution of member gender identity | Fever Ray; Karin Dreijer (they/them) |
Race | 0 | All-white ensemble | The Who |
Race | 1 | Any other distribution of member identity | Fine Young Cannibals; Roland Gift (Black) |
Ethnicity | 0 | All-white ensemble | Johnny Cash |
Ethnicity | 1 | Any other distribution of member identity | Hombres G; all members (Spanish) |
Primary status | 0 | Non-white or non-male member assumes a secondary role and constitutes less than half of the membership of the ensemble OR there are no non-white or non-male members. | Coldplay |
Primary status | 1 | Non-white or non-male member is a founding member, forward-facing member of the ensemble, and/or composes material OR more than half of public-facing members are non-white or non-male. | Little Mix; all members (female), Leigh-Anne Pinnock (Black) |
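The derivation of the BIPOC and gender columns from the other parameters can be sketched in Python. The exact synthesis procedure is not spelled out in this section, so the rule below is an assumption: BIPOC is set when primary status coincides with a "1" for race or ethnicity, and gender is set when primary status coincides with a "1" for non-male or non-cis. This rule agrees with every parameter combination listed in the appendix tables, but the author's actual procedure may differ in cases the aggregated tables cannot show. The function name and dictionary keys are hypothetical.

```python
def derive_columns(row):
    """Add assumed BIPOC and gender columns to a dict of 0/1 parameters."""
    primary = row["primary"]
    # Assumed rule: primary status by way of race or ethnicity.
    bipoc = int(bool(primary and (row["race"] or row["ethnicity"])))
    # Assumed rule: primary status by way of non-male or non-cis identity.
    gender = int(bool(primary and (row["non_male"] or row["non_cis"])))
    return {**row, "BIPOC": bipoc, "gender": gender}


# A parameter combination that appears in the appendix tables:
# non-male and primary status are 1, all other inputs are 0.
example = derive_columns(
    {"non_male": 1, "non_cis": 0, "race": 0, "ethnicity": 0, "primary": 1}
)
print(example["BIPOC"], example["gender"])
```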
Other identifying and potentially marginalizing characteristics such as sexuality, disability, level of education, and age are not summarized as column parameters and therefore are not used in the current sampling procedure. However, encoders frequently noted these relevant characteristics when observed. Database users can view related data by artist under the comments column. Finally, the author assumes all responsibility for any mis-categorization of an artist. The author also recognizes that one's gender identity can shift. As such, database users can submit requests to change or update an artist's demographic characteristics via Google Forms. 13 All requests will be reviewed for accuracy and implemented as soon as possible.
The following section summarizes demographic trends in the RS200 and MBB popular-music corpora. The model is also applied to the UG artist sample to compare how these trends might be reflected in an independent and more robust collection of artists (n = 962 unique artists). 14 RS200 and MBB artists are those whose songs were respectively selected for encoding by measures of critical acclaim and commercial success, while artists included in the UG sample are based on online musician ratings of song transcriptions. The UG sample therefore provides an alternative measure of artist popularity.
Figures 1 and 2 respectively model the distribution of artist identity under the gender and BIPOC parameters in each individual corpus sample. Table 2 provides summary statistics for each parameter across the combined corpora. As shown, all three samples largely prioritize white and male artists, even under the less restrictive conditions of non-male, non-cis, race, and ethnicity. The Appendix includes additional summary figures, including by song count for the RS200 and MBB corpora.
The non-male, non-cis, race, and ethnicity columns are coded when at least one member qualifies; the primary, BIPOC, and gender columns are coded by agency and proportion.

corpus | non-male | non-cis | race | ethnicity | primary | BIPOC | gender | n |
---|---|---|---|---|---|---|---|---|
MBB | .237 | .002 | .372 | .089 | .452 | .370 | .217 | 414 |
RS200 | .182 | .000 | .397 | .116 | .438 | .380 | .165 | 121 |
UG | .226 | .007 | .120 | .118 | .253 | .115 | .187 | 1126 |
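The proportions in this table can be recomputed from the appendix tables of parameter combinations. As a worked check, the Python sketch below sums the counts from the appendix table whose rows total 121 artists and divides by that total. Treating that table as the RS200 table is an assumption based on the matching n, and the variable and function names are hypothetical.

```python
COLUMNS = ("non_male", "non_cis", "race", "ethnicity", "primary", "BIPOC", "gender")

# (flags in COLUMNS order, artist count), copied from the appendix table
# whose counts sum to 121; reading it as the RS200 table is an assumption.
RS200_COMBINATIONS = [
    ((0, 0, 0, 0, 0, 0, 0), 53),
    ((0, 0, 1, 0, 1, 1, 0), 32),
    ((1, 0, 1, 0, 1, 1, 1), 13),
    ((0, 0, 0, 1, 0, 0, 0), 11),
    ((1, 0, 0, 0, 1, 0, 1), 6),
    ((0, 0, 0, 0, 1, 0, 0), 1),
    ((0, 0, 1, 0, 0, 0, 0), 1),
    ((0, 0, 1, 1, 0, 0, 0), 1),
    ((1, 0, 0, 0, 0, 0, 0), 1),
    ((1, 0, 0, 1, 0, 0, 0), 1),
    ((1, 0, 1, 1, 1, 1, 1), 1),
]


def proportions(combinations):
    """Return (total artists, {column: share of artists coded 1})."""
    total = sum(count for _, count in combinations)
    shares = {}
    for index, column in enumerate(COLUMNS):
        coded = sum(count for flags, count in combinations if flags[index] == 1)
        shares[column] = round(coded / total, 3)
    return total, shares


total, shares = proportions(RS200_COMBINATIONS)
print(total, shares)
```

For instance, 22 of the 121 artists are coded "1" for non-male, and 22 / 121 rounds to .182, matching the RS200 row.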
The purpose of this data report is not to determine a universal benchmark for artist diversity in corpus studies. As shown by Smith et al. (2020b), artist demographics can vary widely across historical period and genre (p. 20), meaning researchers will need to determine which parameters best suit their representational needs. However, once these needs are assessed, a few lines of R code can be used to generate a suitable sample.
One such application could be to address the relative underrepresentation of Black artists in rock music. Johnson (2018, p. 38) and Redd (1985, p. 41) both argue that Black artists were essentially sequestered from measures of mainstream commercial success when the Billboard charts implemented the "rhythm and blues" genre label that distinguishes works by Black artists from those by white artists under the "rock" genre label. Redd specifically argues that the racial motivations for these labels are clear given that rock music and rhythm and blues music are functionally equivalent. Given the propensity for music theory studies, including the RS200, to prefer the umbrella term "rock" to encompass a wide variety of genres, there is an obvious concern that Black artists are unduly overlooked in existing resources. Similarly, female Black musicians such as Sister Rosetta Tharpe and Memphis Minnie were seminal in establishing the stylistic norms of rock music (Jackson, 1995; Lewis, 2018), but are nonetheless underrepresented in current corpora. The accompanying R code takes these observations into consideration. 15 Specifically, it outlines how to create an artist sample (n = 200) that prioritizes BIPOC artists using R packages from the tidyverse (Wickham et al., 2019) and "splitstackshape" (Mahto, 2019, p. 26) when applied to the combined corpora artist database. This hypothetical sample has the following distribution of artist parameters: 50% of primary-status artists by race or ethnicity (n = 100), half of whom are non-male (n = 50), added to a random sample (n = 100) of other artists.
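The sampling distribution described above can also be sketched in Python using only the standard library. This is not the accompanying R code, and it makes several assumptions: the column keys `artist`, `BIPOC`, and `non_male` are hypothetical, the 50 BIPOC artists who are not drawn from the non-male stratum are drawn from BIPOC artists coded "0" for non-male, and the "other artists" are drawn at random from everyone not already selected. The pool used here is synthetic and stands in for the combined artist database.

```python
import random


def prioritized_sample(artists, n_bipoc=100, n_other=100, seed=0):
    """Draw n_bipoc BIPOC artists (half non-male) plus n_other remaining artists.

    random.Random.sample raises ValueError if a stratum is too small.
    """
    rng = random.Random(seed)
    bipoc_non_male = [a for a in artists if a["BIPOC"] == 1 and a["non_male"] == 1]
    bipoc_male = [a for a in artists if a["BIPOC"] == 1 and a["non_male"] == 0]
    half = n_bipoc // 2
    chosen = rng.sample(bipoc_non_male, half) + rng.sample(bipoc_male, n_bipoc - half)
    # The random remainder excludes artists already chosen above.
    chosen_ids = {a["artist"] for a in chosen}
    remaining = [a for a in artists if a["artist"] not in chosen_ids]
    return chosen + rng.sample(remaining, n_other)


# Synthetic pool of 1,000 placeholder artists (not real data).
pool = [
    {"artist": f"artist_{i}", "BIPOC": int(i % 3 == 0), "non_male": int(i % 4 == 0)}
    for i in range(1000)
]
sample = prioritized_sample(pool)
print(len(sample))
```

Fixing the seed makes the draw reproducible, which helps when a sampled artist list needs to be reported alongside a corpus.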
This article has been copy edited and layout edited by Jonathan Tang.

MBB artist counts by parameter combination (n = 414)

non-male | non-cis | race | ethnicity | primary | BIPOC | gender | n |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 193 |
0 | 0 | 1 | 0 | 1 | 1 | 0 | 83 |
1 | 0 | 1 | 0 | 1 | 1 | 1 | 53 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 33 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 17 |
0 | 0 | 1 | 1 | 1 | 1 | 0 | 9 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 |
0 | 0 | 0 | 1 | 1 | 1 | 0 | 4 |
1 | 0 | 0 | 1 | 1 | 1 | 1 | 3 |
0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

RS200 artist counts by parameter combination (n = 121)

non-male | non-cis | race | ethnicity | primary | BIPOC | gender | n |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 |
0 | 0 | 1 | 0 | 1 | 1 | 0 | 32 |
1 | 0 | 1 | 0 | 1 | 1 | 1 | 13 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 11 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 6 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

UG artist counts by parameter combination (n = 1,126)

non-male | non-cis | race | ethnicity | primary | BIPOC | gender | n |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 737 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 151 |
0 | 0 | 1 | 1 | 1 | 1 | 0 | 48 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 40 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 31 |
1 | 0 | 1 | 1 | 1 | 1 | 1 | 22 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 20 |
0 | 0 | 1 | 0 | 1 | 1 | 0 | 18 |
1 | 0 | 0 | 1 | 1 | 1 | 1 | 16 |
1 | 0 | 1 | 0 | 1 | 1 | 1 | 16 |
0 | 0 | 0 | 1 | 1 | 1 | 0 | 8 |
0 | 0 | 1 | 1 | 0 | 0 | 0 | 7 |
1 | 1 | 0 | 0 | 1 | 0 | 1 | 4 |
1 | 0 | 1 | 0 | 0 | 0 | 0 | 3 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
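
Song counts by parameter for the MBB and RS200 corpora. The non-male, non-cis, race, and ethnicity columns are coded when at least one member qualifies; the primary, BIPOC, and gender columns are coded by agency and proportion.

corpus | non-male | non-cis | race | ethnicity | primary | BIPOC | gender | n |
---|---|---|---|---|---|---|---|---|
MBB | 165 | 1 | 234 | 69 | 298 | 233 | 153 | 734 |
RS200 | 26 | 0 | 73 | 30 | 80 | 71 | 24 | 200 |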