Spotify Unwrapped

Although Spotify provides a nice yearly wrap-up, as a data nerd getting hold of your raw dataset and doing your own analysis is a lot more fun. So here’s my take on my year in music.

Spoilers: no surprises who my top artists are.

The Data

Spotify makes data available to users via the Account Privacy page (see this support article for more info).

It takes a few days to generate a download archive but once it’s ready, you’ll get a zip file containing a bunch of json files as described by the Spotify data dictionary

StreamingHistory0.json contains my streaming history for the year and it’s this file I’ll be basing my report on.

Data Prep

First stage of any data exploration is to get familiar with the source data. Opening StreamingHistory0.json in Power BI / Power Query gives:

Some things to note.

  • The data is in a single transaction file and will need to be transformed into a star schema.
  • The endTime is in UTC but has no timezone information.
  • There are no keys to identify the artists or tracks uniquely.
  • Time played msPlayed is in milliseconds.
  • Overall, the data looks fairly clean; there are no error rows, no nulls etc.

Data Prep - Base Data Load

First, let’s start with tweaking the default M code generated by Get Data. When working with local file data, it’s always a good idea to parameterise paths. I’ll be using the parameter DataFolder for this.

let
    Source = Json.Document(File.Contents( DataFolder  & "\my_spotify_data\MyData\StreamingHistory0.json")),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"endTime", "artistName", "trackName", "msPlayed"}, {"endTime", "artistName", "trackName", "msPlayed"}),

Next to add timezone information to the endTime column. I’ll use the DateTimeZone.AddZone and DateTimeZone.ToLocal functions to do this:

    #"Add Local Timezone" = Table.TransformColumns(#"Expanded Column1", { {"endTime", each DateTimeZone.ToLocal(DateTime.AddZone( DateTime.FromText(_) , 0)), type datetimezone } }),

General housekeeping - fix up some types, rename the columns to something more meaningful and split the endDate into Date and Time to make the data easier to work with later (separate columns also compress better due to lower cardinality, but for these volumes this isn’t an issue):

    #"Change Types" = Table.TransformColumnTypes(#"Add Local Timezone",{ {"artistName", type text}, {"trackName", type text}, {"msPlayed", Int64.Type}}),
    #"Renamed Columns" = Table.RenameColumns(#"Change Types",{{"endTime", "End DateTime"}, {"artistName", "Artist Name"}, {"trackName", "Track Name"}, {"msPlayed", "Millis Played"}}),
    #"Add End Date" = Table.DuplicateColumn(#"Renamed Columns", "End DateTime", "End Date"),
    #"Add End Time" = Table.DuplicateColumn(#"Add End Date", "End DateTime", "End Time"),
    #"Changed Date Time Types" = Table.TransformColumnTypes(#"Add End Time",{{"End Time", type time}, {"End Date", type date}}),

Star Schemaifying - Preping for Fact & Dimension Queries

We’ll need to split the data into dimensions (Track, Artist, Date/Time) and facts (Streams). To achieve this, we’ll need to add surrogate keys for the dimensions. I’m going to use a compound key for the Track and Artist to identify the unique combination in order to generate an index. (Turns out I’m going to need this column in a visual later anyway.)

    #"Add Compound ArtistTrack Key" = Table.AddColumn(#"Changed Date Time Types", "Artist Track", each Text.Combine({[Artist Name], [Track Name]}, " - "), type text),

This may or may not be the best way to do this next bit.

To add a surrogate key for Artists and Tracks, I’m going to use Table.Distinct to get a list of unique Artist Track values, Table.AddIndexColumn to add the index and Table.NestedJoin to join the Artist Track column back to the original data. This will give me a Track Key column with the surrogate key for each unique Artist-Track.

        #"Distinct Artist Track" = Table.Distinct(Table.SelectColumns(#"Add Compound ArtistTrack Key", {"Artist Track"})),
        #"Sorted Rows" = Table.Sort(#"Distinct Artist Track",{{"Artist Track", Order.Ascending}}),
        #"Added Track Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 1, 1, Int64.Type),
    #"Merged Track Query" = Table.NestedJoin(#"Reordered Columns" , {"Artist Track"}, #"Added Track Index", {"Artist Track"}, "Track", JoinKind.LeftOuter),
    #"Expanded Track" = Table.ExpandTableColumn( #"Merged Track Query", "Track", {"Index"}, {"Track Key"}),

And a similar approach for the Artist dimension:

        #"Distinct Artist" = Table.Distinct(Table.SelectColumns(#"Reordered Columns", {"Artist Name"})),
        #"Sorted Artists" = Table.Sort(#"Distinct Artist",{{"Artist Name", Order.Ascending}}),
        #"Added Artist Index" = Table.AddIndexColumn(#"Sorted Artists", "Index", 1, 1, Int64.Type),
    #"Merged Artist Query" = Table.NestedJoin(#"Expanded Track" , {"Artist Name"}, #"Added Artist Index", {"Artist Name"}, "Artist", JoinKind.LeftOuter),
    #"Expanded Artist" = Table.ExpandTableColumn( #"Merged Artist Query", "Artist", {"Index"}, {"Artist Key"})

Final Query

Here’s the complete Power Query to load the StreamingHistory0.json file:

let
    Source = Json.Document(File.Contents( DataFolder  & "\my_spotify_data\MyData\StreamingHistory0.json")),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"endTime", "artistName", "trackName", "msPlayed"}, {"endTime", "artistName", "trackName", "msPlayed"}),
    #"Add Local Timezone" = Table.TransformColumns(#"Expanded Column1", { {"endTime", each DateTimeZone.ToLocal(DateTime.AddZone( DateTime.FromText(_) , 0)), type datetimezone } }),
    #"Change Types" = Table.TransformColumnTypes(#"Add Local Timezone",{ {"artistName", type text}, {"trackName", type text}, {"msPlayed", Int64.Type}}),
    #"Renamed Columns" = Table.RenameColumns(#"Change Types",{{"endTime", "End DateTime"}, {"artistName", "Artist Name"}, {"trackName", "Track Name"}, {"msPlayed", "Millis Played"}}),
    #"Add End Date" = Table.DuplicateColumn(#"Renamed Columns", "End DateTime", "End Date"),
    #"Add End Time" = Table.DuplicateColumn(#"Add End Date", "End DateTime", "End Time"),
    #"Changed Date Time Types" = Table.TransformColumnTypes(#"Add End Time",{{"End Time", type time}, {"End Date", type date}}),
    #"Add Compound ArtistTrack Key" = Table.AddColumn(#"Changed Date Time Types", "Artist Track", each Text.Combine({[Artist Name], [Track Name]}, " - "), type text),
    #"Reordered Columns" = Table.ReorderColumns(#"Add Compound ArtistTrack Key",{"End DateTime", "End Date", "End Time", "Artist Track", "Artist Name", "Track Name", "Millis Played"}),
        #"Distinct Artist Track" = Table.Distinct(Table.SelectColumns(#"Reordered Columns", {"Artist Track"})),
        #"Sorted Rows" = Table.Sort(#"Distinct Artist Track",{{"Artist Track", Order.Ascending}}),
        #"Added Track Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 1, 1, Int64.Type),
    #"Merged Track Query" = Table.NestedJoin(#"Reordered Columns" , {"Artist Track"}, #"Added Track Index", {"Artist Track"}, "Track", JoinKind.LeftOuter),
    #"Expanded Track" = Table.ExpandTableColumn( #"Merged Track Query", "Track", {"Index"}, {"Track Key"}),
        #"Distinct Artist" = Table.Distinct(Table.SelectColumns(#"Reordered Columns", {"Artist Name"})),
        #"Sorted Artists" = Table.Sort(#"Distinct Artist",{{"Artist Name", Order.Ascending}}),
        #"Added Artist Index" = Table.AddIndexColumn(#"Sorted Artists", "Index", 1, 1, Int64.Type),
    #"Merged Artist Query" = Table.NestedJoin(#"Expanded Track" , {"Artist Name"}, #"Added Artist Index", {"Artist Name"}, "Artist", JoinKind.LeftOuter),
    #"Expanded Artist" = Table.ExpandTableColumn( #"Merged Artist Query", "Artist", {"Index"}, {"Artist Key"})
in
    #"Expanded Artist"

Fact and Dimension Queries

Now we have a base query, we can create the queries for the fact and dimension tables quite simply by referencing our base query and removing the columns/rows we don’t need.

A Note on Reference vs Duplicate Queries

Regardless of whether you use a reference or duplicate the query, the query logic executes for each query. Reference queries are useful when you want to share the same query logic across multiple queries but there is no data caching (kinda, PQ caching is an art best not dabbled in).

See https://learn.microsoft.com/en-us/power-bi/guidance/power-query-referenced-queries

Track Dimension Query

let
    Source = StreamingHistory0,
    #"Removed Columns" = Table.RemoveColumns(Source,{"End DateTime", "End Date", "End Time", "Artist Name", "Millis Played"}),
    #"Removed Duplicates" = Table.Distinct(#"Removed Columns", {"Track Key"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Duplicates",{"Track Key", "Track Name", "Artist Track", "Artist Key"})
in
    #"Reordered Columns"

Artist Dimension Query

let
    Source = StreamingHistory0,
    #"Removed Columns" = Table.RemoveColumns(Source,{"End DateTime", "End Date", "End Time", "Track Name", "Millis Played", "Track Key", "Artist Track"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns",{"Artist Key", "Artist Name"}),
    #"Removed Duplicates" = Table.Distinct(#"Reordered Columns", {"Artist Key"})
in
    #"Removed Duplicates"

Stream Fact Query

let
    Source = StreamingHistory0,
    #"Removed Columns" = Table.RemoveColumns(Source,{"End DateTime", "Artist Name", "Track Name", "Artist Key"})
in
    #"Removed Columns"

Date and Time Dimensions

There are literally dozens of examples online of how to create date and time dimensions in Power Query. I used this example from Radacad.


Report Design

Vispiration

Not being a professional designer, report design is not my natural strong point. So some research for vispiration is in order. One of my favourite places to look for inspiration is r/dataisbeautiful and here are some of the examples I’ve drawn from:

https://www.reddit.com/r/dataisbeautiful/comments/wt7b3n/oc_most_streamed_artists_on_spotify_all_time/  I really like this vis, and it would be an interesting challenge to produce something similar in Deneb.

https://www.reddit.com/r/dataisbeautiful/comments/rpmxg5/oc_analysis_of_my_own_spotify_personal_data/ The Spotify colour palette is popular, and this is one of many examples of green, white and black.

https://www.reddit.com/r/dataisbeautiful/comments/z8xll5/every_year_for_spotify_wrapped_day_i_design_a/ Another nice use of artist images and a great colour palette but not the vibe I’m going for.

https://www.reddit.com/r/dataisbeautiful/comments/wi13wp/oc_the_most_streamed_spotify_songs_of_alltime/ Again, artist photos are my takeaway from this one.

Design Decisions

So from my brief research I’ve decided to incorporate:

  • Spotify inspired colour palette
  • Artist Images
  • Deneb Custom visuals inspired by the first example above.

Additional Data - Images

I’m going to need some images of my fav artists for my report. Time to iterate back to the data model.

The first idea I had was to automate a Bing Image Search for each artist name. Calling a REST Endpoint from Power Query is pretty straightforward, and the Bing Image Search API is well documented.

(query) =>
let
    url = "https://api.bing.microsoft.com/v7.0/images/search",
    options = [Headers=[#"Ocp-Apim-Subscription-Key"="<your-key>"], Query = [q = query, count="1"] ],
    result = Json.Document(Web.Contents(url , options)),
    value = result[value],
    value1 = value{0}[contentUrl]
in
    value1

However, I have around 2200 artists in my data set, and I didn’t fancy calling the API that many times (especially considering I probably wont need every artist image). Not to mention I’d hit my Search SKU API quota. That calls for plan B.

Plan B - Image Import

I’d already decided I would only be displaying top 5 results for artists, so I could just spend 10 mins and run a Bing search with the ‘public domain’ filter set to find suitable images. Files were resized and cropped to a circle in paint.net. Manual but effective.

Design is obviously iterative; my top 5 artists were discovered during initial data exploration.

In order to include these in my dataset, I went with Chris Webb’s approach (see this blog post on including images in Power BI datasets). The limit of 2M chars per DAX measure wouldn’t be an issue as my images will be less than 100K.

Here’s my very slightly modified version of Chris’s code:

let
    //Get list of files in folder
    Source = Folder.Files(DataFolder& "\Artist Images"),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.EndsWith([Name], ".png")),
    //Remove unnecessary columns
    RemoveOtherColumns = Table.SelectColumns(#"Filtered Rows",{"Content", "Name"}),
    //Creates Splitter function
    SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000),
    //Converts table of files to list
    ListInput = Table.ToRows(RemoveOtherColumns),
    //Function to convert binary of photo to multiple
    //text values
    ConvertOneFile = (InputRow as list) =>
        let
            BinaryIn = InputRow{0},
            FileName = InputRow{1},
            BinaryText = Binary.ToText(BinaryIn, BinaryEncoding.Base64),
            SplitUpText = SplitTextFunction(BinaryText),
            AddFileName = List.Transform(SplitUpText, each {FileName,_})
        in
            AddFileName,
    //Loops over all photos and calls the above function
    ConvertAllFiles = List.Transform(ListInput, each ConvertOneFile(_)),
    //Combines lists together
    CombineLists = List.Combine(ConvertAllFiles),
    //Converts results to table
    ToTable = #table(type table[Name=text,Pic=text],CombineLists),
    //Adds index column to output table
    AddIndexColumn = Table.AddIndexColumn(ToTable, "Index", 0, 1),
    #"Duplicated Column" = Table.DuplicateColumn(AddIndexColumn, "Name", "Name - Copy"),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Duplicated Column", "Name - Copy", Splitter.SplitTextByEachDelimiter({"."}, QuoteStyle.Csv, true), {"Name - Copy.1", "Name - Copy.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Name - Copy.1", type text}, {"Name - Copy.2", type text}}),
    #"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"Name - Copy.1", "Artist Name"}}),
    #"Removed Columns" = Table.RemoveColumns(#"Renamed Columns",{"Name - Copy.2"})
in
    #"Removed Columns"

And the resulting imported images:


Final Data Model

After this iteration, the final data model looks like this:


Fast Forward - The Final Result

After some initial playing with the data, I settled on a simple (and pretty bog standard) design, ending up with this.

Convention says to place visuals top left to bottom right in order of importance. I thought I’d try something slightly different by using BANs in the less-important quarter hoping it would balance the natural tendency to ignore this corner. Does it work? Maybe.

The custom Deneb visuals took a little time to get almost-right, which is probably why I wanted them on the left-hand side. Yes, I know, it’s def a case of “I spent ages building this vis, so it’s going in the most prominent position”.

Let’s take the report apart.

The DAX

There’s very little DAX in this report. Most measures are counts or sum. The only one of minor interest takes the Stream Time Millis field and converts it to hours.mintues instead of decimal hours (1 hour 30 mins = 1.30 not 1.5 in other words) mostly because this makes more sense to me when displaying (although maybe 1:30 would be better still).

VAR decimalHours = [Stream Time Millis] / ( 60 * 60 * 1000 )
VAR hours =
    TRUNC ( decimalHours )
VAR decMins = ( decimalHours - hours )
RETURN
    hours + 0.6 * decMins

The Visuals

There are two Deneb visuals in this report. The first is a radial plot, and the second is the catchy titled “combo image bar stats chart”. All the others are standard Power BI visuals.

I also created a custom colour theme to “F to pay respects” to the Spotify look.

The Blobby Radial Plot

Well, I don’t know what else to call it. This one is straight forwards, Day Name from the Time dimension and Stream Time Hours from the Streams table.

{
  "data": {"name": "dataset"},
  "layer": [
    {
      "mark": {
        "type": "arc",
        "cornerRadius": 100,
        "padAngle": 0.02
      }
    },
    {
      "mark": {"type": "text", "radiusOffset": 20,
        "fontSize": 10},
      "encoding": {
        "text": {"field": "Day Name"},
        "color": {"value":"#FFFFFF"}
      }
    }
    ],
    "encoding": {
      "theta": {"field": "Day Name","stack": true, "sort":null },
      "radius": {"field": "Stream Time Hours", "scale": {
        "type": "sqrt", "zero": true, "rangeMin": 20}
      },
      "color": {"field": "Day Name","sort":null, "scale": {"scheme": "pbiColorNominal"}, "legend": null }
    }
}

I should probably go back and add cross-filtering and highlighting at some point.

The Summary Combo Image Bar Stats Chart

The awesomely titled vis is really an exercise in self-indulgent study of Deneb. I’m still very much at the “How on earth do you make this work?” stage, so this is not all a case study in Vega-lite best practice. It’s more an exercise in not admitting defeat whilst climbing that learning curve.

Marks:

  • Bar: The main green bar
  • Image: The artist image axis
  • Text: Artist name label
  • Text: Stream time label
  • Text: Stream count label

The image axis was based on a Stack Overflow answer which seemed to work for the the text label positioning too. There’s probably a more elegant way to do this, and positioning of elements relative to band height especially needs fixing to make the vis responsive.

  "data": {"name": "dataset"},
  "layer": [
    {
      "mark": {
        "type": "bar",
        "tooltip": true,
        "cornerRadius": 10,
        "height": {"band": 0.3},
        "yOffset": 12
      }
    },
    {
      "transform": [
        {
          "calculate": "-6",
          "as": "axis"
        }
      ],
      "mark": {
        "type": "image",
        "height": 40,
        "width": 40
      },
      "encoding": {
        "x": {
          "field": "axis",
          "type": "quantitative"
        },
        "url": {
          "field": "Artist Image",
          "type": "nominal"
        }
      }
    },
    {
      "transform": [
        {"calculate": "0", "as": "axis"}
      ],
      "mark": {
        "type": "text",
        "align": "left"
      },
      "encoding": {
        "text": {
          "field": "Artist Name"
        },
        "x": {
          "field": "axis",
          "type": "quantitative"
        }
      }
    },
    {
      "transform": [
        {
          "calculate": "25",
          "as": "axis"
        },
        {
      "calculate": " format(datum['Stream Time Hours'],'.2f') + ' hours'",
      "as": "textValue"
    }
      ],
      "mark": {
        "type": "text",
        "align": "left",
        "color": {"expr": "pbiColor(1)"}
      },
      "encoding": {
        "text": {
          "field": "textValue"
        },
        "x": {
          "field": "axis",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "text",
        "fontSize": 18,
        "fontWeight": "bold",
        "align": "left",
        "dy": 10,
        "dx": 5
      },
      "encoding": {
        "text": {
          "field": "Count Streams"
        }
      }
    }
  ],
  "encoding": {
    "y": {
      "field": "Artist Name",
      "type": "nominal",
      "axis": null,
      "sort": null
    },
    "x": {
      "field": "Count Streams",
      "type": "quantitative",
      "axis": null
    }
  },
  "resolve": {"scale": {"x": "shared"}}
}

As with the radial plot, I should probably go back and add cross-filtering and highlighting.


Storytime

Finally I added a little commentary to the report, reflecting the data points that jumped out at me.

Biggest take-away? I really shouldn’t troll my partner so much with TikTok meme sound tracks. They don’t even own a Subuwu.

Unwrapped Wrapped Up

This was an enjoyable hands-on side project with some interesting learning along the way. It’s not often I get the chance to do a complete end-to-end data prep, exploratory, visualisation and report building exercise just for funsies.

And Kylie and Taylor Swift are my most streamed artists? Who knew?