How Convolutional Neural Networks Learn Musical Similarity Through Audio Embeddings for Smart Music Recommendations
Convolutional Neural Networks (CNNs) play a key role in learning musical similarity by transforming raw audio into meaningful representations that capture the essence of music. Streaming platforms like Spotify and Apple Music rely on these representations, called audio embeddings, to power personalized song recommendations. Unlike traditional methods that depend on metadata or user behavior, audio embeddings let systems understand music based on its actual sound, allowing for more accurate and nuanced recommendations.

The process begins by converting raw audio files, such as MP3s, into mel-spectrograms: 2D representations of the frequency content of sound over time, scaled to match human auditory perception. The x-axis shows time, the y-axis shows mel-scaled frequency bands, and the intensity of each pixel reflects the energy in that frequency band at that moment. Brighter areas indicate louder or more active sounds, such as sustained vocals or strings, while sharp vertical lines often correspond to percussive hits like snare drums.

Instead of training on entire songs, the model processes small, randomly sampled chunks of these spectrograms, typically 128 time steps by 129 frequency bands. This ensures the network learns local musical features like timbre, rhythm, and texture rather than being biased toward specific moments in a track. Because chunks are drawn at random, the model sees different parts of the same song across training epochs, reducing overfitting and improving generalization.

To train the model without labels, a contrastive learning strategy is employed. The key idea is to teach the model that two differently augmented versions of the same audio chunk should have very similar embeddings, while embeddings from different songs should be distinct. This is achieved with the InfoNCE loss, which pulls similar audio representations closer together in the embedding space while pushing dissimilar ones apart.
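The random chunk sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the system's actual code: the function name and zero-padding behavior are assumptions, and in practice the input array would come from a real spectrogram computation (e.g. librosa's `feature.melspectrogram`) rather than random data.

```python
import numpy as np

def sample_chunk(spectrogram: np.ndarray, chunk_frames: int = 128, rng=None) -> np.ndarray:
    """Randomly sample a fixed-width chunk of time frames from a spectrogram.

    spectrogram: array of shape (n_bands, n_frames), e.g. (129, n_frames)
    Returns an array of shape (n_bands, chunk_frames).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_frames = spectrogram.shape[1]
    if n_frames < chunk_frames:
        # Pad clips shorter than one chunk with zeros (silence) on the right.
        spectrogram = np.pad(spectrogram, ((0, 0), (0, chunk_frames - n_frames)))
        n_frames = chunk_frames
    start = rng.integers(0, n_frames - chunk_frames + 1)
    return spectrogram[:, start:start + chunk_frames]

# Each call (e.g. once per epoch) draws a different part of the same track.
spec = np.random.rand(129, 1000)  # stand-in for a real mel-spectrogram
chunk = sample_chunk(spec)
print(chunk.shape)  # (129, 128)
```

Sampling a fresh offset each epoch is what lets one song contribute many distinct training examples over time.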
Each batch of 8 audio chunks is augmented with small random noise, creating two slightly different views of the same data. The model generates embeddings for both views, which are then L2-normalized. The similarity between each embedding in the first view and every embedding in the second is computed using cosine similarity, scaled by a temperature parameter. The loss is a softmax cross-entropy in which the correct match (the other view of the same chunk) appears in the numerator and all other similarities are summed in the denominator. Minimizing this loss shapes the embedding space so that similar sounds cluster together.

The CNN architecture is designed to extract hierarchical features from the mel-spectrograms. The first layer applies 32 small filters to detect basic patterns like note onsets or transient sounds; batch normalization and max pooling stabilize training and reduce sensitivity to small shifts. The second layer increases the filter count to 64, allowing the model to detect more complex structures like rhythmic loops or consistent timbral qualities. The third layer uses 128 filters to capture high-level features such as overall spectral balance or instrument-like textures.

After the three convolutional blocks, global average pooling reduces each feature map to a single value, summarizing the presence of key patterns regardless of their position. A dense layer then maps this summary into a 128-dimensional embedding vector. Finally, the embedding is normalized to lie on a unit sphere, enabling efficient comparison with cosine similarity.

To evaluate the quality of the learned embeddings, dimensionality reduction techniques like PCA and t-SNE are applied. PCA reveals the global structure of the embedding space, showing that genres are not rigidly separated but form a continuous, smooth manifold, indicating that the model captures subtle variations in music.
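The InfoNCE computation described above (L2 normalization, temperature-scaled cosine similarities, softmax cross-entropy over the batch) can be sketched in plain NumPy. The temperature value and noise scale below are illustrative defaults, not values taken from the text:

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss between two batches of embeddings.

    z1, z2: (batch, dim) embeddings of two augmented views of the same chunks;
    row i of z1 is the positive match for row i of z2.
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    # Pairwise cosine similarities, scaled by temperature.
    logits = z1 @ z2.T / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    # Log-softmax over each row; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 128))
noisy = z + 0.01 * rng.normal(size=z.shape)     # second view: small random noise
loss_matched = info_nce_loss(z, noisy)
loss_random = info_nce_loss(z, rng.normal(size=(8, 128)))
print(loss_matched < loss_random)  # True: matched views score a much lower loss
```

A trained encoder would produce the `z1`/`z2` embeddings; the loss itself is the same regardless of the network that feeds it.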
t-SNE highlights local neighborhoods, showing that songs of the same genre tend to cluster together even when they overlap with other genres. This suggests the embeddings are effective for both global and local similarity tasks.

These embeddings can power a practical recommendation system. For example, a simple web app can take an uploaded MP3, generate its mel-spectrogram, extract an embedding, and retrieve the most similar tracks by cosine similarity. Precomputed embeddings from a dataset like FMA Small allow fast, real-time recommendations. The system can aggregate chunk-level embeddings into a single song-level representation, making it robust to variations in track length.

In real-world applications, such audio embeddings are combined with collaborative filtering and other ranking models to create hybrid recommendation systems. Audio embeddings capture what songs sound like, while collaborative filtering captures what users like. Together, they balance acoustic similarity and personal preference, leading to more relevant and engaging recommendations.
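The retrieval step, aggregating chunk embeddings into one song-level vector and ranking a catalog by cosine similarity, can be sketched as follows. All names and shapes here are illustrative assumptions; a real system would load precomputed catalog embeddings (e.g. from FMA Small) rather than generate random ones:

```python
import numpy as np

def recommend(query_chunks: np.ndarray, catalog: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return indices of the top_k catalog tracks most similar to the query.

    query_chunks: (n_chunks, dim) chunk-level embeddings of the uploaded track
    catalog: (n_tracks, dim) precomputed song-level embeddings, L2-normalized
    """
    # Aggregate chunk embeddings by averaging, then re-normalize so that
    # cosine similarity against the catalog reduces to a dot product.
    song = query_chunks.mean(axis=0)
    song = song / np.linalg.norm(song)
    scores = catalog @ song                   # cosine similarity per track
    return np.argsort(scores)[::-1][:top_k]   # highest similarity first

rng = np.random.default_rng(1)
catalog = rng.normal(size=(100, 128))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)
# Simulate an upload whose chunks sit near catalog track 7.
query = catalog[7] + 0.05 * rng.normal(size=(4, 128))
print(recommend(query, catalog))  # track 7 ranks first
```

Averaging chunk embeddings before ranking is what makes the comparison insensitive to track length: a 2-minute and a 10-minute song each collapse to a single unit vector.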
