HyperAIHyperAI

Command Palette

Search for a command to run...

il y a un an

Music SketchNet : Génération de musique contrôlable via des représentations factorisées de la hauteur et du rythme

Ke Chen Cheng-i Wang Taylor Berg-Kirkpatrick Shlomo Dubnov

Déploiement en un clic de DiffRhythm : Générez une démo musicale complète en 1 minute

20 heures de calcul sur RTX 5090 pour seulement $1 (valeur $7)
Aller à Notebook

Résumé

En tirant une analogie avec les systèmes de complétion d’images automatiques, nous proposons Music SketchNet, un cadre de réseau neuronal permettant aux utilisateurs de spécifier des idées musicales partielles guidant la génération automatique de musique. Nous nous concentrons sur la génération des mesures manquantes dans des pièces musicales monophoniques incomplètes, conditionnées par le contexte environnant, et éventuellement guidées par des extraits de hauteur et de rythme spécifiés par l’utilisateur. Premièrement, nous introduisons SketchVAE, un autoencodeur variationnel novateur qui factorise explicitement le rythme et le contour de la hauteur pour constituer la base de notre modèle proposé. Ensuite, nous présentons deux architectures discriminatives, SketchInpainter et SketchConnector, qui, conjointement, assurent la complétion musicale guidée, en remplissant les représentations des mesures manquantes conditionnées par le contexte environnant et les extraits spécifiés par l’utilisateur. Nous évaluons SketchNet sur un jeu de données standard de musique folklorique irlandaise et le comparons aux modèles issus de travaux récents. Lorsqu’il est utilisé pour la complétion musicale, notre approche surpasse l’état de l’art tant en termes de métriques objectives que de tests d’écoute subjectifs. Enfin, nous démontrons que notre modèle peut intégrer avec succès les extraits spécifiés par l’utilisateur pendant le processus de génération.

One-sentence Summary

Music SketchNet is a controllable music generation framework that employs the SketchVAE to factorize pitch contours and rhythm, while the SketchInpainter and SketchConnector architectures guide the completion of missing measures in monophonic pieces conditioned on surrounding context and optional user-specified snippets, achieving state-of-the-art performance on a standard Irish folk music dataset across both objective metrics and subjective listening tests.

Key Contributions

  • The paper introduces Music SketchNet, a neural network framework that generates missing measures in incomplete monophonic pieces by conditioning on surrounding context and optional user-specified pitch and rhythm sketches.
  • The architecture employs SketchVAE to explicitly factorize latent representations into decoupled pitch contour and rhythm components, which are processed by SketchInpainter and SketchConnector modules to integrate user inputs during generation.
  • Evaluations on a standard Irish folk music dataset demonstrate that the framework outperforms recent state-of-the-art models in both objective metrics and subjective listening tests while successfully accommodating optional user specifications.

Introduction

Neural networks have significantly advanced automatic music generation, yet enabling intuitive user control remains critical for practical creative applications. Prior conditional generation methods typically require users to provide complete musical tracks or rely on basic note constraints, while existing inpainting techniques lack explicit user guidance. The authors leverage a sketching paradigm adapted from computer vision to introduce Music SketchNet, a framework that decouples musical measures into distinct pitch contour and rhythm latent variables using a factorized variational autoencoder. This architecture allows creators to specify partial musical ideas, which the model seamlessly integrates with context-aware predictions to produce highly controllable compositions.

Method

The proposed framework, Music SketchNet, is designed to enable guided music generation by completing missing measures in monophonic musical pieces based on surrounding context and optional user-specified pitch and rhythm snippets. The architecture is built upon three core components: SketchVAE, SketchInpainter, and SketchConnector, which together form a hierarchical system for encoding, predicting, and refining latent representations of music.

At the foundation of the system is the SketchVAE, which serves as a factorized variational autoencoder for music representation. It explicitly separates pitch and rhythm information into distinct latent dimensions, zpitchz_{pitch}zpitch and zrhythmz_{rhythm}zrhythm, allowing for independent control and manipulation. The SketchVAE processes a music measure by first encoding it into two separate token sequences: xpitchx_{pitch}xpitch, which contains note values with padding to fill a 24-frame measure, and xrhythmx_{rhythm}xrhythm, which captures duration and onset information by replacing pitch events with rhythm-specific tokens. These sequences are then independently encoded using two separate GRU-based encoders—QθQ_{\theta}Qθ for pitch and QτQ_{\tau}Qτ for rhythm—whose outputs are concatenated to form the full latent variable z=[zpitch,zrhythm]z = [z_{pitch}, z_{rhythm}]z=[zpitch,zrhythm]. The hierarchical decoder PϕP_{\phi}Pϕ reconstructs the original music from this latent representation by first decoding the measure into beats using a "beat" GRU layer, and then decomposing each beat into individual ticks using a "tick" GRU layer. This two-stage decoding process ensures that the output is generated in a musically intuitive manner. Refer to the framework diagram for a visual overview of the SketchVAE structure, including its encoder and hierarchical decoder.

The SketchInpainter component is responsible for generating initial predictions for the missing measures by leveraging the surrounding musical context. It takes as input the latent sequences from the past and future contexts, ZpZ^{p}Zp and ZfZ^{f}Zf, and processes them separately for pitch and rhythm using two independent GRU groups. The final hidden states from these encoders, ht1h_t^1ht1 and ht3h_t^3ht3, are used as initial states for two separate generation GRUs—the pitch generation GRUs and rhythm generation GRUs. These generation GRUs predict the missing latent representations, Sm=(Spitch,Srhythm)S^{m} = (S_{pitch}, S_{rhythm})Sm=(Spitch,Srhythm), in an auto-regressive manner, conditioned on the contextual information. The predicted latent sequences are then fed into the SketchVAE decoder to produce a music output, which is used to compute the cross-entropy loss during training. This initial prediction step establishes a strong baseline for the missing music material.

To incorporate user guidance and refine the initial prediction, the SketchConnector module is introduced. This component modifies the latent predictions from the SketchInpainter by integrating user-specified sketching information, denoted as CCC. The process begins by concatenating the predicted SmS^{m}Sm with the past and future latent sequences, followed by a random unmasking procedure where a portion of the predicted latent variables (set at a 0.3 rate) are replaced with ground-truth values, simulating user-provided context. This mechanism, which is analogous to BERT-style training but with unmasking rather than masking, encourages the model to learn stronger correlations between adjacent musical elements. The unmasked data and the predicted SmS^{m}Sm are then fed into a transformer encoder with absolute positional encoding, which produces the final refined latent representation ZmZ^{m}Zm for the missing measures. This refined latent variable is subsequently decoded by the SketchVAE to generate the final music output. As shown in the figure below, the SketchConnector acts as a final refinement step, ensuring that the generated music aligns with both the contextual cues and the user's specified sketching input .

Experiment

The evaluation compares SketchNet against established baselines using objective generation metrics, subjective human listening tests, and an interactive control scenario to validate its inpainting capabilities. Objective and subjective results demonstrate that the model consistently outperforms competitors in pitch accuracy and overall musicality, effectively capturing repetitive patterns while maintaining contextual coherence in non-repetitive passages. Human listeners particularly appreciate the improved structural integrity and harmonic quality of the generated melodies, despite similar note complexity. Additionally, interactive experiments confirm that the system reliably follows user-specified pitch and rhythm sketches, proving its effectiveness as a highly controllable music generation tool.

The authors evaluate the SketchNet model by comparing it with baseline systems, focusing on pitch and rhythm accuracy in generated melodies. The experiment includes a virtual control test to assess how user-specified rhythm and pitch information influence the model's output. Results show that the model achieves high accuracy in following user control for both rhythm and pitch, with better performance in pitch following. SketchNet demonstrates superior performance compared to baseline models in generating melodies with accurate pitch and rhythm. The model effectively follows user-specified rhythm and pitch controls, with high accuracy in both cases. User control significantly influences the generated output, as shown by high accuracy in matching specified rhythm and pitch patterns.

The authors compare SketchNet with Music InpaintNet and other baselines, evaluating performance through objective metrics and subjective listening tests. Results show that SketchNet outperforms other models in generating melodies, particularly in structure and overall musicality, while maintaining comparable complexity. The model also demonstrates effective control over pitch and rhythm in interactive scenarios. SketchNet achieves higher structure and musicality scores compared to InpaintNet in subjective evaluations. The model shows improved performance over baselines in both repetition and non-repetition test subsets. User control over pitch and rhythm leads to targeted generation, with high accuracy in following specified inputs.

The authors evaluate SketchNet against several baselines, including Music InpaintNet and variations of SketchVAE, across different test subsets. Results show that SketchNet consistently outperforms other models in terms of both accuracy and loss, particularly in non-repetition scenarios where copying is not viable. The model demonstrates strong control over generated melodies through user-specified pitch and rhythm cues. SketchNet achieves the best performance across all test subsets, outperforming all baseline models in both accuracy and loss metrics. Performance improvements are more pronounced in pitch accuracy than rhythm accuracy, especially in non-repetition scenarios. User-controlled pitch and rhythm inputs significantly influence the generated output, with high accuracy in following specified controls.

The evaluation compares SketchNet against multiple baseline systems using objective metrics and subjective listening tests across repetition and non-repetition scenarios to validate its capacity for generating musically coherent melodies under user-defined pitch and rhythm constraints. Results consistently indicate that SketchNet surpasses competing approaches in structural integrity and overall musicality while maintaining precise adherence to user-specified controls. The model proves particularly effective in interactive generation tasks, where user cues significantly shape the output and yield stronger pitch alignment than rhythm tracking, especially within complex non-repetition contexts.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp