Music SketchNet: Controllable Music Generation via Factorized Representations of Pitch and Rhythm
Ke Chen, Cheng-i Wang, Taylor Berg-Kirkpatrick, Shlomo Dubnov
Abstract
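Drawing an analogy with automatic image completion systems, we propose Music SketchNet, a neural network framework that allows users to specify partial musical ideas to guide automatic music generation. We focus on generating the missing measures in incomplete monophonic musical pieces, conditioned on the surrounding context and optionally guided by user-specified pitch and rhythm snippets. First, we introduce SketchVAE, a novel variational autoencoder that explicitly factorizes rhythm and pitch contour and forms the basis of the proposed model. We then introduce two discriminative architectures, SketchInpainter and SketchConnector, which in conjunction perform guided music completion by filling in representations for the missing measures conditioned on the surrounding context and the user-specified snippets. We evaluate SketchNet on a standard dataset of Irish folk music and compare it with models reported in recent papers. When used for music completion, our approach outperforms the state of the art on both objective metrics and subjective listening tests. Finally, we demonstrate that the model can successfully incorporate user-specified snippets during the generation process.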
One-sentence Summary
Music SketchNet is a controllable music generation framework that employs the SketchVAE to factorize pitch contours and rhythm, while the SketchInpainter and SketchConnector architectures guide the completion of missing measures in monophonic pieces conditioned on surrounding context and optional user-specified snippets, achieving state-of-the-art performance on a standard Irish folk music dataset across both objective metrics and subjective listening tests.
Key Contributions
- The paper introduces Music SketchNet, a neural network framework that generates missing measures in incomplete monophonic pieces by conditioning on surrounding context and optional user-specified pitch and rhythm sketches.
- The architecture employs SketchVAE to explicitly factorize latent representations into decoupled pitch contour and rhythm components, which are processed by SketchInpainter and SketchConnector modules to integrate user inputs during generation.
- Evaluations on a standard Irish folk music dataset demonstrate that the framework outperforms recent state-of-the-art models in both objective metrics and subjective listening tests while successfully accommodating optional user specifications.
Introduction
Neural networks have significantly advanced automatic music generation, yet enabling intuitive user control remains critical for practical creative applications. Prior conditional generation methods typically require users to provide complete musical tracks or rely on basic note constraints, while existing inpainting techniques lack explicit user guidance. The authors leverage a sketching paradigm adapted from computer vision to introduce Music SketchNet, a framework that decouples musical measures into distinct pitch contour and rhythm latent variables using a factorized variational autoencoder. This architecture allows creators to specify partial musical ideas, which the model seamlessly integrates with context-aware predictions to produce highly controllable compositions.
Method
The proposed framework, Music SketchNet, is designed to enable guided music generation by completing missing measures in monophonic musical pieces based on surrounding context and optional user-specified pitch and rhythm snippets. The architecture is built upon three core components: SketchVAE, SketchInpainter, and SketchConnector, which together form a hierarchical system for encoding, predicting, and refining latent representations of music.
At the foundation of the system is SketchVAE, a factorized variational autoencoder for music representation. It explicitly separates pitch and rhythm information into distinct latent variables, z_pitch and z_rhythm, allowing independent control and manipulation. SketchVAE processes a measure by first converting it into two token sequences: x_pitch, which contains the note values, padded to fill a 24-frame measure, and x_rhythm, which captures duration and onset information by replacing pitch events with rhythm-specific tokens. The two sequences are encoded independently by separate GRU-based encoders (Q_θ for pitch and Q_τ for rhythm), whose outputs are concatenated to form the full latent variable z = [z_pitch, z_rhythm]. The hierarchical decoder P_ϕ reconstructs the original measure from this latent representation by first decoding the measure into beats with a "beat" GRU layer, and then expanding each beat into individual ticks with a "tick" GRU layer, so the output is generated from coarse to fine along the metrical structure. Refer to the framework diagram for a visual overview of the SketchVAE structure, including its encoders and hierarchical decoder.
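The pitch/rhythm split described above can be illustrated with a small, dependency-free sketch. It is not the authors' preprocessing code: the token names (`HOLD`, `REST`, `ONSET`, `PAD`), the frame-level input format, and the helper names `factorize_measure` and `recombine` are all assumptions chosen for illustration. Only the 24-frame measure length and the idea of replacing pitch events with rhythm tokens come from the description.

```python
# Illustrative token names; the exact vocabulary is an assumption.
HOLD, REST, ONSET, PAD = "hold", "rest", "onset", "pad"
FRAMES_PER_MEASURE = 24  # measure length stated in the text


def factorize_measure(frames):
    """Split one measure into (x_pitch, x_rhythm) token sequences.

    `frames` holds one token per frame: an int MIDI pitch at a note onset,
    HOLD while the note sustains, or REST for silence (assumed encoding).
    """
    if len(frames) != FRAMES_PER_MEASURE:
        raise ValueError(f"expected {FRAMES_PER_MEASURE} frames, got {len(frames)}")
    # Pitch sequence: the onset pitches in order, padded to the measure length.
    pitches = [tok for tok in frames if isinstance(tok, int)]
    x_pitch = pitches + [PAD] * (FRAMES_PER_MEASURE - len(pitches))
    # Rhythm sequence: every pitch collapses to ONSET; HOLD and REST are kept.
    x_rhythm = [ONSET if isinstance(tok, int) else tok for tok in frames]
    return x_pitch, x_rhythm


def recombine(x_pitch, x_rhythm):
    """Rebuild the frames by assigning the i-th pitch to the i-th onset."""
    pitches = iter(tok for tok in x_pitch if tok != PAD)
    return [next(pitches) if tok == ONSET else tok for tok in x_rhythm]


# Three notes (C4, D4, E4) with a six-frame rest before the last one.
measure = [60] + [HOLD] * 5 + [62] + [HOLD] * 5 + [REST] * 6 + [64] + [HOLD] * 5
x_pitch, x_rhythm = factorize_measure(measure)
```

Because the two sequences together determine the original frames, pairing `x_pitch` from one measure with `x_rhythm` from another that has the same number of onsets yields a new measure. That is the kind of independent manipulation the factorized latent variables are meant to support.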
The SketchInpainter component generates initial predictions for the missing measures by leveraging the surrounding musical context. It takes as input the latent sequences from the past and future contexts, Z_p and Z_f, and processes them separately for pitch and rhythm using two independent GRU groups. The final hidden states from these encoders, ht1 and ht3, are used as initial states for two separate sets of generation GRUs, one for pitch and one for rhythm. These generation GRUs predict the missing latent representations, S_m = (S_pitch, S_rhythm), in an auto-regressive manner, conditioned on the contextual information. The predicted latent sequences are then fed into the SketchVAE decoder to produce a music output, which is used to compute the cross-entropy loss during training. This initial prediction step establishes a strong baseline for the missing music material.
To incorporate user guidance and refine the initial prediction, the SketchConnector module is introduced. This component modifies the latent predictions from the SketchInpainter by integrating user-specified sketching information, denoted as C. The process begins by concatenating the predicted S_m with the past and future latent sequences, followed by a random unmasking procedure in which a portion of the predicted latent variables (at a rate of 0.3) is replaced with ground-truth values, simulating user-provided context. This mechanism, which is analogous to BERT-style training but with unmasking rather than masking, encourages the model to learn stronger correlations between adjacent musical elements. The unmasked data and the predicted S_m are then fed into a transformer encoder with absolute positional encoding, which produces the final refined latent representation Z_m for the missing measures. This refined latent variable is subsequently decoded by the SketchVAE to generate the final music output. As the SketchConnector figure shows, the module acts as a final refinement step, ensuring that the generated music aligns with both the contextual cues and the user's specified sketching input.
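The random unmasking step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' training code: the helper name `random_unmask` is hypothetical, each list item stands in for one predicted latent vector, and every position is revealed independently with the stated probability of 0.3.

```python
import random


def random_unmask(predicted, ground_truth, rate=0.3, rng=None):
    """Replace a random subset of predicted items with ground-truth items.

    Returns the mixed sequence and a boolean list marking revealed positions.
    Each position is revealed independently with probability `rate`
    (an assumption about how the 0.3 rate is applied).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if rng is None:
        rng = random.Random()
    mixed, revealed = [], []
    for pred, truth in zip(predicted, ground_truth):
        reveal = rng.random() < rate
        mixed.append(truth if reveal else pred)
        revealed.append(reveal)
    return mixed, revealed


# Placeholder strings stand in for latent vectors of the missing measures.
predicted = [f"pred_{i}" for i in range(8)]
ground_truth = [f"true_{i}" for i in range(8)]
mixed, revealed = random_unmask(predicted, ground_truth, rate=0.3, rng=random.Random(0))
```

At inference time the revealed positions would presumably hold the user's sketch instead of ground truth, which is why training with randomly revealed targets simulates user-provided context.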
Experiment
The evaluation compares SketchNet against established baselines using objective generation metrics, subjective human listening tests, and an interactive control scenario to validate its inpainting capabilities. Objective and subjective results demonstrate that the model consistently outperforms competitors in pitch accuracy and overall musicality, effectively capturing repetitive patterns while maintaining contextual coherence in non-repetitive passages. Human listeners particularly appreciate the improved structural integrity and harmonic quality of the generated melodies, despite similar note complexity. Additionally, interactive experiments confirm that the system reliably follows user-specified pitch and rhythm sketches, proving its effectiveness as a highly controllable music generation tool.
The authors evaluate SketchNet against baseline systems, focusing on the pitch and rhythm accuracy of the generated melodies. The experiment also includes a virtual control test that assesses how user-specified rhythm and pitch information influences the model's output. SketchNet outperforms the baselines on both pitch and rhythm accuracy. In the control test, the generated output matches the specified rhythm and pitch patterns with high accuracy in both cases, with pitch followed more closely than rhythm, which indicates that user control significantly shapes what the model generates.
The authors compare SketchNet with Music InpaintNet and other baselines using objective metrics and subjective listening tests. In the subjective evaluation, SketchNet scores higher than InpaintNet on structure and overall musicality while maintaining comparable complexity. On the objective side, it improves over the baselines on both the repetition and non-repetition test subsets. In interactive scenarios, user control over pitch and rhythm leads to targeted generation, with high accuracy in following the specified inputs.
The authors evaluate SketchNet against several baselines, including Music InpaintNet and variations of SketchVAE, across different test subsets. SketchNet achieves the best accuracy and loss on every subset. The gains are larger for pitch accuracy than for rhythm accuracy, especially in non-repetition scenarios, where copying from the context is not viable. User-specified pitch and rhythm cues significantly influence the generated output, and the model follows them with high accuracy.
The evaluation compares SketchNet against multiple baseline systems using objective metrics and subjective listening tests across repetition and non-repetition scenarios to validate its capacity for generating musically coherent melodies under user-defined pitch and rhythm constraints. Results consistently indicate that SketchNet surpasses competing approaches in structural integrity and overall musicality while maintaining precise adherence to user-specified controls. The model proves particularly effective in interactive generation tasks, where user cues significantly shape the output and yield stronger pitch alignment than rhythm tracking, especially within complex non-repetition contexts.