One year ago

Music SketchNet: Controllable Music Generation via Factorized Representations of Pitch and Rhythm

Ke Chen Cheng-i Wang Taylor Berg-Kirkpatrick Shlomo Dubnov


Abstract

Drawing an analogy with automatic image completion systems, we propose Music SketchNet, a neural network framework that allows users to specify partial musical ideas that guide automatic music generation. We focus on generating the missing measures in incomplete monophonic musical pieces, conditioned on the surrounding context and optionally guided by user-specified pitch and rhythm snippets. First, we introduce SketchVAE, a novel variational autoencoder that explicitly factorizes rhythm and pitch contour to form the basis of the proposed model. We then introduce two discriminative architectures, SketchInpainter and SketchConnector, that work together to perform guided music completion, filling in representations for the missing measures conditioned on the surrounding context and the user-specified snippets. We evaluate SketchNet on a standard dataset of Irish folk music and compare it with models from recent work. When used for music completion, our approach outperforms the state of the art in both objective metrics and subjective listening tests. Finally, we demonstrate that our model can successfully incorporate user-specified snippets during the generation process.

One-sentence Summary

Music SketchNet is a controllable music generation framework that employs the SketchVAE to factorize pitch contours and rhythm, while the SketchInpainter and SketchConnector architectures guide the completion of missing measures in monophonic pieces conditioned on surrounding context and optional user-specified snippets, achieving state-of-the-art performance on a standard Irish folk music dataset across both objective metrics and subjective listening tests.

Key Contributions

  • The paper introduces Music SketchNet, a neural network framework that generates missing measures in incomplete monophonic pieces by conditioning on surrounding context and optional user-specified pitch and rhythm sketches.
  • The architecture employs SketchVAE to explicitly factorize latent representations into decoupled pitch contour and rhythm components, which are processed by SketchInpainter and SketchConnector modules to integrate user inputs during generation.
  • Evaluations on a standard Irish folk music dataset demonstrate that the framework outperforms recent state-of-the-art models in both objective metrics and subjective listening tests while successfully accommodating optional user specifications.

Introduction

Neural networks have significantly advanced automatic music generation, yet enabling intuitive user control remains critical for practical creative applications. Prior conditional generation methods typically require users to provide complete musical tracks or rely on basic note constraints, while existing inpainting techniques lack explicit user guidance. The authors leverage a sketching paradigm adapted from computer vision to introduce Music SketchNet, a framework that decouples musical measures into distinct pitch contour and rhythm latent variables using a factorized variational autoencoder. This architecture allows creators to specify partial musical ideas, which the model seamlessly integrates with context-aware predictions to produce highly controllable compositions.

Method

The proposed framework, Music SketchNet, is designed to enable guided music generation by completing missing measures in monophonic musical pieces based on surrounding context and optional user-specified pitch and rhythm snippets. The architecture is built upon three core components: SketchVAE, SketchInpainter, and SketchConnector, which together form a hierarchical system for encoding, predicting, and refining latent representations of music.

At the foundation of the system is SketchVAE, a factorized variational autoencoder for music representation. It explicitly separates pitch and rhythm information into distinct latent variables, $z_{pitch}$ and $z_{rhythm}$, allowing for independent control and manipulation. SketchVAE processes a music measure by first converting it into two separate token sequences: $x_{pitch}$, which contains note values with padding to fill a 24-frame measure, and $x_{rhythm}$, which captures duration and onset information by replacing pitch events with rhythm-specific tokens. These sequences are then encoded independently by two GRU-based encoders, $Q_{\theta}$ for pitch and $Q_{\tau}$ for rhythm, whose outputs are concatenated to form the full latent variable $z = [z_{pitch}, z_{rhythm}]$. The hierarchical decoder $P_{\phi}$ reconstructs the original music from this latent representation by first decoding the measure into beats using a "beat" GRU layer, and then decomposing each beat into individual ticks using a "tick" GRU layer. This two-stage decoding process ensures that the output is generated in a musically intuitive manner. Refer to the framework diagram for a visual overview of the SketchVAE structure, including its encoders and hierarchical decoder.
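The split of one measure into a pitch sequence and a rhythm sequence can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. The token ids `HOLD`, `REST`, and `PAD`, the string rhythm tokens, and the function names are all assumptions introduced here. The text above only states that $x_{pitch}$ holds padded note values and that $x_{rhythm}$ replaces pitch events with rhythm-specific tokens.

```python
# Hypothetical token ids for illustration; the source does not specify them.
HOLD, REST, PAD = 128, 129, 130
FRAMES_PER_MEASURE = 24


def factorize_measure(frames):
    """Split one 24-frame monophonic measure into (x_pitch, x_rhythm)."""
    if len(frames) != FRAMES_PER_MEASURE:
        raise ValueError("expected a 24-frame measure")
    # x_pitch: only the note values, padded to the full measure length.
    pitches = [t for t in frames if t not in (HOLD, REST)]
    x_pitch = pitches + [PAD] * (FRAMES_PER_MEASURE - len(pitches))
    # x_rhythm: every pitch event becomes a generic onset token.
    x_rhythm = [
        "hold" if t == HOLD else "rest" if t == REST else "onset"
        for t in frames
    ]
    return x_pitch, x_rhythm


def recombine(x_pitch, x_rhythm):
    """Rebuild the frame sequence, showing that the split loses nothing."""
    pitches = iter(x_pitch)
    frames = []
    for token in x_rhythm:
        if token == "onset":
            frames.append(next(pitches))
        elif token == "hold":
            frames.append(HOLD)
        else:
            frames.append(REST)
    return frames


# Three notes of six frames each, with a six-frame rest before the last one.
measure = [60] + [HOLD] * 5 + [62] + [HOLD] * 5 + [REST] * 6 + [64] + [HOLD] * 5
x_pitch, x_rhythm = factorize_measure(measure)
print(x_pitch[:4])   # [60, 62, 64, 130]
print(x_rhythm[:7])
```

Because the two sequences are independent, swapping in the `x_rhythm` of another measure while keeping `x_pitch` (or the reverse) gives an intuition for why separate pitch and rhythm latents allow independent control.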

The SketchInpainter component is responsible for generating initial predictions for the missing measures by leveraging the surrounding musical context. It takes as input the latent sequences from the past and future contexts, $Z^{p}$ and $Z^{f}$, and processes them separately for pitch and rhythm using two independent GRU groups. The final hidden states from these encoders, $h_t^1$ and $h_t^3$, are used as initial states for two separate generation GRUs: the pitch generation GRUs and the rhythm generation GRUs. These generation GRUs predict the missing latent representations, $S^{m} = (S_{pitch}, S_{rhythm})$, in an auto-regressive manner, conditioned on the contextual information. The predicted latent sequences are then fed into the SketchVAE decoder to produce a music output, which is used to compute the cross-entropy loss during training. This initial prediction step establishes a strong baseline for the missing music material.
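The data flow of this step can be illustrated with a toy, pure-Python GRU. The sketch below is not the authors' implementation and makes several assumptions: weights are random and untrained, the past and future final states are concatenated to initialise the generator, the first generator input is the last past latent, and the sequence lengths and latent size are arbitrary. All names (`GRUCell`, `encode`, `inpaint_stream`) are hypothetical.

```python
import math
import random


def _matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        # One input matrix and one recurrent matrix per gate:
        # index 0 = update gate, 1 = reset gate, 2 = candidate state.
        self.W = [_rand_matrix(hidden_size, input_size, rng) for _ in range(3)]
        self.U = [_rand_matrix(hidden_size, hidden_size, rng) for _ in range(3)]

    def step(self, x, h):
        z = [_sigmoid(a + b) for a, b in
             zip(_matvec(self.W[0], x), _matvec(self.U[0], h))]
        r = [_sigmoid(a + b) for a, b in
             zip(_matvec(self.W[1], x), _matvec(self.U[1], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in
             zip(_matvec(self.W[2], x), _matvec(self.U[2], rh))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def encode(cell, sequence, hidden_size):
    """Run a GRU over a latent sequence and return its final hidden state."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = cell.step(x, h)
    return h


def inpaint_stream(past, future, n_missing, rng):
    """Predict n_missing latent vectors for ONE stream (pitch or rhythm)."""
    dim = len(past[0])
    past_enc, future_enc = GRUCell(dim, dim, rng), GRUCell(dim, dim, rng)
    generator = GRUCell(dim, 2 * dim, rng)
    out_proj = _rand_matrix(dim, 2 * dim, rng)
    # Assumption: both final context states jointly initialise the generator.
    h = encode(past_enc, past, dim) + encode(future_enc, future, dim)
    x = past[-1]
    predictions = []
    for _ in range(n_missing):
        h = generator.step(x, h)
        x = _matvec(out_proj, h)  # the prediction becomes the next input
        predictions.append(x)
    return predictions


rng = random.Random(0)
DIM, N_CONTEXT, N_MISSING = 4, 6, 4  # arbitrary illustrative sizes


def random_latents(n):
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


# Pitch and rhythm are handled by two independent GRU groups.
s_pitch = inpaint_stream(random_latents(N_CONTEXT), random_latents(N_CONTEXT), N_MISSING, rng)
s_rhythm = inpaint_stream(random_latents(N_CONTEXT), random_latents(N_CONTEXT), N_MISSING, rng)
s_m = list(zip(s_pitch, s_rhythm))  # one (pitch, rhythm) pair per missing measure
```

In a trained model the weights would be learned, and each pair in `s_m` would be passed to the SketchVAE decoder so that the reconstruction can be scored with cross-entropy.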

To incorporate user guidance and refine the initial prediction, the SketchConnector module is introduced. This component modifies the latent predictions from the SketchInpainter by integrating user-specified sketching information, denoted as $C$. The process begins by concatenating the predicted $S^{m}$ with the past and future latent sequences, followed by a random unmasking procedure in which a portion of the predicted latent variables (set at a 0.3 rate) is replaced with ground-truth values, simulating user-provided context. This mechanism, which is analogous to BERT-style training but with unmasking rather than masking, encourages the model to learn stronger correlations between adjacent musical elements. The unmasked data and the predicted $S^{m}$ are then fed into a transformer encoder with absolute positional encoding, which produces the final refined latent representation $Z^{m}$ for the missing measures. This refined latent variable is subsequently decoded by the SketchVAE to generate the final music output. As shown in the figure below, the SketchConnector acts as a final refinement step, ensuring that the generated music aligns with both the contextual cues and the user's specified sketching input.
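The random unmasking step can be sketched as follows. This is a minimal illustration under stated assumptions: each predicted latent is treated as one unit that is independently swapped for its ground-truth counterpart with probability 0.3, and the function names `random_unmask` and `build_connector_input` are hypothetical. The source gives the 0.3 rate but does not specify the granularity of the replacement.

```python
import random


def random_unmask(predicted, ground_truth, rate=0.3, rng=None):
    """Replace each predicted latent with its ground truth with probability `rate`.

    Returns the mixed sequence and a parallel list of flags marking which
    positions were revealed (True = ground truth was used).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("sequences must have the same length")
    rng = rng or random.Random()
    mixed, revealed = [], []
    for pred, truth in zip(predicted, ground_truth):
        use_truth = rng.random() < rate
        mixed.append(truth if use_truth else pred)
        revealed.append(use_truth)
    return mixed, revealed


def build_connector_input(past, mixed_middle, future):
    """Concatenate context and (partly revealed) predictions along time."""
    return list(past) + list(mixed_middle) + list(future)


# Strings stand in for latent vectors so the mixing is easy to read.
predicted = ["s0", "s1", "s2", "s3"]
ground_truth = ["z0", "z1", "z2", "z3"]
mixed, revealed = random_unmask(predicted, ground_truth, rate=0.3,
                                rng=random.Random(0))
sequence = build_connector_input(["p0", "p1"], mixed, ["f0", "f1"])
print(mixed, revealed)
```

Under this reading, the revealed positions play the role that a user's sketch would play at generation time: latents encoded from the user's pitch or rhythm snippet would presumably be placed where ground truth is placed during training, and the transformer encoder would refine the remaining positions around them.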

Experiment

The evaluation compares SketchNet against established baselines using objective generation metrics, subjective human listening tests, and an interactive control scenario to validate its inpainting capabilities. Objective and subjective results demonstrate that the model consistently outperforms competitors in pitch accuracy and overall musicality, effectively capturing repetitive patterns while maintaining contextual coherence in non-repetitive passages. Human listeners particularly appreciate the improved structural integrity and harmonic quality of the generated melodies, despite similar note complexity. Additionally, interactive experiments confirm that the system reliably follows user-specified pitch and rhythm sketches, proving its effectiveness as a highly controllable music generation tool.

The authors evaluate SketchNet against baseline systems, focusing on pitch and rhythm accuracy in the generated melodies, where it outperforms the baselines. A virtual control experiment assesses how user-specified rhythm and pitch information influences the model's output. The model follows both kinds of control with high accuracy, with better performance in pitch following, which shows that user control significantly shapes the generated output.
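This summary does not spell out how control-following accuracy is computed. One plausible reading, shown here purely as an assumption, is token-level agreement between the tokens the user specified and the tokens found in the generated measure. The function name `token_accuracy` and the example sequences are hypothetical.

```python
def token_accuracy(generated, specified):
    """Fraction of positions where the generated token equals the specified one."""
    if len(generated) != len(specified):
        raise ValueError("sequences must have the same length")
    if not specified:
        raise ValueError("sequences must not be empty")
    matches = sum(g == s for g, s in zip(generated, specified))
    return matches / len(specified)


# Rhythm following: compare the rhythm tokens of the output with the sketch.
user_rhythm = ["onset", "hold", "hold", "onset"]
generated_rhythm = ["onset", "hold", "onset", "onset"]
rhythm_acc = token_accuracy(generated_rhythm, user_rhythm)

# Pitch following: compare the note values of the output with the sketch.
user_pitch = [60, 62, 64]
generated_pitch = [60, 62, 64]
pitch_acc = token_accuracy(generated_pitch, user_pitch)

print(rhythm_acc, pitch_acc)  # 0.75 1.0
```

Scoring the two streams separately mirrors the factorized design: a measure can follow the requested rhythm closely while deviating in pitch, or the reverse.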

The authors compare SketchNet with Music InpaintNet and other baselines through objective metrics and subjective listening tests. In the subjective evaluation, SketchNet achieves higher structure and overall musicality scores than InpaintNet while maintaining comparable complexity. It also improves on the baselines in both the repetition and non-repetition test subsets. In interactive scenarios, user control over pitch and rhythm leads to targeted generation, with high accuracy in following the specified inputs.

The authors evaluate SketchNet against several baselines, including Music InpaintNet and variations of SketchVAE, across different test subsets. SketchNet achieves the best performance on all test subsets in both accuracy and loss, particularly in non-repetition scenarios where copying is not viable. The improvements are more pronounced in pitch accuracy than in rhythm accuracy, especially in non-repetition scenarios. User-specified pitch and rhythm cues significantly influence the generated output and are followed with high accuracy.

The evaluation compares SketchNet against multiple baseline systems using objective metrics and subjective listening tests across repetition and non-repetition scenarios to validate its capacity for generating musically coherent melodies under user-defined pitch and rhythm constraints. Results consistently indicate that SketchNet surpasses competing approaches in structural integrity and overall musicality while maintaining precise adherence to user-specified controls. The model proves particularly effective in interactive generation tasks, where user cues significantly shape the output and yield stronger pitch alignment than rhythm tracking, especially within complex non-repetition contexts.

