Gravity Falls: A Comparative Analysis of Domain Generation Algorithm (DGA) Detection Techniques in Targeted Phishing Attacks Against Mobile Devices
Adam Dorian Wong, John D. Hastings
Abstract
Mobile devices are frequently targeted by cybercriminal actors through SMS spearphishing (smishing) links that leverage domain generation algorithms (DGAs) to rotate adversarial infrastructure on demand. However, DGA research and evaluation have focused primarily on malware command-and-control (C2) and email phishing datasets, leaving limited empirical evidence of how well detectors generalize to smishing-driven domain tactics outside the enterprise perimeter. This paper fills that gap by evaluating traditional and machine-learning-based DGA detectors on Gravity Falls, a new semi-synthetic dataset derived from smishing links delivered between 2022 and 2025. The Gravity Falls dataset captures a single threat actor's evolution across four tactic clusters, tracing a shift from short randomized strings to dictionary concatenation and on to themed combo-squatting variants used for credential theft and toll/fee scams. The evaluation uses Top-1M domains as the benign baseline and tests two string-analysis approaches (Shannon entropy and Exp0se) and two machine-learning-based detectors (an LSTM classifier and COSSAS DGAD). The results show that detection performance is strongly tactic-dependent: performance peaks on randomized-string domains but degrades on dictionary-concatenation and themed combo-squatting domains, with recall remaining low across multiple tool-cluster pairings. Overall, neither traditional heuristics nor recent machine-learning detectors consistently keep pace with the continuously evolving DGA tactics observed in Gravity Falls, underscoring the need for more context-aware approaches and providing a reproducible benchmark for future evaluation.
One-sentence Summary
Adam Dorian Wong and John D. Hastings of Dakota State University introduce Gravity Falls, a semi-synthetic smishing-derived DGA dataset spanning 2022–2025, revealing that both traditional heuristics and ML detectors (including LSTM and COSSAS DGAD) fail against evolving tactics like themed combo-squatting, urging context-aware defenses for mobile threat landscapes.
Key Contributions
- The paper introduces Gravity Falls, a new semi-synthetic DGA dataset derived from real-world SMS spearphishing campaigns (2022–2025), capturing a threat actor’s evolving tactics across four technique clusters—from randomized strings to themed combo-squatting—filling a gap in mobile-targeted DGA research previously dominated by malware C2 and email datasets.
- It evaluates four DGA detectors (Shannon entropy, Exp0se, LSTM, COSSAS DGAD) against Gravity Falls using Top-1M domains as benign baselines, revealing that all methods struggle with dictionary-based and themed domains, showing tactic-dependent performance and low recall in multiple tool-cluster pairings.
- The findings demonstrate that both traditional heuristics and recent ML-based detectors are ill-suited for the dynamic, context-rich DGA patterns in smishing, motivating context-aware detection methods and providing a reproducible benchmark for future evaluation of mobile threat infrastructure.
Introduction
The authors leverage the Gravity Falls dataset—a semi-synthetic collection of smishing-driven DGA domains from 2022 to 2025—to evaluate how well traditional and machine-learning DGA detectors perform against real-world, evolving attack tactics outside enterprise networks. While prior work focuses on malware C2 or email phishing, smishing targets individuals with fewer protections and rapidly rotating domains, making detection critical yet understudied. The authors find that both entropy-based heuristics and modern ML models like LSTM and COSSAS DGAD struggle with dictionary concatenation and themed combo-squatting variants, revealing a gap in detector adaptability to tactic shifts. Their main contribution is a new benchmark dataset and evidence that current tools are insufficient for smishing-specific DGA evolution, urging context-aware detection methods.
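The entropy-based heuristic mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the 2.5 bits-per-character threshold is purely illustrative:

```python
import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    """Character-level Shannon entropy (bits per character) of a domain label."""
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_random(label: str, threshold: float = 2.5) -> bool:
    """Flag labels whose entropy exceeds an illustrative threshold."""
    return shannon_entropy(label) > threshold

print(looks_random("xkqzvbw"))  # random-looking 7-char label -> True
print(looks_random("google"))   # common benign label -> False
```

As the paper's results suggest, this kind of sieve separates randomized labels from common words well, but dictionary concatenations can score high on character entropy too, which is one reason the heuristic degrades on later tactic clusters.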
Dataset

- The authors use the Gravity Falls dataset, composed of C2 domains delivered via SMS between 2022 and 2025, organized into four technique clusters reflecting the annual evolution of the same threat actor's TTPs. The data is semi-synthetic, blending observed malicious domains with predicted ones used for sinkholing and measurement.
- Each cluster has distinct characteristics:
  - Cats Cradle (2022): Short randomized 7-character domains with common TLDs; landing pages mimicked CAPTCHA portals.
  - Double Helix (2023): Dictionary-based concatenations with newer gTLDs; occasional truncations suggest encoding constraints.
  - Pandoras Box (2024): Professional package-delivery lures; combo-squatting with random suffixes; heavy use of Chinese infrastructure.
  - Easy Rider (2025): Government/toll-themed lures; shifted to email-to-iMessage/SMS with foreign numbers; combo-squatting stabilized.
- Control groups (10,000 domains each) were drawn from Alexa, Cisco, Cloudflare, and Majestic Top-1M lists (2017–2025), treated as benign baselines. Experimental groups combined 5,000 malicious domains from each cluster with 5,000 from the Alexa Top-1M to maintain consistent size; Alexa was used for padding due to its static nature.
- Data was collected via recipient-side SMS observation, followed by WHOIS lookups (via DomainTools), passive DNS queries (SecurityTrails), and URL snapshots (URLscan). From 2024 onward, Iris Investigate replaced manual workflows, enabling link graphs and structured CSV exports. IOCs were initially shared via OTX, later migrated to GitHub with curation to avoid platform suspensions.
- For model evaluation, domains were randomized using Claude AI scripts and fed into the tools in order (Control A–D, then Experimental A–D), with malicious samples stacked before benign ones to test for potential model assimilation. No explicit cropping or metadata construction beyond tool outputs was applied, though future work suggests retroactive standardization via DomainTools for higher fidelity.
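The experimental-group construction described here might look like the following sketch. The 5,000/5,000 split and the malicious-before-benign ordering come from the source; the function name, seed, and placeholder domains are illustrative:

```python
import random

def build_experimental_group(malicious, benign, k=5000, seed=1):
    """Sample k malicious and k benign (Alexa) domains, keeping the
    malicious block stacked ahead of the benign block as described."""
    rng = random.Random(seed)
    mal = rng.sample(malicious, k)
    ben = rng.sample(benign, k)
    # Label 1 = malicious, 0 = benign; malicious samples come first.
    return [(d, 1) for d in mal] + [(d, 0) for d in ben]

# Toy demonstration with placeholder domains and k=3.
mal = [f"bad{i}.top" for i in range(10)]
ben = [f"site{i}.com" for i in range(10)]
group = build_experimental_group(mal, ben, k=3)
```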
Method
The authors analyze two distinct domain-generation techniques used by the threat actor to stand up smishing infrastructure, each introducing controlled randomness to evade automated detection while remaining deliverable to human targets.

In the first approach, Cats Cradle (2022), the actor generated randomized sequences of alphabetical characters constrained to lengths between five and eight characters. This method relies on the unpredictability of letter arrangements to frustrate pattern-based blocking, enforcing no semantic meaning and instead prioritizing sheer variability; the corresponding landing pages mimicked CAPTCHA portals to lend the lures legitimacy.

The second technique, Double Helix (2023), adopts a more linguistically grounded strategy by concatenating pairs of dictionary words. This dual-word structure preserves surface plausibility while expanding the combinatorial space of candidate domains, making the infrastructure harder for defenders to predict or enumerate. The authors assess both techniques under the same adversarial objective: validating targets through the deployment of fake CAPTCHA pages that mimic real-world conditions.
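The two generation styles can be illustrated with a short sketch. The TLDs and the wordlist below are placeholders, not values recovered from the campaigns:

```python
import random
import string

rng = random.Random(0)

def cats_cradle_label(length: int = 7) -> str:
    """Cats Cradle style: a randomized run of lowercase letters
    (the observed labels fell between five and eight characters)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

# Placeholder wordlist; the actor's actual dictionary is unknown.
WORDS = ["gravity", "falls", "river", "stone", "maple", "cedar"]

def double_helix_label() -> str:
    """Double Helix style: two distinct dictionary words concatenated."""
    return "".join(rng.sample(WORDS, 2))

print(cats_cradle_label() + ".top")   # e.g. a random 7-letter label
print(double_helix_label() + ".xyz")  # e.g. two words fused together
```

The contrast is visible even in this toy form: the first style maximizes character-level unpredictability, while the second hides in plain sight among legitimate-looking word pairs.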
No architectural diagrams or training workflows are provided in the source material; the focus remains on the design and intent of the CAPTCHA generation strategies rather than their implementation or evaluation infrastructure.
Experiment
- Evaluated four domain-generation tactics (Cats Cradle, Double Helix, Pandoras Box, Easy Rider) using traditional and ML-based detectors, revealing strong performance only on randomized domains (Cats Cradle) and poor detection on dictionary-based or combo-squatting variants.
- Traditional detectors like Exp0se excelled at high-entropy domains but struggled with structured, dictionary-driven tactics, confirming their role as high-throughput sieves rather than comprehensive solutions.
- ML-based tools (LSTM, DGAD) showed limited generalization beyond randomized domains, indicating current models are not robust against blended, real-world smishing tactics that mix brand tokens and minor randomization.
- Defenders should adopt layered strategies: use lexical heuristics for obvious random domains, and supplement with contextual signals (message content, infrastructure, brand abuse policies) for more sophisticated tactics.
- LLMs demonstrated potential in identifying thematic patterns across clusters, suggesting future integration could enhance detection capabilities.
- Experimental limitations include semi-synthetic data, sampling duplicates, skewed benign/malicious ratios, and outdated benign baselines, all of which constrain generalizability and should be addressed in future work.
The authors evaluate four domain detection methods across four distinct domain-generation tactics, finding that performance varies significantly by tactic type. Traditional and ML-based detectors achieve high precision and accuracy on randomized domains but struggle with dictionary-based and themed combo-squatting domains. Results indicate that current tools are not robust against real-world smishing tactics that blend recognizable words with minor randomization.
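A minimal harness for the per-cluster precision/recall reporting described above could look like this sketch; the predictions and labels are toy values, not results from the paper:

```python
def precision_recall(preds, labels):
    """Precision and recall for binary malicious (1) / benign (0) predictions."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy illustration: a detector that flags half the malicious samples
# and one benign sample scores 0.5 on both metrics.
p, r = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
print(p, r)  # -> 0.5 0.5
```

Running each detector over each tactic cluster with a harness like this is what surfaces the tactic-dependent recall drops the authors report.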
