1 day ago

Shuofei Qiao Yunxiang Wei Jiazheng Fan Bin Wu Busheng Zhang Mengru Wang Yuqi Zhu Ningyu Zhang Keyan Ding Qiang Zhang


Abstract

The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion", in which fragmented and unstructured knowledge organization hinders deep interdisciplinary integration. Existing academic retrieval tools rely mainly on shallow keyword matching or semantic retrieval in vector space, and they lack the topological reasoning capabilities needed to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucination and incur high inference costs. To bridge this gap, this report introduces SciAtlas, a large-scale, multidisciplinary, heterogeneous knowledge graph of academic resources, designed as a comprehensive scientific evolutionary network. By integrating more than 43 million papers from 26 disciplines, with a total of 157 million entities and 3 billion triplets, SciAtlas provides a structured topological cognitive substrate that breaks down barriers between disciplines and gives AI agents a global perspective. In addition, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, which moves smoothly from simple semantic matching to deterministic association discovery. We also present the main application directions of SciAtlas, including literature review, automated synthesis of research trends, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective "cognitive map" that empowers the full loop of automated scientific research while significantly reducing reasoning costs. We have released the retrieval interfaces for the knowledge graph (KG) and various downstream tasks in our GitHub repository.

One-sentence Summary

The authors introduce SciATLAS, a large-scale heterogeneous knowledge graph that integrates 43 million papers across 26 disciplines into 157 million entities and 3 billion triplets, employing a neuro-symbolic retrieval algorithm with tri-path collaborative recall and graph reranking to enable deterministic association discovery, significantly reduce reasoning costs, and serve as a cognitive map for automated scientific research.

Key Contributions

SciATLAS, a large-scale heterogeneous academic knowledge graph, integrates 43 million papers across 26 disciplines into a topological network comprising 157 million entities and 3 billion triplets. This structured substrate dismantles disciplinary barriers and equips AI agents with a deterministic cognitive foundation for interdisciplinary research.
A neuro-symbolic retrieval algorithm utilizing tri-path collaborative recall and graph reranking transitions literature search from semantic matching to deterministic association discovery. This method anchors large language models with explicit graph traversal to mitigate logical hallucinations and lower the inference costs of deep-research agents.
The framework enables key automated research workflows, including literature review, trend synthesis, idea positioning, and academic trajectory exploration. Publicly released interfaces for knowledge graph retrieval and downstream tasks confirm its utility as a scalable cognitive map for end-to-end research automation.

Introduction

The exponential growth of global academic output has created an information explosion that impedes deep interdisciplinary integration and challenges the efficiency of automated scientific research workflows. Current retrieval mechanisms struggle to support this domain because they rely on flattened keyword matching or vector-space semantic search, which lack the topological reasoning necessary to navigate complex logical connections. Furthermore, agentic deep-research frameworks often incur prohibitive inference costs and suffer from logical hallucinations due to missing deterministic cognitive maps. The authors present SciATLAS, a massive heterogeneous knowledge graph integrating over 43 million papers across 26 disciplines to provide a structured topological substrate for scientific discovery. They leverage a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking to enable deterministic association discovery without iterative LLM calls. This approach allows AI agents to access a global cognitive perspective for tasks like idea positioning and trend synthesis while significantly reducing reasoning overhead.

Dataset

Source and Composition: The authors construct SciATLAS using OpenAlex as the foundational data source, which originally catalogs over 480 million academic publications. The knowledge graph centers on Papers and integrates interconnected entities including Authors, Institutions, Keywords, and a four-tier disciplinary hierarchy (Domains, Fields, Subfields, and Topics).
Scale and Filtering Rules: The finalized dataset contains 43.30 million papers, 109.70 million authors, 3.76 million keywords, and 0.12 million institutions. The filtering pipeline strictly retains English publications with sufficiently long abstracts and valid PDF URLs. It normalizes and deduplicates paper titles and institution names while intentionally preserving author duplicates to handle naming ambiguity. Records lacking critical attributes are removed.
Metadata Construction and Processing: To replace OpenAlex's sparse macroscopic concepts, the authors employ a lightweight LLM to extract three to eight reusable core keywords per paper from abstracts. Each keyword receives an importance score, and co-occurrence edges are weighted by frequency to capture conceptual links. The pipeline also generates semantic vectors using bge-large-en-v1.5 for titles, abstracts, and keywords, storing them directly as node attributes to enable hybrid retrieval.
Usage and Integration: The processed graph is deployed in Neo4j and organized across four relational levels: semantic (citations and relevance), conceptual (keyword co-occurrence), directional (disciplinary hierarchy), and social (authorship and institutional affiliations). Rather than relying on traditional training splits or mixture ratios, the authors leverage the knowledge graph for topological search and reasoning. They feed chronologically ordered paper sequences and author publication lists into structured LLM prompts to generate JSON outputs for research trend prediction and academic profiling.
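The frequency-weighted keyword co-occurrence edges described under Metadata Construction and Processing can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `cooccurrence_edges`, the lowercase normalization, and the toy keyword lists are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paper_keywords):
    """Count how often each unordered keyword pair appears in the same paper.

    paper_keywords: iterable of keyword lists, one list per paper.
    Returns a Counter mapping (keyword_a, keyword_b) -> co-occurrence frequency.
    """
    counts = Counter()
    for keywords in paper_keywords:
        # Assumed normalization: lowercase and strip, then deduplicate.
        # Sorting makes (a, b) and (b, a) map to the same edge key.
        unique = sorted({k.strip().lower() for k in keywords})
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical keyword lists for three papers.
papers = [
    ["knowledge graph", "retrieval", "random walk"],
    ["Knowledge Graph", "retrieval"],
    ["retrieval", "reranking"],
]
edges = cooccurrence_edges(papers)
```

Each count can then be stored as the weight of the corresponding keyword-keyword edge, so that pairs seen together in many papers form stronger conceptual links.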

Method

The retrieval system is designed to support a wide range of query types, including keywords, scientific questions, abstracts, idea texts, and complete papers, by mapping them into the knowledge graph (KG) through multiple distinct pathways. The framework begins with node matching, where queries are processed to identify candidate entities. For keyword-based queries, an LLM extracts a list of keywords along with their importance scores, forming a set $\mathcal{K} = \{(k_i, s_i^{\text{llm}})\}_{i=1}^{m}$. These keywords undergo exact text matching and vector-based semantic matching against the KG. For exact matches, the score is directly assigned as $s_i^{\text{llm}}$, while for vector matches, the score is computed as $s_i^{\text{llm}} \cdot \text{sim}(k_i, \mathbf{g})$, retaining nodes only if the similarity exceeds a threshold $\theta_{kw}$. The final weight for each keyword node $g$ is the maximum of all its matching scores, resulting in the set $\mathcal{K}_{\text{seed}} = \{(g, w_g^{kw})\}$.
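The keyword seed scoring above can be illustrated with a small sketch. It assumes cosine similarity for $\text{sim}$, a hypothetical threshold of 0.5 for $\theta_{kw}$, and toy two-dimensional embeddings; none of these values come from the source.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_seeds(query_keywords, kg_keywords, theta_kw=0.5):
    """Build K_seed = {g: w_g^kw}.

    query_keywords: list of (text, llm_score, embedding) from the LLM extractor.
    kg_keywords: dict mapping KG keyword text -> embedding.
    theta_kw is an assumed value; the source does not state it.
    """
    seeds = {}
    for text, s_llm, emb in query_keywords:
        for g, g_emb in kg_keywords.items():
            if g == text:
                score = s_llm  # exact match: score is s_i^llm
            else:
                sim = cosine(emb, g_emb)
                if sim <= theta_kw:
                    continue  # keep only matches above the threshold
                score = s_llm * sim  # vector match: s_i^llm * sim(k_i, g)
            # The final node weight is the maximum over all matching scores.
            seeds[g] = max(seeds.get(g, 0.0), score)
    return seeds


# Toy KG keyword nodes and query keywords (hypothetical data).
kg = {
    "graph retrieval": [1.0, 0.0],
    "graph search": [0.6, 0.8],
    "protein folding": [0.0, 1.0],
}
query = [
    ("graph retrieval", 0.9, [1.0, 0.0]),
    ("graph search", 0.5, [0.6, 0.8]),
]
seeds = keyword_seeds(query, kg)
```

In this example "graph search" is reached both by an exact match (score 0.5) and by a vector match from "graph retrieval" (score 0.9 × 0.6), and the larger value is kept.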

For semantic matching, the query $q$ is embedded into a vector $\mathbf{e}_q$, and the top-60 papers are retrieved based on title and abstract embeddings. A reranker then reorders these candidates, retaining the top-15 from each source. The final score for each paper $p$ is a weighted combination of its title and abstract retrieval scores, normalized to handle missing values.

Title matching is applied only when the query contains paper titles. GROBID extracts the titles, and an LLM assigns a confidence score $c_j$ to each extracted title $t_j$. These titles are normalized and matched against the KG using exact or fuzzy similarity, with a threshold $\theta_{\text{title}}$ for filtering. The matching score for a paper $p$ is $c_j \cdot m(t_j, p)$, where $m(t_j, p)$ combines LCS and token overlap. Multiple title matches are resolved by taking the maximum score.
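The title-matching score can be sketched as below. The source does not say how $m(t_j, p)$ combines LCS and token overlap, nor does it expand the acronym, so this sketch assumes an equal-weight average of a longest-common-subsequence ratio and Jaccard token overlap, with a hypothetical threshold of 0.8 for $\theta_{\text{title}}$.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalize(title):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(title.lower().split())


def title_match(t, p, w_lcs=0.5):
    """m(t, p): assumed blend of LCS ratio and Jaccard token overlap."""
    t, p = normalize(t), normalize(p)
    if not t or not p:
        return 0.0
    lcs_ratio = lcs_length(t, p) / max(len(t), len(p))
    tokens_t, tokens_p = set(t.split()), set(p.split())
    overlap = len(tokens_t & tokens_p) / len(tokens_t | tokens_p)
    return w_lcs * lcs_ratio + (1 - w_lcs) * overlap


def title_score(extracted, paper_title, theta_title=0.8):
    """Return max_j c_j * m(t_j, p) over titles that pass the threshold.

    extracted: list of (title, confidence) pairs, e.g. from GROBID plus an LLM.
    """
    best = 0.0
    for title, confidence in extracted:
        m = title_match(title, paper_title)
        if m >= theta_title:
            best = max(best, confidence * m)
    return best


# Hypothetical extracted titles with LLM confidences.
extracted = [("Attention Is All You Need", 0.9), ("Deep Residual Learning", 0.8)]
score = title_score(extracted, "Attention is all you need")
```

Only the first extracted title passes the threshold against this paper, so the score is its confidence times a perfect match.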

The results from the semantic and title matching pathways are merged into a unified set of candidate paper nodes, $\mathcal{P}_{\text{seed}}$. To unify the scores, a dot product between the query embedding and the paper's title and abstract embeddings is computed, followed by MinMax normalization. The final pre-graph weight for each paper $p$ is defined as $s_p^{pre} = \lambda_{emb} \widetilde{s}_p^{emb} + \lambda_{title} \widetilde{s}_p^{title} + b_p^{pre}$, where $b_p^{pre}$ is a title bonus based on exact or fuzzy title hits. This process establishes the initial seed nodes for the retrieval.
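A minimal sketch of the pre-graph weight follows. The values of $\lambda_{emb}$, $\lambda_{title}$, and the bonus are placeholders chosen for illustration, and mapping a constant score list to zeros is an assumed convention for the degenerate MinMax case.

```python
def minmax(values):
    """MinMax-normalize a list to [0, 1]; a constant list maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pre_graph_scores(emb_scores, title_scores, title_bonus,
                     lam_emb=0.7, lam_title=0.3):
    """s_p^pre = lam_emb * norm(emb) + lam_title * norm(title) + bonus.

    The lambda values are hypothetical; the source does not list them.
    """
    emb_n = minmax(emb_scores)
    title_n = minmax(title_scores)
    return [lam_emb * e + lam_title * t + b
            for e, t, b in zip(emb_n, title_n, title_bonus)]


# Three hypothetical candidate papers; only the third has a title hit.
scores = pre_graph_scores(
    emb_scores=[0.2, 0.5, 0.8],
    title_scores=[0.0, 0.0, 1.0],
    title_bonus=[0.0, 0.0, 0.1],
)
```

Because the bonus is added after normalization, a paper with a title hit can exceed 1 at this stage; the later final-score step normalizes the pre-graph score again.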

The system then performs a 2-hop subgraph propagation from the seed nodes, treating all edges as undirected. To manage scalability, at most 500 nodes per entity type are selected at each hop. Paper importance is computed based on citation count using a logarithmic scaling to prevent dominance by highly cited papers. The unnormalized weight for each seed paper $p$ is $w_p^{seed} = s_p^{pre} \cdot (1 + \gamma \cdot \text{imp}(p))$, where $\gamma$ controls the influence of importance. For each seed keyword $g$, the weight is $w_g^{seed} = w_g^{kw}$. The initial distribution $\mathbf{s}$ over nodes is defined as $s_v = w_v^{seed} / Z$ for nodes in the seed set $S = \mathcal{P}_{\text{seed}} \cup \mathcal{K}_{\text{seed}}$, with $Z = \sum_{v \in S} w_v^{seed}$ as the normalization constant. Edge weights are assigned based on type, as defined in the table.
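The seed distribution can be sketched as follows. The source says only that importance uses logarithmic scaling of citation counts, so the specific form $\text{imp}(p) = \log(1 + c_p) / \log(1 + c_{\max})$ and the value of $\gamma$ are assumptions.

```python
import math


def importance(citations, max_citations):
    """Assumed log-scaled importance in [0, 1]: log(1 + c) / log(1 + c_max)."""
    if max_citations <= 0:
        return 0.0
    return math.log1p(citations) / math.log1p(max_citations)


def seed_distribution(paper_seeds, keyword_seeds, gamma=0.5):
    """Build the initial distribution s over seed nodes.

    paper_seeds: dict paper_id -> (s_pre, citation_count).
    keyword_seeds: dict keyword_id -> w_kw.
    Paper and keyword ids are assumed to be distinct. gamma is hypothetical.
    """
    max_c = max((c for _, c in paper_seeds.values()), default=0)
    weights = {
        p: s_pre * (1 + gamma * importance(c, max_c))  # w_p^seed
        for p, (s_pre, c) in paper_seeds.items()
    }
    weights.update(keyword_seeds)  # w_g^seed = w_g^kw
    z = sum(weights.values())  # normalization constant Z
    return {v: w / z for v, w in weights.items()} if z > 0 else {}


# Hypothetical seeds: two papers and one keyword.
s = seed_distribution(
    paper_seeds={"p1": (1.0, 99), "p2": (0.5, 0)},
    keyword_seeds={"k1": 0.5},
)
```

Here the most-cited paper gets importance 1 and therefore weight 1.5 before normalization, while the uncited paper keeps its pre-graph score unchanged.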

To explore topological relationships, a random walk with restart is performed on the graph. The transition probability from node $u$ to neighbor $v$ is $P(v \mid u) = \omega(u, v) / \sum_{x \in N(u)} \omega(u, x)$, where $\omega(u, v)$ is the weight of the edge between $u$ and $v$ and $N(u)$ is the set of neighbors of $u$. The score vector $\mathbf{r}^{(t)}$ is initialized as $\mathbf{r}^{(0)} = \mathbf{s}$ and updated iteratively as $r_v^{(t+1)} = \alpha s_v + (1 - \alpha) \sum_u r_u^{(t)} P(v \mid u)$, where $\alpha$ is the restart probability. The process terminates when the $L_1$ norm of the difference between consecutive iterations falls below $10^{-6}$ or after 50 iterations. The final node score $r_v$ is the result of this diffusion.
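The update rule above can be written directly as a short loop. This is a sketch under stated assumptions: the restart probability `alpha=0.15` and the toy edge weights are placeholders, since the source gives neither $\alpha$ nor the edge-weight table here.

```python
def random_walk_with_restart(edges, seed, alpha=0.15, tol=1e-6, max_iter=50):
    """Random walk with restart on an undirected weighted graph.

    edges: dict (u, v) -> weight, each pair treated as undirected.
    seed: dict node -> restart probability s_v (should sum to 1).
    alpha is an assumed restart probability.
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    nodes = set(neighbors) | set(seed)
    s = {v: seed.get(v, 0.0) for v in nodes}
    r = dict(s)  # r^(0) = s
    for _ in range(max_iter):
        nxt = {v: alpha * s[v] for v in nodes}
        for u in nodes:
            out = neighbors.get(u)
            if not out:
                continue  # isolated node: its walk mass is dropped in this sketch
            total = sum(out.values())
            for v, w in out.items():
                # r_u * P(v | u), with P(v | u) = w(u, v) / sum_x w(u, x)
                nxt[v] += (1 - alpha) * r[u] * w / total
        diff = sum(abs(nxt[v] - r[v]) for v in nodes)  # L1 change
        r = nxt
        if diff < tol:
            break
    return r


# Toy path graph: paper p1 - keyword k1 - paper p2, restarting at p1.
r = random_walk_with_restart(
    edges={("p1", "k1"): 1.0, ("k1", "p2"): 1.0},
    seed={"p1": 1.0},
)
```

In the toy graph, `p2` is never a seed but still receives score through the shared keyword `k1`, which is how the diffusion surfaces papers that the initial matching missed.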

Finally, the system computes a comprehensive final score for each paper $p$ as $s_p^{final} = \min(1, \lambda_{pre} \tilde{s}_p^{pre} + \lambda_{graph} \tilde{s}_p^{graph} g_p + \lambda_{imp} \text{imp}_{final}(p))$ . The pre-graph score $\tilde{s}_p^{pre}$ is MinMax-normalized, and the graph score $\tilde{s}_p^{graph}$ is similarly normalized. The graph support factor $g_p = \max(0.25, \tilde{s}_p^{pre})$ acts as a gate, ensuring that graph-discovered papers must have sufficient initial relevance to achieve high ranks. The final score combines initial relevance, topological support, and citation importance, and the top-20 papers are returned with detailed explanations.
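The gated final score can be sketched as follows. The three $\lambda$ weights are placeholders, and $\text{imp}_{final}(p)$ is assumed to already lie in $[0, 1]$; only the formula, the 0.25 gate floor, and the top-20 cutoff come from the text.

```python
def minmax(scores):
    """MinMax-normalize dict values to [0, 1]; constant input maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def final_scores(pre, graph, imp, lam_pre=0.5, lam_graph=0.3, lam_imp=0.2, top_k=20):
    """Rank papers by s_final = min(1, lam_pre*pre + lam_graph*graph*g_p + lam_imp*imp).

    pre, graph, imp: dicts paper_id -> raw score. Lambda values are hypothetical.
    """
    pre_n, graph_n = minmax(pre), minmax(graph)
    scored = {}
    for p in pre:
        gate = max(0.25, pre_n[p])  # graph support factor g_p
        scored[p] = min(
            1.0,
            lam_pre * pre_n[p]
            + lam_graph * graph_n.get(p, 0.0) * gate
            + lam_imp * imp.get(p, 0.0),
        )
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Paper "a" is strong on initial relevance, paper "b" only on graph support.
ranked = final_scores(
    pre={"a": 1.0, "b": 0.0},
    graph={"a": 0.0, "b": 1.0},
    imp={"a": 0.5, "b": 0.5},
)
```

Paper "b" has the highest graph score, but with zero initial relevance its graph term is scaled by the 0.25 floor, so it ranks below "a".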



SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
