Microsoft Academic automatic document searches: accuracy for journal articles and suitability for citation analysis

Mike Thelwall

Abstract

Microsoft Academic is a free academic search engine and citation index that is similar to Google Scholar but can be queried automatically. Its data is potentially useful for bibliometric analysis if individual journal articles can be searched for effectively. This article compares different methods of finding journal articles in its index by searching for a combination of title, authors, publication year, and journal name, and uses the results for the widest published correlation analysis to date of Microsoft Academic citation counts for journal articles. Based on 126,312 articles from 323 Scopus subfields in 2012, the optimal strategy for finding articles with digital object identifiers (DOIs) is to search for them by title and filter out those with incorrect DOIs. This finds 90% of journal articles. For articles without DOIs, the optimal strategy is to search for them by title and then filter out matches with dissimilar metadata. This finds 89% of journal articles, with an additional 1% of incorrect matches. The remaining articles appear mainly to be either not indexed by Microsoft Academic or indexed under a different language version of their title. Among the matches, Scopus citation counts and Microsoft Academic counts have an average Spearman correlation of 0.95, with the lowest correlation for any single field being 0.63. Thus, Microsoft Academic citation counts are approximately equivalent to Scopus citation counts for articles that are not recent, but there are national biases in the results.

One-sentence Summary

By automating title-based searches and filtering mismatched metadata, this article demonstrates that Microsoft Academic retrieves approximately 90% of the sampled Scopus journal articles and yields citation counts that correlate strongly (average Spearman correlation of 0.95) with Scopus, supporting its suitability for large-scale bibliometric analysis despite documented national biases.

Key Contributions

  • This paper establishes an optimal retrieval strategy for journal articles with DOIs in Microsoft Academic by combining title-based searching with DOI validation filtering, successfully recovering 90% of target articles while eliminating incorrect matches.
  • For articles without DOIs, the work introduces a title-driven search protocol that filters records with dissimilar metadata to maintain precision, retrieving 89% of target articles with only a 1% incorrect match rate.
  • A large-scale correlation analysis of 126,312 articles across 323 Scopus subfields demonstrates an average Spearman correlation of 0.95 between Microsoft Academic and Scopus citation counts (0.63 in the weakest field), indicating approximate equivalence for articles that are not recent while identifying national biases in the results.

Introduction

Scholarly impact assessment depends on accurate citation tracking, yet matching journal articles across academic databases remains a persistent technical hurdle. Existing approaches frequently encounter metadata inconsistencies, particularly for publications without digital object identifiers, and citation counts from different sources often produce divergent impact metrics across disciplines. To address these gaps, the authors systematically evaluate multiple matching strategies for identifying journal articles in Microsoft Academic, benchmarking their accuracy both with and without DOIs. They further examine how Microsoft Academic citation counts correlate with Scopus data, ultimately delivering field-aware recommendations for reliable bibliometric analysis.

Dataset

  • Dataset Composition and Sources: The authors compile a bibliometric dataset drawn from Scopus-indexed journal articles published in 2012, paired with citation metrics and matching results retrieved from the Microsoft Academic API. This cross-database combination enables large-scale validation of citation indicators and automated document retrieval strategies.

  • Subset Details and Filtering: The initial pool captures the last 5,000 articles from 2012 in each of 335 Scopus sub-fields. After excluding seven fields with zero records and filtering out articles lacking DOIs, the dataset contains 1,005,074 documents across 326 fields. The authors then draw a random sample of 400 articles per field without replacement, ultimately removing three underpopulated fields to arrive at a final set of 126,312 journal articles distributed across 323 sub-fields.

  • Data Usage and Evaluation Framework: The authors do not employ traditional machine learning splits or mixture ratios. Instead, they use the dataset to systematically benchmark four Microsoft Academic query strategies across all 323 sub-fields. Each strategy is evaluated using precision and recall metrics, followed by Spearman correlation analyses to validate Microsoft Academic citation counts against Scopus benchmarks. Geometric means are calculated for citation counts to handle skewed distributions.

  • Metadata Construction and Processing: Query strings are built using only the first author (initial and surname), journal name, publication year, and title. All text undergoes strict normalization: conversion to lowercase, removal of accents, stripping of HTML tags, replacement of Greek letters with their phonetic equivalents, and substitution of special characters like hyphens, apostrophes, and ampersands with spaces. Match validation relies on exact DOI comparisons after lowercasing and dot removal. When DOIs are unavailable, the pipeline rejects results showing two or more metadata mismatches or title word overlaps below 85 percent.
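The normalization and match-validation rules described in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the partial Greek-letter map, and the definition of title overlap (the share of query-title words that also appear in the returned title) are all hypothetical, since the summary does not specify them.

```python
import re
import unicodedata

# Hypothetical, partial map: the source does not list the exact replacements.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "μ": "mu"}


def normalize(text: str) -> str:
    """Lowercase, strip HTML tags and accents, spell out Greek letters,
    and replace hyphens, apostrophes, and ampersands with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    for letter, name in GREEK.items():
        text = text.replace(letter, name)
    text = re.sub(r"[-'’&]", " ", text)  # special characters become spaces
    return " ".join(text.split())  # collapse repeated whitespace


def dois_match(doi_a: str, doi_b: str) -> bool:
    """Exact DOI comparison after lowercasing and removing dots."""
    def key(doi: str) -> str:
        return doi.strip().lower().replace(".", "")
    return key(doi_a) == key(doi_b)


def title_overlap(query_title: str, result_title: str) -> float:
    """Assumed definition: fraction of query-title words found in the result title."""
    query_words = normalize(query_title).split()
    result_words = set(normalize(result_title).split())
    if not query_words:
        return 0.0
    return sum(word in result_words for word in query_words) / len(query_words)


def accept_without_doi(metadata_mismatches: int, overlap: float) -> bool:
    """Reject on two or more metadata mismatches or on overlap below 85 percent."""
    return metadata_mismatches < 2 and overlap >= 0.85
```

For example, `normalize("<i>β</i>-Carotène & Co.")` yields `"beta carotene co."`, and a candidate whose title shares only four of five words with the query (overlap 0.8) would be rejected even with no other metadata mismatches.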

Experiment

This study evaluated Microsoft Academic’s capacity to retrieve journal articles and match citation counts against Scopus across diverse disciplines and national contexts. Retrieval testing confirmed that optimized title-based queries achieve high precision and recall, though coverage gaps primarily stem from journal indexing limitations, multilingual title inconsistencies, and query generation errors. Citation analysis demonstrated that Microsoft Academic counts strongly align with Scopus data across most fields, validating the platform as a reliable and cost-effective alternative for bibliometric evaluation. While highly practical for researchers without institutional database access, the findings caution against unadjusted cross-national comparisons and warn that citation counts remain vulnerable to manipulation in formal assessment settings.

The authors compare different query methods for retrieving articles from Microsoft Academic, evaluating each by recall and precision. Title-based queries generally achieve higher recall and precision than full or author-title queries, with the title-only method performing best overall and maintaining high median and mean values on both measures. Query design is therefore a critical factor in retrieval effectiveness: simpler title-based searches outperform more complex queries, and the optimal method is consistent with prior research.
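To make the two evaluation measures concrete, here is a small sketch of how recall and precision could be computed for one query method in one sub-field. The class name and the counts are hypothetical; the numbers are chosen only to mirror the headline figures (400 sampled articles, 89% found, 1% incorrect matches), and the definitions are the conventional ones rather than anything confirmed by the summary.

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    """Counts for one query method applied to one sample of articles."""
    searched: int   # articles submitted to the search
    correct: int    # returned records judged to be the right article
    incorrect: int  # returned records judged to be a different document

    @property
    def recall(self) -> float:
        # Share of searched articles for which the correct record was found.
        return self.correct / self.searched

    @property
    def precision(self) -> float:
        # Share of returned records that are correct.
        returned = self.correct + self.incorrect
        return self.correct / returned if returned else 0.0


# Illustrative counts only, not data from the study.
outcome = QueryOutcome(searched=400, correct=356, incorrect=4)
print(round(outcome.recall, 3), round(outcome.precision, 3))
```

Computing both measures per sub-field and then summarizing them with medians and means would give the kind of comparison across query methods that the paragraph above describes.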

The authors analyze the completeness of Microsoft Academic article matches by the country affiliation of the first author. Countries with higher percentages of English-language publications in Scopus also exhibit higher match rates in Microsoft Academic: English-speaking nations such as Australia and the United States achieve match rates of nearly 95%, while countries such as Brazil and China have lower rates. This suggests that language and indexing differences affect retrieval success, particularly for non-English publications.

Summary statistics for citation counts from Scopus and Microsoft Academic show that the two sources align closely, with an average Spearman correlation of 0.948 and only small differences in geometric means. The data suggest that Microsoft Academic can serve as a practical alternative to Scopus for citation analysis, though some variation exists across fields and articles.
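The two statistics used in this comparison can be sketched with the standard library alone. This is an illustrative implementation, not the authors' code: Spearman's correlation is computed as the Pearson correlation of tie-averaged ranks, and the geometric mean uses a log(1 + c) offset so that uncited articles can be included. That offset is an assumption, as the summary does not say how zero counts are handled, and the sample citation counts are invented.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def geometric_mean_offset(counts):
    """Assumed variant: exp(mean(ln(1 + c))) - 1, which tolerates zero counts."""
    return math.exp(sum(math.log(1 + c) for c in counts) / len(counts)) - 1


# Invented citation counts for five articles in one hypothetical field.
scopus = [0, 3, 10, 25, 7]
microsoft_academic = [1, 4, 12, 30, 9]
rho = spearman(scopus, microsoft_academic)
```

Because the two invented lists rank the five articles identically, `rho` is 1 here even though the raw counts differ, which is why a rank correlation suits skewed citation data. Averaging such per-field values across all sub-fields would give a figure comparable to the reported 0.948.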

Across the search strategies tested, adding author, journal, or year terms to the query does not improve on title-based searching: the fuller queries achieve lower recall and precision, in line with the abstract's finding that searching by title and then filtering the results is optimal. The optimal method is consistent with prior research but performs better for journal articles than it did in previous studies of repository documents.

The experiments evaluate Microsoft Academic’s retrieval effectiveness by testing various query strategies, assessing country-level match completeness, and comparing citation counts against Scopus. Query design proves critical, with title-based searches followed by DOI or metadata filtering yielding the highest recall and precision, ahead of fuller queries that add author and other fields. Retrieval completeness varies by country, as English-speaking nations and English-language publications achieve higher match rates due to language and indexing differences. Finally, citation counts from Microsoft Academic align strongly with those from Scopus, supporting the platform as a practical alternative for bibliometric analysis despite variation across fields.

