Microsoft Academic automatic document searches: Accuracy for journal articles and suitability for citation analysis
Mike Thelwall
Abstract
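Microsoft Academic is a free academic search engine and citation index that is similar to Google Scholar but can be queried automatically. Its data is potentially useful for bibliometric analysis if individual journal articles can be searched for effectively. This article compares different methods of finding journal articles in its index by searching for combinations of title, authors, publication year, and journal name, and uses the results for the widest correlation analysis of Microsoft Academic citation counts for journal articles published so far. Based on 126,312 articles from 323 Scopus subfields in 2012, the optimal strategy for finding articles with DOIs is to search for them by title and filter out those with incorrect DOIs. This finds 90% of journal articles. For articles without DOIs, the optimal strategy is to search by title and then filter out matches with dissimilar metadata. This finds 89% of journal articles, with an additional 1% of incorrect matches. The remaining articles seem to be mainly either not indexed by Microsoft Academic or indexed under a different language version of their title. Among the matches, Scopus citation counts and Microsoft Academic citation counts have an average Spearman correlation of 0.95, with the lowest value for any single field being 0.63. Thus, Microsoft Academic citation counts are almost universally equivalent to Scopus citation counts for articles that are not recent, although there are national biases in the results.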
One-sentence Summary
By automating title-based searches and filtering mismatched metadata, this article demonstrates that Microsoft Academic retrieves approximately 90% of indexed journal articles and yields citation counts that correlate strongly (average Spearman correlation of 0.95) with Scopus, confirming its suitability for large-scale bibliometric analysis despite documented national biases.
Key Contributions
- This paper establishes an optimal retrieval strategy for journal articles with DOIs in Microsoft Academic by combining title-based searching with DOI validation filtering, successfully recovering 90% of target articles while eliminating incorrect matches.
- For articles without DOIs, the work introduces a title-driven search protocol that filters records with dissimilar metadata to maintain precision, retrieving 89% of target articles with only a 1% incorrect match rate.
- A large-scale correlation analysis of 126,312 articles across 323 Scopus subfields demonstrates an average Spearman correlation of 0.95 between Microsoft Academic and Scopus citation counts, confirming their equivalence for established publications while identifying minor national indexing biases.
Introduction
Scholarly impact assessment depends on accurate citation tracking, yet matching journal articles across academic databases remains a persistent technical hurdle. Existing approaches frequently encounter metadata inconsistencies, particularly for publications without digital object identifiers, and citation counts from different sources often produce divergent impact metrics across disciplines. To address these gaps, the authors systematically evaluate multiple matching strategies for identifying journal articles in Microsoft Academic, benchmarking their accuracy both with and without DOIs. They further examine how Microsoft Academic citation counts correlate with Scopus data, ultimately delivering field-aware recommendations for reliable bibliometric analysis.
Dataset
- Dataset Composition and Sources: The authors compile a bibliometric dataset drawn from Scopus-indexed journal articles published in 2012, paired with citation metrics and matching results retrieved from the Microsoft Academic API. This cross-database combination enables large-scale validation of citation indicators and automated document retrieval strategies.
- Subset Details and Filtering: The initial pool captures up to the last 5,000 articles from 2012 in each of 335 Scopus sub-fields. After excluding seven fields with zero records and filtering out articles lacking DOIs, the dataset contains 1,005,074 documents across 326 fields. The authors then draw a random sample of 400 articles per field without replacement and remove three underpopulated fields, arriving at a final set of 126,312 journal articles distributed across 323 sub-fields.
- Data Usage and Evaluation Framework: The authors do not employ traditional machine learning splits or mixture ratios. Instead, they use the dataset to systematically benchmark four Microsoft Academic query strategies across all 323 sub-fields. Each strategy is evaluated using precision and recall, followed by Spearman correlation analyses to validate Microsoft Academic citation counts against Scopus benchmarks. Geometric means are calculated for citation counts to handle skewed distributions.
- Metadata Construction and Processing: Query strings are built using only the first author (initial and surname), journal name, publication year, and title. All text undergoes strict normalization: conversion to lowercase, removal of accents, stripping of HTML tags, replacement of Greek letters with their phonetic equivalents, and substitution of special characters such as hyphens, apostrophes, and ampersands with spaces. Match validation relies on exact DOI comparison after lowercasing and dot removal. When DOIs are unavailable, the pipeline rejects results that show two or more metadata mismatches or a title word overlap below 85 percent.
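The normalization and match-validation steps described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code. The function names, the partial Greek-letter table, the choice of author, journal, and year as the compared metadata fields, and the definition of word overlap (the share of query-title words found in the candidate title) are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical, partial Greek-letter table; the source does not list its mapping.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "μ": "mu"}


def normalize_text(text: str) -> str:
    """Lowercase, strip HTML tags and accents, spell out Greek letters,
    and replace hyphens, apostrophes, and ampersands with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = text.lower()
    # Remove accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    for letter, name in GREEK.items():
        text = text.replace(letter, name)
    text = re.sub(r"[-'’&]", " ", text)           # special characters to spaces
    return " ".join(text.split())                 # collapse whitespace


def same_doi(a: str, b: str) -> bool:
    """Exact DOI comparison after lowercasing and removing dots."""
    return a.lower().replace(".", "") == b.lower().replace(".", "")


def title_overlap(query_title: str, candidate_title: str) -> float:
    """Share of normalized query-title words present in the candidate title
    (an assumed definition of 'title word overlap')."""
    query_words = normalize_text(query_title).split()
    candidate_words = set(normalize_text(candidate_title).split())
    if not query_words:
        return 0.0
    return sum(w in candidate_words for w in query_words) / len(query_words)


def accept_without_doi(query: dict, candidate: dict, min_overlap: float = 0.85) -> bool:
    """Reject a candidate with two or more metadata mismatches
    or a title word overlap below the threshold."""
    mismatches = sum(
        normalize_text(str(query[key])) != normalize_text(str(candidate[key]))
        for key in ("author", "journal", "year")  # assumed set of compared fields
    )
    return mismatches < 2 and title_overlap(query["title"], candidate["title"]) >= min_overlap
```

Under these assumptions, a candidate record that differs from the query only in its journal name is still accepted, while one that differs in both journal and year is rejected.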
Experiment
This study evaluated Microsoft Academic’s capacity to retrieve journal articles and match citation counts against Scopus across diverse disciplines and national contexts. Retrieval testing confirmed that optimized title-based queries achieve high precision and recall, though coverage gaps primarily stem from journal indexing limitations, multilingual title inconsistencies, and query generation errors. Citation analysis demonstrated that Microsoft Academic counts strongly align with Scopus data across most fields, validating the platform as a reliable and cost-effective alternative for bibliometric evaluation. While highly practical for researchers without institutional database access, the findings caution against unadjusted cross-national comparisons and warn that citation counts remain vulnerable to manipulation in formal assessment settings.
The authors compare query methods for retrieving articles from Microsoft Academic and evaluate the recall and precision of each. Title-only queries achieve the highest recall and precision, outperforming both the full query and the author-title query. The optimal method scores well on both median and mean recall and precision, and it is consistent with prior research. Query design is therefore a critical factor in retrieval effectiveness, with the simpler title-based search outperforming more complex queries.
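As a rough illustration of how recall and precision can be computed for one query method, the sketch below uses the standard definitions: recall is the share of queried articles that were correctly matched, and precision is the share of returned matches that were correct. The function name and the example counts are hypothetical; the counts are chosen only to mirror the 89% correct and 1% incorrect rates quoted in the abstract for a sample of 400 articles.

```python
def precision_recall(n_queried: int, n_correct: int, n_incorrect: int) -> tuple[float, float]:
    """Return (precision, recall) for one query method on one sample.

    n_queried   -- number of articles searched for
    n_correct   -- matches returned that are the right article
    n_incorrect -- matches returned that are the wrong article
    """
    n_returned = n_correct + n_incorrect
    precision = n_correct / n_returned if n_returned else 0.0
    recall = n_correct / n_queried if n_queried else 0.0
    return precision, recall


# Hypothetical sample: 400 articles queried, 356 correct and 4 incorrect matches.
precision, recall = precision_recall(400, 356, 4)
print(f"precision={precision:.3f} recall={recall:.3f}")
```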
The authors analyze the completeness of Microsoft Academic matches by the country affiliation of the first author. The table shows that countries with a higher percentage of English-language publications in Scopus also have higher match rates in Microsoft Academic: Australia and the United States achieve match rates of nearly 95%, while Brazil and China have lower rates. This suggests that language and indexing differences affect retrieval success, particularly for non-English publications.
The table presents summary statistics for Scopus and Microsoft Academic citation counts. The two sources correlate strongly, with an average Spearman correlation of 0.948 across fields, and their geometric mean citation counts differ only slightly. Microsoft Academic can therefore serve as a practical alternative to Scopus for citation analysis, though some variation exists across fields and articles.
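The two statistics used in this comparison can be sketched with the Python standard library alone. Spearman correlation is computed here as the Pearson correlation of average ranks, which handles tied citation counts. The geometric mean uses the offset form exp(mean(ln(1 + c))) - 1 so that uncited articles (c = 0) can be included; that offset is an assumption, since the summary above does not state which variant the authors used. All function names are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def geometric_mean_offset(counts):
    """Geometric mean with a +1 offset, so zero citation counts are allowed."""
    return math.exp(sum(math.log(1 + c) for c in counts) / len(counts)) - 1


# Hypothetical citation counts for the same five articles in each source.
scopus = [0, 3, 10, 25, 3]
microsoft_academic = [1, 2, 12, 30, 4]
print(spearman(scopus, microsoft_academic))
print(geometric_mean_offset(scopus), geometric_mean_offset(microsoft_academic))
```

In practice, the per-field correlations would be computed separately for each sub-field and then averaged, which is how an average figure such as 0.948 is obtained.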
The comparison of recall and precision across search strategies shows that adding author or other metadata to the query does not improve on searching by title alone: the title-only query, followed by filtering of mismatched results, remains the optimal method. This optimal method is consistent with prior research, but it performs better for journal articles than previous studies found for repository documents. Microsoft Academic citation counts correlate strongly with Scopus counts, though retrieval completeness varies with national and language factors.
The experiments evaluate Microsoft Academic’s retrieval effectiveness by testing various query strategies, assessing country-level match completeness, and comparing citation counts against Scopus. Query design proves critical: searching by title and then filtering out mismatched results yields the highest recall and precision, while queries that add author or other metadata perform less well. Retrieval completeness varies by country, with English-speaking nations and English-language publications achieving higher match rates, apparently because of language and indexing differences. Finally, citation counts from Microsoft Academic align strongly with Scopus, establishing the platform as a reliable alternative for bibliometric analysis despite minor variations across fields.