Microsoft Academic Automatic Document Searches: Accuracy for Journal Articles and Suitability for Citation Analysis
Mike Thelwall
Abstract
Microsoft Academic is a free academic search engine and citation index, similar to Google Scholar but open to automatic querying. Its data is potentially useful for bibliometric analysis if individual journal articles can be searched for effectively. This article compares different methods of finding journal articles in its index by searching for a combination of title, authors, publication year, and journal name, and uses the results for the most extensive correlation analysis published to date of Microsoft Academic citation counts for journal articles. Based on 126,312 articles from 323 Scopus subfields in 2012, the optimal strategy for finding articles with DOIs is to search for them by title and filter out those with incorrect DOIs. This finds 90% of journal articles. For articles without DOIs, the optimal strategy is to search for them by title and then filter out matches with dissimilar metadata. This finds 89% of journal articles, with an additional 1% of incorrect matches. The remaining articles seem mainly to be either not indexed by Microsoft Academic or indexed with a different language version of their title. Among the matches, Scopus and Microsoft Academic citation counts have an average Spearman correlation of 0.95, with the lowest for any single field being 0.63. Thus, Microsoft Academic citation counts are almost universally equivalent to Scopus citation counts for articles that are not recent, although national biases are present in the results.
One-sentence Summary
Using automated title-based searches and filtering out matches with mismatched metadata, this article shows that Microsoft Academic searches find approximately 90% of the sampled journal articles and that its citation counts correlate strongly with Scopus counts (average Spearman correlation of 0.95), supporting its suitability for large-scale bibliometric analysis despite documented national biases.
Key Contributions
- This paper establishes an optimal retrieval strategy for journal articles with DOIs in Microsoft Academic by combining title-based searching with DOI validation filtering, successfully recovering 90% of target articles while eliminating incorrect matches.
- For articles without DOIs, the work introduces a title-driven search protocol that filters records with dissimilar metadata to maintain precision, retrieving 89% of target articles with only a 1% incorrect match rate.
- A large-scale correlation analysis of 126,312 articles across 323 Scopus subfields demonstrates an average Spearman correlation of 0.95 between Microsoft Academic and Scopus citation counts, confirming their equivalence for established publications while identifying minor national indexing biases.
Introduction
Scholarly impact assessment depends on accurate citation tracking, yet matching journal articles across academic databases remains a persistent technical hurdle. Existing approaches frequently encounter metadata inconsistencies, particularly for publications without digital object identifiers, and citation counts from different sources often produce divergent impact metrics across disciplines. To address these gaps, the authors systematically evaluate multiple matching strategies for identifying journal articles in Microsoft Academic, benchmarking their accuracy both with and without DOIs. They further examine how Microsoft Academic citation counts correlate with Scopus data, ultimately delivering field-aware recommendations for reliable bibliometric analysis.
Dataset
- Dataset Composition and Sources: The authors compile a bibliometric dataset drawn from Scopus-indexed journal articles published in 2012, paired with citation metrics and matching results retrieved from the Microsoft Academic API. This cross-database combination enables large-scale validation of citation indicators and automated document retrieval strategies.
- Subset Details and Filtering: The initial pool captures the last 5,000 articles from 2012 in each of 335 Scopus sub-fields. After excluding seven fields with zero records and filtering out articles lacking DOIs, the dataset contains 1,005,074 documents across 326 fields. The authors then draw a random sample of 400 articles per field without replacement, ultimately removing three underpopulated fields to arrive at a final set of 126,312 journal articles distributed across 323 sub-fields.
- Data Usage and Evaluation Framework: The authors do not employ traditional machine learning splits or mixture ratios. Instead, they use the dataset to systematically benchmark four Microsoft Academic query strategies across all 323 sub-fields. Each strategy is evaluated using precision and recall metrics, followed by Spearman correlation analyses to validate Microsoft Academic citation counts against Scopus benchmarks. Geometric means are calculated for citation counts to handle skewed distributions.
- Metadata Construction and Processing: Query strings are built using only the first author (initial and surname), journal name, publication year, and title. All text undergoes strict normalization: conversion to lowercase, removal of accents, stripping of HTML tags, replacement of Greek letters with their phonetic equivalents, and substitution of special characters like hyphens, apostrophes, and ampersands with spaces. Match validation relies on exact DOI comparisons after lowercasing and dot removal. When DOIs are unavailable, the pipeline rejects results showing two or more metadata mismatches or title word overlaps below 85 percent.
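The normalization and match-validation rules described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the Greek-letter table is a hypothetical partial mapping, the metadata fields compared (`author`, `journal`, `year`) are assumed, and title overlap is assumed to mean the share of query-title words found in the returned title, since the summary does not define it.

```python
import html
import re
import unicodedata

# Hypothetical, partial Greek-letter table; the actual mapping is not given in the source.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "μ": "mu", "π": "pi"}


def normalize(text: str) -> str:
    """Lowercase, strip HTML tags and accents, spell out Greek letters,
    and replace special characters with spaces."""
    text = html.unescape(re.sub(r"<[^>]+>", "", text))  # drop tags, decode entities
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)           # split letters from accents
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    for letter, name in GREEK.items():
        text = text.replace(letter, name)
    text = re.sub(r"[^a-z0-9]+", " ", text)              # hyphens, apostrophes, & -> space
    return " ".join(text.split())


def same_doi(a: str, b: str) -> bool:
    """Exact DOI comparison after lowercasing and dot removal."""
    return a.lower().replace(".", "") == b.lower().replace(".", "")


def title_overlap(query_title: str, found_title: str) -> float:
    """Assumed definition: fraction of query-title words present in the found title."""
    query_words = normalize(query_title).split()
    found_words = set(normalize(found_title).split())
    if not query_words:
        return 0.0
    return sum(word in found_words for word in query_words) / len(query_words)


def accept_without_doi(query: dict, found: dict, min_overlap: float = 0.85) -> bool:
    """Reject on two or more metadata mismatches or title overlap below 85 percent."""
    fields = ("author", "journal", "year")  # assumed set of compared fields
    mismatches = sum(
        normalize(str(query[f])) != normalize(str(found[f])) for f in fields
    )
    return mismatches < 2 and title_overlap(query["title"], found["title"]) >= min_overlap
```

Keeping the DOI check separate from the metadata check mirrors the two cases in the text: a DOI, when present, decides the match on its own, while the metadata rule is the fallback.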
Experiment
This study evaluated Microsoft Academic’s capacity to retrieve journal articles and match citation counts against Scopus across diverse disciplines and national contexts. Retrieval testing confirmed that optimized title-based queries achieve high precision and recall, though coverage gaps primarily stem from journal indexing limitations, multilingual title inconsistencies, and query generation errors. Citation analysis demonstrated that Microsoft Academic counts strongly align with Scopus data across most fields, validating the platform as a reliable and cost-effective alternative for bibliometric evaluation. While highly practical for researchers without institutional database access, the findings caution against unadjusted cross-national comparisons and warn that citation counts remain vulnerable to manipulation in formal assessment settings.
The authors compare different query methods for retrieving articles from Microsoft Academic, evaluating the recall and precision of each. Title-only queries achieve the highest recall and precision, outperforming both full queries and author-title queries, and perform consistently well on median and mean values of both metrics. Query design is therefore a critical factor in retrieval effectiveness: simpler title-based searches outperform more complex queries, and the optimal method is consistent with prior research.
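As a rough illustration of how recall and precision might be computed for one query strategy, the sketch below assumes that recall is the share of searched articles for which the correct record was found and that precision is the share of returned records that were correct. These definitions, and the `QueryOutcome` type, are assumptions for illustration; the summary does not spell out the paper's exact formulas.

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    returned: bool  # the query strategy returned a record for this article
    correct: bool   # the returned record is the target article (e.g. the DOI agrees)


def precision_recall(outcomes: list[QueryOutcome]) -> tuple[float, float]:
    """Return (precision, recall) for one query strategy over a set of articles."""
    searched = len(outcomes)
    returned = sum(o.returned for o in outcomes)
    correct = sum(o.returned and o.correct for o in outcomes)
    precision = correct / returned if returned else 0.0
    recall = correct / searched if searched else 0.0
    return precision, recall
```

Computing the pair per sub-field and then taking the median and mean across sub-fields would give summary figures of the kind the comparison refers to.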
The authors analyze the completeness of Microsoft Academic article matches by the country affiliation of the first author. Countries with a higher percentage of English-language publications in Scopus also have higher match rates in Microsoft Academic: English-speaking nations such as Australia and the United States reach match rates of nearly 95%, while countries such as Brazil and China have lower rates. This suggests that language and indexing differences reduce retrieval success, particularly for non-English publications.
Summary statistics for the Scopus and Microsoft Academic citation counts show that the two sources align closely. The average Spearman correlation between them is 0.948, and their geometric mean citation counts differ only slightly. Microsoft Academic can therefore serve as a practical alternative to Scopus for citation analysis, though some variation exists across fields and articles.
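The two statistics used in this comparison can be sketched with the Python standard library. Spearman correlation is computed here as the Pearson correlation of average ranks, which handles tied citation counts. The geometric mean uses a log(1 + c) offset so that uncited articles can be included; that offset is an assumption, as the summary does not say how zeros are treated.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation between two lists of citation counts."""
    return pearson(average_ranks(x), average_ranks(y))


def geometric_mean_offset(counts: list[float]) -> float:
    """Geometric mean with a +1 offset (assumed): exp(mean(ln(1 + c))) - 1."""
    return math.exp(sum(math.log1p(c) for c in counts) / len(counts)) - 1
```

A rank-based correlation suits citation data because the counts are highly skewed, and the same reasoning motivates the geometric rather than the arithmetic mean.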
A second comparison of the query methods reports recall and precision for each search strategy, including median and mean values. Title-based searching followed by filtering remains the optimal approach, in line with prior research, whereas adding author, year, or journal terms to the query does not improve results. Microsoft Academic citation counts correlate strongly with Scopus counts, though retrieval completeness varies with national and language factors.
A further comparison evaluates the same query methods against earlier work. The optimal method, a title-based search, is consistent with prior research, but it performs better for journal articles than previous studies found for repository documents.
The experiments evaluate Microsoft Academic’s retrieval effectiveness by testing various query strategies, assessing country-level match completeness, and comparing citation counts against Scopus. Query design proves critical: title-based searches combined with DOI or metadata filtering give the best recall and precision, outperforming queries that add author, year, or journal information. Retrieval completeness varies by country, with English-speaking nations and English-language publications achieving higher match rates. Finally, citation counts from Microsoft Academic align strongly with Scopus counts, establishing the platform as a reliable alternative for bibliometric analysis despite some variation across fields.