By factbase in Sources — 25 Jan 2026

Research Publication and Citation Sources

Factbase aggregates 250M+ publications from global sources, de-duplicates using ORCID/DOI/ROR identifiers, and classifies works against GINC's strategic framework for capability assessment.

Factbase aggregates 250+ million research publications from multiple authoritative sources, including Crossref, PubMed, and arXiv, together with comprehensive citation network data.
De-duplication and cross-referencing using ORCID, DOI, and ROR identifiers ensure each work is counted once and attributed accurately to authors and institutions.
Publications are classified against GINC's strategic capability framework using machine learning and citation analysis, enabling topic-specific assessment beyond traditional academic disciplines.

Factbase aggregates global research output from multiple authoritative bibliometric databases to provide comprehensive, high-quality intelligence on research capability across critical technology domains.

This article covers research publication and citation data sources only. Factbase also incorporates other intelligence streams (actors, assets, and strategic capability indicators), which are documented separately.

Our value in research intelligence lies not just in data collection, but in rigorous de-duplication, cross-referencing, and strategic classification that transforms raw publication data into actionable capability assessments.

Primary Research Data Sources

Publication Coverage

Factbase provides comprehensive coverage of global scholarly output, including:

250+ million scholarly works across all disciplines
100,000+ peer-reviewed journals from major publishers and independent presses
Conference proceedings from leading technical societies (IEEE, ACM, and discipline-specific organisations)
Preprint repositories including arXiv (physics, mathematics, computer science), bioRxiv (life sciences), medRxiv (health sciences), SSRN (social sciences), and other discipline-specific archives
Open access content from institutional repositories and subject-specific platforms worldwide

Citation Networks

Citation data is sourced from multiple authoritative databases to ensure comprehensive coverage:

Crossref – 150+ million DOI records with citation metadata from thousands of publishers
PubMed/PubMed Central – Biomedical and life sciences publications with citation links
Semantic Scholar – AI-enhanced citation extraction and analysis
Publisher APIs – Direct feeds from major academic publishers
Institutional repositories – Citations from open access platforms globally

Research Metadata Enhancement

Publication records are enriched with standardised identifiers and structured metadata:

ORCID – Author identifiers for accurate disambiguation across name variations and institutional moves
ROR (Research Organization Registry) – Institutional identifiers linking affiliations to standardised organisation records
DOI (Digital Object Identifier) – Persistent identifiers for cross-referencing and de-duplication
Funding acknowledgments – Grant numbers and funder information extracted from publication text
Full-text analysis – Where available, abstracts and full-text content are added to records to support topic classification

Factbase Research Data Processing Pipeline

Stage 1: Aggregation

Factbase collects publication records from multiple sources, capturing:

Bibliographic metadata (title, authors, journal, date)
Author affiliations and institutional data
Citation relationships (both citing and cited works)
Document identifiers (DOI, PubMed ID, arXiv ID, etc.)
Abstract and keywords where available

Stage 2: De-duplication

The same publication may appear across multiple databases with slight variations. Factbase implements sophisticated de-duplication to ensure each work is counted exactly once:

Identifier matching:

DOI-based matching (primary method)
PubMed ID, arXiv ID, and other persistent identifiers
ISBN for books and edited volumes

Fuzzy matching for records without identifiers:

Title similarity algorithms
Author name matching (accounting for variations)
Publication year and venue matching
Citation pattern analysis

Result: Each scholarly work appears once in Factbase, regardless of how many source databases index it.
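
The two matching tiers above can be illustrated with a minimal sketch. This is not the Factbase implementation: the record fields (`doi`, `title`, `year`), the function names, and the 0.92 similarity threshold are all assumptions chosen for illustration. It uses only the Python standard library.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise_doi(raw):
    """Return a lowercase bare DOI, or None if no DOI is present."""
    if not raw:
        return None
    doi = raw.strip().lower()  # DOIs are case-insensitive
    # Strip common resolver and scheme prefixes so variants compare equal.
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalise_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def is_duplicate(a, b, title_threshold=0.92):
    """Decide whether two record dicts describe the same work.

    The threshold is an illustrative assumption, not a tuned value.
    """
    doi_a, doi_b = normalise_doi(a.get("doi")), normalise_doi(b.get("doi"))
    if doi_a and doi_b:
        # Identifier matching is treated as decisive when both sides have a DOI.
        return doi_a == doi_b
    # Fallback: fuzzy title similarity plus an exact publication-year match.
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(
        None, normalise_title(a["title"]), normalise_title(b["title"])
    ).ratio()
    return ratio >= title_threshold
```

A production pipeline would also weigh author names, venue, and citation patterns, as listed above. The sketch stops at title and year to stay short.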

Stage 3: Cross-referencing and Validation

Publication records are cross-referenced against authoritative identifier systems to improve accuracy:

Author disambiguation:

ORCID matching where available (20%+ of authors)
Institutional affiliation patterns
Co-author networks
Publication history consistency
Name variation handling (e.g., "J. Smith" vs "Jane Smith" vs "J.A. Smith")
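
As an illustration of name variation handling, the sketch below checks whether two name strings are compatible, meaning they could belong to the same person. It assumes names are written given-name-first with a single-token surname, and the function names are hypothetical. Compatibility alone is not a match: "J. Smith" is compatible with both "Jane Smith" and "John Smith", so the other signals listed above (ORCID, affiliations, co-authors) would still be needed to decide.

```python
def name_tokens(name):
    """Split 'J.A. Smith' into (['j', 'a'], 'smith'): given-name parts, surname."""
    parts = name.replace(".", " ").lower().split()
    return parts[:-1], parts[-1]


def names_compatible(a, b):
    """True if two non-empty name strings could refer to the same person."""
    given_a, surname_a = name_tokens(a)
    given_b, surname_b = name_tokens(b)
    if surname_a != surname_b:
        return False
    # Compare given-name parts pairwise; an initial matches any name that
    # starts with it, and a missing middle name is not treated as a conflict.
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True
```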

Institutional standardisation:

ROR matching for ~100,000 research organisations
Affiliation string parsing and normalisation
Parent-child institutional relationships (e.g., department → university → country)
Historical institution name changes and mergers
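
Parent-child relationships can be pictured as a walk up a hierarchy. The sketch below assumes a simple mapping from each organisation to its parent. The identifiers are hypothetical placeholders, not real ROR IDs, and the structure is a simplification rather than the Factbase data model.

```python
def ancestors(org_id, parent_of):
    """Return the chain from an organisation up to its top-level parent."""
    chain, seen = [org_id], {org_id}
    while parent_of.get(chain[-1]) is not None:
        nxt = parent_of[chain[-1]]
        if nxt in seen:  # guard against cyclic data
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


# Hypothetical hierarchy: department -> university (top level).
parent_of = {"dept_physics": "uni_x", "uni_x": None}
chain = ancestors("dept_physics", parent_of)
```

With a chain like this, a publication attributed to a department can also be counted at the university level and above.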

Citation validation:

Reciprocal citation verification (if A cites B, then B's record should list A as a citing work)
Citation count reconciliation across sources
Self-citation identification
Temporal consistency checks (cited work must pre-date citing work)
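
A minimal sketch of three of these checks follows. The record fields (`published`, `author_ids`) and the flag names are assumptions for illustration. The sketch flags links for review rather than rejecting them, because the source does not say how failures are handled.

```python
from datetime import date


def check_citation(citing, cited):
    """Return a list of flags for one citing -> cited link."""
    flags = []
    # Temporal consistency: the cited work is expected to pre-date the citing work.
    if cited["published"] > citing["published"]:
        flags.append("cited_after_citing")
    # Self-citation: at least one author identifier appears on both works.
    if set(citing["author_ids"]) & set(cited["author_ids"]):
        flags.append("self_citation")
    return flags


def missing_reciprocals(references, cited_by):
    """Links where A lists B as a reference but B does not list A as citing."""
    return [
        (a, b)
        for a, refs in references.items()
        for b in refs
        if a not in cited_by.get(b, set())
    ]


paper_a = {"published": date(2022, 5, 1), "author_ids": ["au1", "au2"]}
paper_b = {"published": date(2023, 1, 15), "author_ids": ["au2", "au3"]}
flags = check_citation(citing=paper_a, cited=paper_b)
```

Here `flags` contains both flags: paper B is dated after paper A, and the two works share the author `au2`.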

Result: Clean, standardised records with accurate author attribution, institutional affiliations, and citation networks.

Stage 4: Strategic Topic Classification

This stage is where Factbase goes beyond standard bibliometric databases, which classify works by academic discipline rather than by strategic capability.

GINC National Capability Framework alignment:

Each publication is classified against GINC's proprietary topic taxonomy, which maps to strategic capability domains:

Critical technologies (quantum computing, artificial intelligence, hypersonics, biotechnology, etc.)
Defence domains (land, maritime, air, space, cyber, intelligence, nuclear)
Emerging capabilities (synthetic biology, advanced materials, autonomous systems, etc.)

Classification methodology:

Machine learning models trained on expert-curated topic definitions
Full-text analysis where available (not just keywords)
Citation network analysis (papers citing similar works likely share topics)
Author expertise patterns
Hierarchical taxonomy allowing multiple granularity levels
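
The citation network signal can be illustrated with bibliographic coupling, where two papers that share many references are likely to share a topic. The sketch below is not the Factbase classifier, which the source describes as machine learning models. It shows only the coupling idea, and the function names, identifiers, and the 0.2 threshold are assumptions.

```python
def coupling_strength(refs_a, refs_b):
    """Jaccard overlap of two reference lists (bibliographic coupling)."""
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_topic(refs, labelled, min_strength=0.2):
    """Suggest the topic of the most strongly coupled labelled paper, if any.

    `labelled` is a list of (topic, reference_list) pairs.
    """
    best_topic, best_score = None, 0.0
    for topic, other_refs in labelled:
        score = coupling_strength(refs, other_refs)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic if best_score >= min_strength else None


labelled = [
    ("Quantum Machine Learning", ["d1", "d2", "d4"]),
    ("Neuromorphic Computing", ["d9"]),
]
topic = suggest_topic(["d1", "d2", "d3"], labelled)
```

In this example the unlabelled paper shares two of four distinct references with the first labelled paper (strength 0.5), so `topic` is "Quantum Machine Learning".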

Result: Publications are scored and classified according to strategic relevance, not just academic discipline.

Research Data Quality Assurance

Coverage Quality

Disciplinary balance:

Coverage monitored across disciplines to detect systematic gaps
Benchmarked against known publication volumes (e.g., UNESCO science statistics)
Validated against major disciplinary databases (e.g., PubMed for biomedicine)

Temporal completeness:

Historical coverage verified against journal archives
Recent publications validated against publisher feeds
Citation accumulation patterns checked for consistency

Data Accuracy

Author attribution:

Fractional credit calculations validated against author lists
Institutional affiliations verified against ROR and ORCID records
Multi-affiliation cases handled according to OECD methodology
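
Fractional credit can be sketched as follows. This uses one common convention: each publication carries one unit of credit, split equally among authors, and an author with several affiliations splits their share equally among them. Whether this matches the OECD methodology referenced above in every detail is an assumption, and the institution identifiers are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_credit(authors):
    """Split one publication's credit across institutions.

    `authors` is a list with one entry per author; each entry is the list of
    institution identifiers that author declared on the paper.
    """
    credit = defaultdict(Fraction)
    author_share = Fraction(1, len(authors))
    for affiliations in authors:
        if not affiliations:
            credit["unaffiliated"] += author_share
            continue
        # An author with several affiliations splits their share equally.
        for inst in affiliations:
            credit[inst] += author_share / len(affiliations)
    return dict(credit)


# Three authors; the second declares two affiliations.
shares = fractional_credit([["uni_a"], ["uni_a", "uni_b"], ["uni_c"]])
```

Here `uni_a` receives 1/3 + 1/6 = 1/2, `uni_b` receives 1/6, and `uni_c` receives 1/3. Using `Fraction` keeps the shares exact, so they can be validated by checking that they sum to 1 for every publication.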

Citation accuracy:

Cross-referenced against multiple citation sources
Outlier citation counts flagged for manual review
Self-citation rates monitored for anomalies

Metadata validation:

Publication dates verified against publisher records
Journal names standardised against ISSN registry
Document types classified consistently

Update Frequency

Quarterly updates ensure Factbase reflects current research output:

New publications added as indexed by source databases
Citation counts updated to reflect latest citation activity
Author and institutional metadata refreshed from ORCID and ROR
Topic classifications updated based on full-text availability

Between quarterly updates:

Critical corrections applied as needed
High-priority topics may receive monthly updates
User-reported issues investigated and resolved

What Makes Factbase Research Data Different

Beyond Standard Bibliometrics

Most research databases (Web of Science, Scopus, Google Scholar) provide:

Publication counts
Citation counts
Basic disciplinary classification (e.g., "Computer Science", "Physics")

Factbase adds:

Strategic topic granularity
- Not just "Computer Science" but "Quantum Machine Learning", "Adversarial AI", "Neuromorphic Computing"
- Topics aligned with national capability assessment, not academic departmental structure
Multi-source integration
- Combines coverage from multiple databases to maximise completeness
- De-duplicates to ensure accurate counts
- Cross-validates to improve accuracy
Quality-weighted metrics
- TMCM (Topic Median Citation Multiple) for field-normalised quality assessment
- Fractional credit for fair international comparison
- Excellence indicators (top 1%, top 10%) for breakthrough identification
Capability-focused analysis
- Research output mapped to strategic technology domains
- Institutional and national capability profiles
- Trend analysis across multiple time windows (3Y, 5Y, 10Y, 20Y)
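
The quality-weighted metrics can be illustrated with a short sketch. This article does not define TMCM, so the code assumes, from the name alone, that it is a paper's citation count divided by the median citation count of papers in the same topic. The excellence threshold is likewise a simple rank-based assumption, not the documented Factbase calculation.

```python
from statistics import median


def tmcm(citations, topic_citations):
    """Citations of one paper as a multiple of its topic's median count.

    ASSUMED definition, inferred from the metric's name.
    """
    topic_median = median(topic_citations)
    if topic_median == 0:
        return None  # undefined when the median paper is uncited
    return citations / topic_median


def top_share_threshold(topic_citations, share=0.10):
    """Smallest citation count that places a paper in the top `share`."""
    ranked = sorted(topic_citations, reverse=True)
    cutoff_index = max(1, int(len(ranked) * share)) - 1
    return ranked[cutoff_index]


topic = [0, 1, 2, 4, 4, 6, 8, 10, 30, 120]
multiple = tmcm(20, topic)
```

The topic median here is 5, so a paper with 20 citations has a multiple of 4.0. Normalising by a median rather than a mean keeps a single highly cited outlier (the paper with 120 citations) from distorting the baseline.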

Integration with Broader Factbase Intelligence

Research publication data represents one component of Factbase's comprehensive capability assessment framework:

Research intelligence (this article):

Publication output and quality
Citation networks and research impact
Author expertise and institutional strength
Research trajectory and momentum

Actor intelligence (documented separately):

Key researchers and research leaders
Institutional profiles and strategic positioning
Funding flows and resource allocation
Collaboration networks

Asset intelligence (documented separately):

Research infrastructure and facilities
Technology demonstrators and prototypes
Patents and intellectual property
Commercial applications

Together, these streams provide holistic capability assessment beyond what publication metrics alone can reveal.

Transparency and Reproducibility

Factbase documents how its research data is sourced and processed:

Source attribution: We document which databases contribute to our coverage

Methodology: Detailed documentation of de-duplication, classification, and metric calculation

Version control: All analyses specify data version and calculation date

Quality flags: Records with incomplete metadata or ambiguous classification are flagged

Research Data Limitations and Caveats

What Factbase Research Data Covers

✅ Academic journal articles – comprehensive global coverage

✅ Conference proceedings – major technical conferences, particularly in computer science and engineering

✅ Preprints – from major repositories (arXiv, bioRxiv, medRxiv, SSRN)

✅ Open access content – institutional and subject repositories

What Research Data Does Not Fully Capture

❌ Patents – Not included in publication counts (tracked separately in asset intelligence)

❌ Technical reports – Government and corporate technical reports often not indexed

❌ Books – Monographs and edited volumes have limited coverage compared to journals

❌ Non-English publications – Coverage skewed toward English-language research, particularly for pre-2000 publications

❌ Classified research – Defence and national security research that is not publicly disclosed

❌ Commercial R&D – Corporate research not published in academic venues

❌ Grey literature – Working papers, policy documents, and informal publications have variable coverage

Known Research Data Biases

Geographic bias:

Higher coverage in countries with strong open access policies
Western institutions better represented in older records

Linguistic bias:

English-language publications over-represented
Non-English journals less comprehensively indexed

Disciplinary bias:

STEM fields (science, technology, engineering, mathematics) most complete
Humanities and some social sciences less comprehensive

Recency bias:

Digital-era publications (post-2000) more complete than historical records
Very recent publications (<6 months) may have incomplete citation data

Factbase acknowledges these limitations and applies appropriate caveats in analysis.

Research Data Access and Usage

Who Uses Factbase Research Data

Government agencies – National capability assessment and S&T policy
Defence organisations – Technology threat assessment and opportunity identification
Research institutions – Strategic planning and benchmarking
Policy analysts – Evidence-based research policy development

Data Ethics and Privacy

Author privacy:

Only publicly available publication data is used
No personal contact information is collected or stored
ORCID integration respects author-controlled public profiles

Institutional attribution:

Based on author-declared affiliations in publications
Multiple affiliations handled according to international standards

Responsible use:

Data provided for research intelligence, not individual evaluation
Metrics designed for aggregate (country/institution) assessment
Not intended for hiring, promotion, or individual performance decisions

Summary

Factbase provides comprehensive, high-quality research publication and citation intelligence through:

Broad coverage – 250+ million works from authoritative global sources
Rigorous processing – De-duplication and cross-referencing for accuracy
Strategic classification – Publications scored against GINC's national capability framework
Quality metrics – TMCM and fractional credit for fair international comparison
Continuous updates – Quarterly refreshes ensure current intelligence

Our value proposition in research data is not just collection, but transformation – converting raw bibliometric records into actionable strategic intelligence on research capability in critical technologies.

This research intelligence integrates with Factbase's broader capability assessment framework, which includes actor profiles, asset tracking, and strategic indicators to provide comprehensive national and institutional capability analysis.