Research Publication and Citation Sources

Factbase aggregates 250M+ publications from global sources, de-duplicates using ORCID/DOI/ROR identifiers, and classifies works against GINC's strategic framework for capability assessment.

AI SUMMARY
  • Factbase aggregates 250+ million research publications from multiple authoritative sources, including Crossref, PubMed, and arXiv, together with citation network data.
  • Sophisticated de-duplication and cross-referencing using ORCID, DOI, and ROR identifiers ensures each work is counted once and attributed accurately to authors and institutions.
  • Publications are classified against GINC's strategic capability framework using machine learning and citation analysis, enabling topic-specific assessment beyond traditional academic disciplines.

Factbase aggregates global research output from multiple authoritative bibliometric databases to provide comprehensive, high-quality intelligence on research capability across critical technology domains.

This article focuses specifically on research publication and citation data sources. Factbase also incorporates other intelligence streams (actors, assets, and strategic capability indicators) which are documented separately.

Our value in research intelligence lies not just in data collection, but in rigorous de-duplication, cross-referencing, and strategic classification that transforms raw publication data into actionable capability assessments.


Primary Research Data Sources

Publication Coverage

Factbase provides comprehensive coverage of global scholarly output, including:

  • 250+ million scholarly works across all disciplines
  • 100,000+ peer-reviewed journals from major publishers and independent presses
  • Conference proceedings from leading technical societies (IEEE, ACM, and discipline-specific organisations)
  • Preprint repositories including arXiv (physics, mathematics, computer science), bioRxiv (life sciences), medRxiv (health sciences), SSRN (social sciences), and other discipline-specific archives
  • Open access content from institutional repositories and subject-specific platforms worldwide

Citation Networks

Citation data is sourced from multiple authoritative databases to ensure comprehensive coverage:

  • Crossref – 150+ million DOI records with citation metadata from thousands of publishers
  • PubMed/PubMed Central – Biomedical and life sciences publications with citation links
  • Semantic Scholar – AI-enhanced citation extraction and analysis
  • Publisher APIs – Direct feeds from major academic publishers
  • Institutional repositories – Citations from open access platforms globally

Research Metadata Enhancement

Publication records are enriched with standardised identifiers and structured metadata:

  • ORCID – Author identifiers for accurate disambiguation across name variations and institutional moves
  • ROR (Research Organization Registry) – Institutional identifiers linking affiliations to standardised organisation records
  • DOI (Digital Object Identifier) – Persistent identifiers for cross-referencing and de-duplication
  • Funding acknowledgments – Grant numbers and funder information extracted from publication text
  • Full-text analysis – Where available, enhanced with abstract and full-text content for topic classification

Factbase Research Data Processing Pipeline

Stage 1: Aggregation

Factbase collects publication records from multiple sources, capturing:

  • Bibliographic metadata (title, authors, journal, date)
  • Author affiliations and institutional data
  • Citation relationships (both citing and cited works)
  • Document identifiers (DOI, PubMed ID, arXiv ID, etc.)
  • Abstract and keywords where available

Stage 2: De-duplication

The same publication may appear across multiple databases with slight variations. Factbase implements sophisticated de-duplication to ensure each work is counted exactly once:

Identifier matching:

  • DOI-based matching (primary method)
  • PubMed ID, arXiv ID, and other persistent identifiers
  • ISBN for books and edited volumes

Fuzzy matching for records without identifiers:

  • Title similarity algorithms
  • Author name matching (accounting for variations)
  • Publication year and venue matching
  • Citation pattern analysis

Result: Each scholarly work appears once in Factbase, regardless of how many source databases index it.
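
The two-tier matching described above can be illustrated with a minimal Python sketch. It is not Factbase's implementation: the record fields (doi, title, year), the helper names, and the 0.9 title-similarity threshold are all assumptions chosen for illustration, and the standard-library SequenceMatcher stands in for whatever title similarity algorithm is used in production.

```python
import re
from difflib import SequenceMatcher


def normalise_doi(doi):
    """Lower-case a DOI and strip a leading doi.org URL or 'doi:' prefix."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi or None


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a, b, title_threshold=0.9):
    """Return True if records a and b appear to describe the same work."""
    doi_a, doi_b = normalise_doi(a.get("doi")), normalise_doi(b.get("doi"))
    if doi_a and doi_b:
        # Both records carry a DOI: the identifier comparison is decisive.
        return doi_a == doi_b
    # Fallback for records lacking an identifier: same year, similar title.
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(
        None, normalise_title(a["title"]), normalise_title(b["title"])
    ).ratio()
    return ratio >= title_threshold  # threshold is an assumed value


def deduplicate(records):
    """Keep the first record of each duplicate group (quadratic, sketch only)."""
    unique = []
    for rec in records:
        if not any(is_duplicate(rec, kept) for kept in unique):
            unique.append(rec)
    return unique
```

In this sketch a preprint and its published version would be kept as separate works if they carry different DOIs. A real pipeline would also need author, venue, and citation-pattern signals, plus an index to avoid comparing every pair of records.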

Stage 3: Cross-referencing and Validation

Publication records are cross-referenced against authoritative identifier systems to improve accuracy:

Author disambiguation:

  • ORCID matching where available (20%+ of authors)
  • Institutional affiliation patterns
  • Co-author networks
  • Publication history consistency
  • Name variation handling (e.g., "J. Smith" vs "Jane Smith" vs "J.A. Smith")
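
As a rough illustration of name variation handling, the sketch below tests whether two name strings could refer to the same person. The function names are hypothetical, and the sketch assumes names are written "Given Surname" with optional initials, which does not hold for all naming conventions.

```python
def parse_name(name):
    """Split 'J.A. Smith' into ('smith', ['j', 'a'])."""
    tokens = name.replace(".", " ").lower().split()
    return tokens[-1], tokens[:-1]


def names_compatible(a, b):
    """Return True if surnames match and no given name or initial conflicts."""
    surname_a, given_a = parse_name(a)
    surname_b, given_b = parse_name(b)
    if surname_a != surname_b:
        return False
    # Compare given names position by position; an initial is treated as
    # matching any given name that starts with the same letter.
    for x, y in zip(given_a, given_b):
        if not (x.startswith(y) or y.startswith(x)):
            return False
    return True
```

Compatibility is not identity: "J. Smith" is compatible with both "Jane Smith" and "John Smith". That is why the other signals listed above (ORCID, affiliations, co-author networks, publication history) are needed to decide between candidates.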

Institutional standardisation:

  • ROR matching for ~100,000 research organisations
  • Affiliation string parsing and normalisation
  • Parent-child institutional relationships (e.g., department → university → country)
  • Historical institution name changes and mergers

Citation validation:

  • Reciprocal citation verification (if A's reference list includes B, B's record should list A among its citing works)
  • Citation count reconciliation across sources
  • Self-citation identification
  • Temporal consistency checks (cited work must pre-date citing work)

Result: Clean, standardised records with accurate author attribution, institutional affiliations, and citation networks.
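
Two of the citation checks above, temporal consistency and self-citation identification, can be sketched as follows. The data layout (dictionaries keyed by a work identifier, author sets holding disambiguated author identifiers) is an assumption made for this example, and dates are compared at year granularity only.

```python
def validate_citations(pub_years, citations, authors):
    """Flag suspicious citation links.

    pub_years: dict mapping work id -> publication year
    citations: list of (citing_id, cited_id) pairs
    authors:   dict mapping work id -> set of author identifiers
    Returns (temporal_flags, self_citations), both lists of pairs.
    """
    temporal_flags = []
    self_citations = []
    for citing, cited in citations:
        # A cited work dated later than the citing work is suspicious.
        if pub_years[cited] > pub_years[citing]:
            temporal_flags.append((citing, cited))
        # Any shared author marks the link as a self-citation.
        if authors[citing] & authors[cited]:
            self_citations.append((citing, cited))
    return temporal_flags, self_citations
```

Flagged links are returned rather than deleted, since a flag can also indicate a wrong publication date or a citation to a forthcoming work rather than a bad link.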

Stage 4: Strategic Topic Classification

This is where Factbase adds unique value beyond standard bibliometric databases.

GINC National Capability Framework alignment:

Each publication is classified against GINC's proprietary topic taxonomy, which maps to strategic capability domains:

  • Critical technologies (quantum computing, artificial intelligence, hypersonics, biotechnology, etc.)
  • Defence domains (land, maritime, air, space, cyber, intelligence, nuclear)
  • Emerging capabilities (synthetic biology, advanced materials, autonomous systems, etc.)

Classification methodology:

  • Machine learning models trained on expert-curated topic definitions
  • Full-text analysis where available (not just keywords)
  • Citation network analysis (papers citing similar works likely share topics)
  • Author expertise patterns
  • Hierarchical taxonomy allowing multiple granularity levels

Result: Publications are scored and classified according to strategic relevance, not just academic discipline.


Research Data Quality Assurance

Coverage Quality

Disciplinary balance:

  • Coverage monitored across disciplines to ensure no systematic gaps
  • Benchmarked against known publication volumes (e.g., UNESCO science statistics)
  • Validated against major disciplinary databases (e.g., PubMed for biomedicine)

Temporal completeness:

  • Historical coverage verified against journal archives
  • Recent publications validated against publisher feeds
  • Citation accumulation patterns checked for consistency

Data Accuracy

Author attribution:

  • Fractional credit calculations validated against author lists
  • Institutional affiliations verified against ROR and ORCID records
  • Multi-affiliation cases handled according to OECD methodology
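
To make fractional credit concrete, here is a small worked sketch. It assumes one common convention: each publication carries one unit of credit, split equally among authors, with each author's share split equally among that author's affiliations. The exact weighting Factbase applies to multi-affiliation cases is not specified in this article, so treat the split rule and the institution names as illustrative.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_credit(author_affiliations):
    """Distribute one unit of credit for a single publication.

    author_affiliations: one list of institution ids per author.
    Returns a dict mapping institution id -> Fraction of the paper.
    """
    credit = defaultdict(Fraction)
    n_authors = len(author_affiliations)
    for affiliations in author_affiliations:
        author_share = Fraction(1, n_authors)
        # Assumed rule: an author's share is split equally across affiliations.
        # An author with no affiliation contributes no credit in this sketch.
        for inst in affiliations:
            credit[inst] += author_share / len(affiliations)
    return dict(credit)


# Three authors; the second author declares two affiliations.
paper = [["uni_a"], ["uni_a", "lab_b"], ["uni_c"]]
shares = fractional_credit(paper)
# uni_a: 1/3 + 1/6 = 1/2, lab_b: 1/6, uni_c: 1/3
```

Using exact fractions makes it easy to validate that the shares for a fully affiliated author list sum to exactly one publication.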

Citation accuracy:

  • Cross-referenced against multiple citation sources
  • Outlier citation counts flagged for manual review
  • Self-citation rates monitored for anomalies

Metadata validation:

  • Publication dates verified against publisher records
  • Journal names standardised against ISSN registry
  • Document types classified consistently

Update Frequency

Quarterly updates ensure Factbase reflects current research output:

  • New publications added as indexed by source databases
  • Citation counts updated to reflect latest citation activity
  • Author and institutional metadata refreshed from ORCID and ROR
  • Topic classifications updated based on full-text availability

Between quarterly updates:

  • Critical corrections applied as needed
  • High-priority topics may receive monthly updates
  • User-reported issues investigated and resolved

What Makes Factbase Research Data Different

Beyond Standard Bibliometrics

Most research databases (Web of Science, Scopus, Google Scholar) provide:

  • Publication counts
  • Citation counts
  • Basic disciplinary classification (e.g., "Computer Science", "Physics")

Factbase adds:

  1. Strategic topic granularity
    • Not just "Computer Science" but "Quantum Machine Learning", "Adversarial AI", "Neuromorphic Computing"
    • Topics aligned with national capability assessment, not academic departmental structure
  2. Multi-source integration
    • Combines coverage from multiple databases to maximise completeness
    • De-duplicates to ensure accurate counts
    • Cross-validates to improve accuracy
  3. Quality-weighted metrics
    • TMCM (Topic Median Citation Multiple) for field-normalised quality assessment
    • Fractional credit for fair international comparison
    • Excellence indicators (top 1%, top 10%) for breakthrough identification
  4. Capability-focused analysis
    • Research output mapped to strategic technology domains
    • Institutional and national capability profiles
    • Trend analysis across multiple time windows (3Y, 5Y, 10Y, 20Y)
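
This article does not define how TMCM is calculated. Reading the name literally, one plausible interpretation is a work's citation count divided by the median citation count of works in the same topic. The sketch below implements only that assumed reading; the function name and the handling of a zero median are illustrative, not the documented method.

```python
from statistics import median


def tmcm(citations, topic_citation_counts):
    """Citation count as a multiple of the topic median (assumed definition).

    Returns None when the topic median is zero, since the ratio is undefined.
    """
    topic_median = median(topic_citation_counts)
    if topic_median == 0:
        return None
    return citations / topic_median


# A topic whose works have 0, 2, 4, 10 and 50 citations has a median of 4,
# so a work with 12 citations scores 3.0 under this reading.
score = tmcm(12, [0, 2, 4, 10, 50])
```

A median-based baseline is less sensitive than a mean to a few very highly cited works, which is one reason a median multiple can suit skewed citation distributions. A production metric would likely also control for publication year, which this sketch omits.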

Integration with Broader Factbase Intelligence

Research publication data represents one component of Factbase's comprehensive capability assessment framework:

Research intelligence (this article):

  • Publication output and quality
  • Citation networks and research impact
  • Author expertise and institutional strength
  • Research trajectory and momentum

Actor intelligence (documented separately):

  • Key researchers and research leaders
  • Institutional profiles and strategic positioning
  • Funding flows and resource allocation
  • Collaboration networks

Asset intelligence (documented separately):

  • Research infrastructure and facilities
  • Technology demonstrators and prototypes
  • Patents and intellectual property
  • Commercial applications

Together, these streams provide holistic capability assessment beyond what publication metrics alone can reveal.

Transparency and Reproducibility

Factbase maintains transparency in research data:

Source attribution: We document which databases contribute to our coverage

Methodology: Detailed documentation of de-duplication, classification, and metric calculation

Version control: All analyses specify data version and calculation date

Quality flags: Records with incomplete metadata or ambiguous classification are flagged


Research Data Limitations and Caveats

What Factbase Research Data Covers

Academic journal articles – comprehensive global coverage

Conference proceedings – major technical conferences, particularly in computer science and engineering

Preprints – from major repositories (arXiv, bioRxiv, medRxiv, SSRN)

Open access content – institutional and subject repositories

What Research Data Does Not Fully Capture

Patents – Not included in publication counts (tracked separately in asset intelligence)

Technical reports – Government and corporate technical reports often not indexed

Books – Monographs and edited volumes have limited coverage compared to journals

Non-English publications – Coverage skewed toward English-language research, particularly for pre-2000 publications

Classified research – Defence and national security research that is not publicly disclosed

Commercial R&D – Corporate research not published in academic venues

Grey literature – Working papers, policy documents, and informal publications have variable coverage

Known Research Data Biases

Geographic bias:

  • Higher coverage in countries with strong open access policies
  • Western institutions better represented in older records

Linguistic bias:

  • English-language publications over-represented
  • Non-English journals less comprehensively indexed

Disciplinary bias:

  • STEM fields (science, technology, engineering, mathematics) most complete
  • Humanities and some social sciences less comprehensive

Recency bias:

  • Digital-era publications (post-2000) more complete than historical records
  • Very recent publications (<6 months) may have incomplete citation data

Factbase acknowledges these limitations and applies appropriate caveats in analysis.


Research Data Access and Usage

Who Uses Factbase Research Data

  • Government agencies – National capability assessment and S&T policy
  • Defence organisations – Technology threat assessment and opportunity identification
  • Research institutions – Strategic planning and benchmarking
  • Policy analysts – Evidence-based research policy development

Data Ethics and Privacy

Author privacy:

  • Only publicly available publication data is used
  • No personal contact information is collected or stored
  • ORCID integration respects author-controlled public profiles

Institutional attribution:

  • Based on author-declared affiliations in publications
  • Multiple affiliations handled according to international standards

Responsible use:

  • Data provided for research intelligence, not individual evaluation
  • Metrics designed for aggregate (country/institution) assessment
  • Not intended for hiring, promotion, or individual performance decisions

Summary

Factbase provides comprehensive, high-quality research publication and citation intelligence through:

  1. Broad coverage – 250+ million works from authoritative global sources
  2. Rigorous processing – De-duplication and cross-referencing for accuracy
  3. Strategic classification – Publications scored against GINC's national capability framework
  4. Quality metrics – TMCM and fractional credit for fair international comparison
  5. Continuous updates – Quarterly refreshes ensure current intelligence

Our value proposition in research data is not just collection, but transformation – converting raw bibliometric records into actionable strategic intelligence on research capability in critical technologies.

This research intelligence integrates with Factbase's broader capability assessment framework, which includes actor profiles, asset tracking, and strategic indicators to provide comprehensive national and institutional capability analysis.
