Research Publication and Citation Sources
Factbase aggregates 250+ million publications from global sources, de-duplicates them using ORCID, DOI, and ROR identifiers, and classifies works against GINC's strategic framework for capability assessment.
- Factbase aggregates 250+ million research publications from multiple authoritative sources, including Crossref, PubMed, and arXiv, together with comprehensive citation network data.
- Sophisticated de-duplication and cross-referencing using ORCID, DOI, and ROR identifiers ensures each work is counted once and attributed accurately to authors and institutions.
- Publications are classified against GINC's strategic capability framework using machine learning and citation analysis, enabling topic-specific assessment beyond traditional academic disciplines.
Factbase aggregates global research output from multiple authoritative bibliometric databases to provide comprehensive, high-quality intelligence on research capability across critical technology domains.
This article focuses specifically on research publication and citation data sources. Factbase also incorporates other intelligence streams (actors, assets, and strategic capability indicators) which are documented separately.
Our value in research intelligence lies not just in data collection, but in rigorous de-duplication, cross-referencing, and strategic classification that transforms raw publication data into actionable capability assessments.
Primary Research Data Sources
Publication Coverage
Factbase provides comprehensive coverage of global scholarly output, including:
- 250+ million scholarly works across all disciplines
- 100,000+ peer-reviewed journals from major publishers and independent presses
- Conference proceedings from leading technical societies (IEEE, ACM, and discipline-specific organisations)
- Preprint repositories including arXiv (physics, mathematics, computer science), bioRxiv (life sciences), medRxiv (health sciences), SSRN (social sciences), and other discipline-specific archives
- Open access content from institutional repositories and subject-specific platforms worldwide
Citation Networks
Citation data is sourced from multiple authoritative databases to ensure comprehensive coverage:
- Crossref – 150+ million DOI records with citation metadata from thousands of publishers
- PubMed/PubMed Central – Biomedical and life sciences publications with citation links
- Semantic Scholar – AI-enhanced citation extraction and analysis
- Publisher APIs – Direct feeds from major academic publishers
- Institutional repositories – Citations from open access platforms globally
Research Metadata Enhancement
Publication records are enriched with standardised identifiers and structured metadata:
- ORCID – Author identifiers for accurate disambiguation across name variations and institutional moves
- ROR (Research Organization Registry) – Institutional identifiers linking affiliations to standardised organisation records
- DOI (Digital Object Identifier) – Persistent identifiers for cross-referencing and de-duplication
- Funding acknowledgments – Grant numbers and funder information extracted from publication text
- Full-text analysis – Where available, enhanced with abstract and full-text content for topic classification
Factbase Research Data Processing Pipeline
Stage 1: Aggregation
Factbase collects publication records from multiple sources, capturing:
- Bibliographic metadata (title, authors, journal, date)
- Author affiliations and institutional data
- Citation relationships (both citing and cited works)
- Document identifiers (DOI, PubMed ID, arXiv ID, etc.)
- Abstract and keywords where available
Stage 2: De-duplication
The same publication may appear across multiple databases with slight variations. Factbase implements sophisticated de-duplication to ensure each work is counted exactly once:
Identifier matching:
- DOI-based matching (primary method)
- PubMed ID, arXiv ID, and other persistent identifiers
- ISBN for books and edited volumes
Fuzzy matching for records without identifiers:
- Title similarity algorithms
- Author name matching (accounting for variations)
- Publication year and venue matching
- Citation pattern analysis
Result: Each scholarly work appears once in Factbase, regardless of how many source databases index it.
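The two-tier approach above (identifier matching first, fuzzy matching as a fallback) can be illustrated with a minimal Python sketch. It is not Factbase's actual implementation: the record fields (`doi`, `title`, `year`), the function names, and the 0.92 title-similarity threshold are all assumptions chosen for illustration, and the pairwise comparison is only practical for small batches.

```python
import re
from difflib import SequenceMatcher


def normalise_doi(doi):
    """Lower-case a DOI and strip common URL or 'doi:' prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalise_title(title):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def same_work(a, b, title_threshold=0.92):
    """Decide whether two records describe the same work (illustrative rules)."""
    doi_a, doi_b = normalise_doi(a.get("doi")), normalise_doi(b.get("doi"))
    if doi_a and doi_b:
        return doi_a == doi_b  # identifiers are decisive when both records have one
    if a.get("year") != b.get("year"):
        return False  # assumption: fuzzy matches must share a publication year
    ratio = SequenceMatcher(
        None, normalise_title(a["title"]), normalise_title(b["title"])
    ).ratio()
    return ratio >= title_threshold


def deduplicate(records):
    """Keep the first record seen for each distinct work."""
    unique = []
    for rec in records:
        if not any(same_work(rec, kept) for kept in unique):
            unique.append(rec)
    return unique


records = [
    {"doi": "https://doi.org/10.1000/ABC123", "title": "Quantum Error Correction in Practice", "year": 2021},
    {"doi": "10.1000/abc123", "title": "Quantum error correction in practice.", "year": 2021},
    {"doi": None, "title": "Quantum Error-Correction in Practice", "year": 2021},
    {"doi": None, "title": "Neuromorphic Computing Survey", "year": 2021},
]
print(len(deduplicate(records)))  # 2
```

A production pipeline would also compare authors, venue, and citation patterns, and would block candidate pairs (for example by year and title prefix) rather than comparing every record with every other.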
Stage 3: Cross-referencing and Validation
Publication records are cross-referenced against authoritative identifier systems to improve accuracy:
Author disambiguation:
- ORCID matching where available (20%+ of authors)
- Institutional affiliation patterns
- Co-author networks
- Publication history consistency
- Name variation handling (e.g., "J. Smith" vs "Jane Smith" vs "J.A. Smith")
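As a rough illustration of name variation handling, the sketch below treats two name strings as compatible when the surnames match and each given-name token agrees either fully or by initial. The `parse_name` and `names_compatible` helpers are hypothetical, and the sketch assumes names are non-empty and written in "Given Surname" order. Compatibility alone is weak evidence, which is why signals such as ORCID, affiliation patterns, and co-author networks are also needed.

```python
def parse_name(name):
    """Split 'J.A. Smith' into (['j', 'a'], 'smith'); assumes 'Given Surname' order."""
    tokens = name.replace(".", " ").lower().split()
    return tokens[:-1], tokens[-1]


def names_compatible(a, b):
    """True if the two name strings could plausibly refer to the same person."""
    given_a, surname_a = parse_name(a)
    given_b, surname_b = parse_name(b)
    if surname_a != surname_b:
        return False
    # Compare given-name tokens pairwise; an initial matches any name starting with it.
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


print(names_compatible("J. Smith", "Jane Smith"))    # True
print(names_compatible("Jane Smith", "J.A. Smith"))  # True
print(names_compatible("John Smith", "Jane Smith"))  # False
```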
Institutional standardisation:
- ROR matching for ~100,000 research organisations
- Affiliation string parsing and normalisation
- Parent-child institutional relationships (e.g., department → university → country)
- Historical institution name changes and mergers
Citation validation:
- Reciprocal citation verification (if A cites B, B's record should list A as a citing work)
- Citation count reconciliation across sources
- Self-citation identification
- Temporal consistency checks (cited work must pre-date citing work)
Result: Clean, standardised records with accurate author attribution, institutional affiliations, and citation networks.
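Two of the citation checks listed above, temporal consistency and self-citation identification, can be sketched as simple rules applied to each citation edge. The record layout (`published`, `author_ids`) and the flag names are assumptions for illustration. Flagging rather than discarding is a deliberate choice in this sketch, since preprints can legitimately cite work that is formally published later.

```python
from datetime import date


def validate_citation(citing, cited):
    """Return a list of flags for one citation edge (citing -> cited)."""
    flags = []
    # The cited work should not be published after the work that cites it.
    if cited["published"] > citing["published"]:
        flags.append("temporal_inconsistency")
    # Any shared disambiguated author marks the edge as a self-citation.
    if set(citing["author_ids"]) & set(cited["author_ids"]):
        flags.append("self_citation")
    return flags


paper_a = {"published": date(2022, 5, 1), "author_ids": ["author-1", "author-2"]}
paper_b = {"published": date(2020, 3, 15), "author_ids": ["author-2", "author-3"]}
paper_c = {"published": date(2023, 1, 10), "author_ids": ["author-9"]}

print(validate_citation(paper_a, paper_b))  # ['self_citation']
print(validate_citation(paper_a, paper_c))  # ['temporal_inconsistency']
```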
Stage 4: Strategic Topic Classification
This is where Factbase adds unique value beyond standard bibliometric databases.
GINC National Capability Framework alignment:
Each publication is classified against GINC's proprietary topic taxonomy, which maps to strategic capability domains:
- Critical technologies (quantum computing, artificial intelligence, hypersonics, biotechnology, etc.)
- Defence domains (land, maritime, air, space, cyber, intelligence, nuclear)
- Emerging capabilities (synthetic biology, advanced materials, autonomous systems, etc.)
Classification methodology:
- Machine learning models trained on expert-curated topic definitions
- Full-text analysis where available (not just keywords)
- Citation network analysis (papers citing similar works likely share topics)
- Author expertise patterns
- Hierarchical taxonomy allowing multiple granularity levels
Result: Publications are scored and classified according to strategic relevance, not just academic discipline.
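The hierarchical taxonomy is what allows the same classified publications to be counted at different levels of granularity. The sketch below shows one way that roll-up could work, assuming each publication carries one or more topic paths from the top of the taxonomy down. The topic names and the `rollup` function are illustrative only and do not reproduce GINC's proprietary taxonomy.

```python
from collections import Counter


def rollup(assignments, level):
    """Count publications per taxonomy node at the given depth (1 = top level)."""
    counts = Counter()
    for pub_id, paths in assignments.items():
        # A publication counts once per node, even if several of its topics share it.
        nodes = {path[:level] for path in paths if len(path) >= level}
        counts.update(nodes)
    return counts


# Hypothetical topic paths for two publications.
assignments = {
    "pub-1": [
        ("Critical technologies", "Artificial intelligence", "Adversarial AI"),
        ("Critical technologies", "Quantum computing", "Quantum machine learning"),
    ],
    "pub-2": [
        ("Critical technologies", "Quantum computing", "Quantum machine learning"),
    ],
}

print(rollup(assignments, 1)[("Critical technologies",)])                      # 2
print(rollup(assignments, 2)[("Critical technologies", "Quantum computing")])  # 2
```

Counting each publication once per node avoids inflating a parent topic when a paper is tagged with several of its sub-topics.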
Research Data Quality Assurance
Coverage Quality
Disciplinary balance:
- Coverage monitored across disciplines to detect systematic gaps
- Benchmarked against known publication volumes (e.g., UNESCO science statistics)
- Validated against major disciplinary databases (e.g., PubMed for biomedicine)
Temporal completeness:
- Historical coverage verified against journal archives
- Recent publications validated against publisher feeds
- Citation accumulation patterns checked for consistency
Data Accuracy
Author attribution:
- Fractional credit calculations validated against author lists
- Institutional affiliations verified against ROR and ORCID records
- Multi-affiliation cases handled according to OECD methodology
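A worked example helps show what fractional credit means in practice. The sketch below assumes the simplest common convention: each paper carries one unit of credit, split equally among its authors, with an author's share split equally across the countries of their affiliations. This equal-split rule is an assumption for illustration, and the exact multi-affiliation handling used by Factbase may differ.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_credit(authors):
    """Split one paper's credit across countries.

    authors: one entry per author, each a list of that author's affiliation countries.
    """
    credit = defaultdict(Fraction)
    author_share = Fraction(1, len(authors))
    for countries in authors:
        for country in countries:
            # An author with several affiliations divides their share equally.
            credit[country] += author_share / len(countries)
    return dict(credit)


# Four authors; the second holds affiliations in two countries.
paper = [["AU"], ["AU", "US"], ["US"], ["GB"]]
for country, share in sorted(fractional_credit(paper).items()):
    print(country, share)
# AU 3/8
# GB 1/4
# US 3/8
```

Using `Fraction` keeps the arithmetic exact, so the shares for a paper always sum to exactly 1, which is the property the validation step checks against the author list.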
Citation accuracy:
- Cross-referenced against multiple citation sources
- Outlier citation counts flagged for manual review
- Self-citation rates monitored for anomalies
Metadata validation:
- Publication dates verified against publisher records
- Journal names standardised against ISSN registry
- Document types classified consistently
Update Frequency
Quarterly updates ensure Factbase reflects current research output:
- New publications added as indexed by source databases
- Citation counts updated to reflect latest citation activity
- Author and institutional metadata refreshed from ORCID and ROR
- Topic classifications updated based on full-text availability
Between quarterly updates:
- Critical corrections applied as needed
- High-priority topics may receive monthly updates
- User-reported issues investigated and resolved
What Makes Factbase Research Data Different
Beyond Standard Bibliometrics
Most research databases (Web of Science, Scopus, Google Scholar) provide:
- Publication counts
- Citation counts
- Basic disciplinary classification (e.g., "Computer Science", "Physics")
Factbase adds:
- Strategic topic granularity
  - Not just "Computer Science" but "Quantum Machine Learning", "Adversarial AI", "Neuromorphic Computing"
  - Topics aligned with national capability assessment, not academic departmental structure
- Multi-source integration
  - Combines coverage from multiple databases to maximise completeness
  - De-duplicates to ensure accurate counts
  - Cross-validates to improve accuracy
- Quality-weighted metrics
  - TMCM (Topic Median Citation Multiple) for field-normalised quality assessment
  - Fractional credit for fair international comparison
  - Excellence indicators (top 1%, top 10%) for breakthrough identification
- Capability-focused analysis
  - Research output mapped to strategic technology domains
  - Institutional and national capability profiles
  - Trend analysis across multiple time windows (3Y, 5Y, 10Y, 20Y)
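This article names TMCM and the excellence indicators without defining them, so the following is a sketch under an assumed reading: TMCM is taken to be a paper's citation count divided by the median citation count of papers in the same topic, and a paper is "top X%" if its citations reach the cut-off for that share of the topic. The function names and the sample data are hypothetical.

```python
from statistics import median


def tmcm(citations, topic_citations):
    """Citation count as a multiple of the topic's median citation count."""
    topic_median = median(topic_citations)
    if topic_median == 0:
        return None  # undefined when the typical paper in the topic is uncited
    return citations / topic_median


def in_top_share(citations, topic_citations, share):
    """True if the paper sits within the top `share` (e.g. 0.10) of its topic."""
    ranked = sorted(topic_citations, reverse=True)
    cutoff_index = max(1, int(len(ranked) * share)) - 1
    return citations >= ranked[cutoff_index]


# Hypothetical citation counts for every paper in one topic.
topic = [0, 1, 2, 3, 4, 5, 8, 12, 20, 40, 100]

print(tmcm(20, topic))                 # 4.0 (topic median is 5)
print(in_top_share(100, topic, 0.10))  # True
print(in_top_share(40, topic, 0.10))   # False
```

Normalising by the topic median rather than the mean makes the multiple less sensitive to a handful of very highly cited papers, which is one plausible reason to prefer a median-based measure for field-normalised comparison.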
Integration with Broader Factbase Intelligence
Research publication data represents one component of Factbase's comprehensive capability assessment framework:
Research intelligence (this article):
- Publication output and quality
- Citation networks and research impact
- Author expertise and institutional strength
- Research trajectory and momentum
Actor intelligence (documented separately):
- Key researchers and research leaders
- Institutional profiles and strategic positioning
- Funding flows and resource allocation
- Collaboration networks
Asset intelligence (documented separately):
- Research infrastructure and facilities
- Technology demonstrators and prototypes
- Patents and intellectual property
- Commercial applications
Together, these streams provide holistic capability assessment beyond what publication metrics alone can reveal.
Transparency and Reproducibility
Factbase maintains transparency in research data:
- Source attribution – We document which databases contribute to our coverage
- Methodology – Detailed documentation of de-duplication, classification, and metric calculation
- Version control – All analyses specify data version and calculation date
- Quality flags – Records with incomplete metadata or ambiguous classification are flagged
Research Data Limitations and Caveats
What Factbase Research Data Covers
✅ Academic journal articles – comprehensive global coverage
✅ Conference proceedings – major technical conferences, particularly in computer science and engineering
✅ Preprints – from major repositories (arXiv, bioRxiv, medRxiv, SSRN)
✅ Open access content – institutional and subject repositories
What Research Data Does Not Fully Capture
❌ Patents – Not included in publication counts (tracked separately in asset intelligence)
❌ Technical reports – Government and corporate technical reports often not indexed
❌ Books – Monographs and edited volumes have limited coverage compared to journals
❌ Non-English publications – Coverage skewed toward English-language research, particularly for pre-2000 publications
❌ Classified research – Defence and national security research that is not publicly disclosed
❌ Commercial R&D – Corporate research not published in academic venues
❌ Grey literature – Working papers, policy documents, and informal publications have variable coverage
Known Research Data Biases
Geographic bias:
- Higher coverage in countries with strong open access policies
- Western institutions better represented in older records
Linguistic bias:
- English-language publications over-represented
- Non-English journals less comprehensively indexed
Disciplinary bias:
- STEM fields (science, technology, engineering, mathematics) most complete
- Humanities and some social sciences less comprehensive
Recency bias:
- Digital-era publications (post-2000) more complete than historical records
- Very recent publications (<6 months) may have incomplete citation data
Factbase acknowledges these limitations and applies appropriate caveats in analysis.
Research Data Access and Usage
Who Uses Factbase Research Data
- Government agencies – National capability assessment and S&T policy
- Defence organisations – Technology threat assessment and opportunity identification
- Research institutions – Strategic planning and benchmarking
- Policy analysts – Evidence-based research policy development
Data Ethics and Privacy
Author privacy:
- Only publicly available publication data is used
- No personal contact information is collected or stored
- ORCID integration respects author-controlled public profiles
Institutional attribution:
- Based on author-declared affiliations in publications
- Multiple affiliations handled according to international standards
Responsible use:
- Data provided for research intelligence, not individual evaluation
- Metrics designed for aggregate (country/institution) assessment
- Not intended for hiring, promotion, or individual performance decisions
Summary
Factbase provides comprehensive, high-quality research publication and citation intelligence through:
- Broad coverage – 250+ million works from authoritative global sources
- Rigorous processing – De-duplication and cross-referencing for accuracy
- Strategic classification – Publications scored against GINC's national capability framework
- Quality metrics – TMCM and fractional credit for fair international comparison
- Regular updates – Quarterly refreshes keep intelligence current
Our value proposition in research data is not just collection, but transformation – converting raw bibliometric records into actionable strategic intelligence on research capability in critical technologies.
This research intelligence integrates with Factbase's broader capability assessment framework, which includes actor profiles, asset tracking, and strategic indicators to provide comprehensive national and institutional capability analysis.