The ER Oncology Dataset: Physician-Validated Records for AI Training
Curated and verified by our **Physician Review Team**. Designed for clinical decision models and health-tech engineering pipelines.
The Problem
"AI is only as good as its training data. But PubMed has 30 million articles."
AI healthcare startups face a massive problem: drowning in irrelevant, low-quality medical literature while missing the high-impact studies that matter.
Engineers waste weeks filtering noise. Models hallucinate on edge cases. Clinical decision support systems fail because they're trained on broad, nonspecific data.
The Solution
"The ER Oncology Dataset — 790 Records, 10 Quality Scores, Physician-Validated"
- 790 oncology records focused on ER complications: febrile neutropenia, spinal cord compression, cancer thrombosis, hypercalcemia, and cancer pain
- 10 quality scores applied to every record, so you can build custom subsets
- Each record annotated with **physician notes** written by our Physician Review Team
- Delivered as ready-to-ingest JSON and CSV for AI training, academic research, and regulatory submissions
How It Works
"Every Record Goes Through a Rigorous 3-Level Validation Process"
1. Collect candidate records
Raw records matching oncology emergency profiles are collected from ClinicalTrials.gov, PubMed, and OpenFDA.
2. Score against 10 rules
Every record is run through 10 hardcoded logic rules assessing study type, data completeness, evidence levels, and ER relevance.
3. Physician review: annotate, then approve or reject
Our physicians personally review and annotate every record, then approve or reject it on the clinician dashboard.
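As a rough illustration, the three stages above could be modeled as follows. This is a minimal sketch, not the actual pipeline: the `Record` fields, the function names, and the single `has_source` rule are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Record:
    """Hypothetical record shape; the real schema is not described here."""
    record_id: str
    source: str                       # e.g. "PubMed", "ClinicalTrials.gov", "OpenFDA"
    scores: Dict[str, object] = field(default_factory=dict)
    physician_note: Optional[str] = None
    approved: Optional[bool] = None   # stays None until a physician decides


def apply_rules(record: Record, rules: Dict[str, Callable[[Record], object]]) -> Record:
    """Stage 2: run every rule-based check and store its result on the record."""
    for name, rule in rules.items():
        record.scores[name] = rule(record)
    return record


def physician_review(record: Record, note: str, approve: bool) -> Record:
    """Stage 3: a physician annotates the record and approves or rejects it."""
    record.physician_note = note
    record.approved = approve
    return record


# Stage 1 (collection) is represented here by a record that was already fetched.
rec = Record(record_id="example-001", source="PubMed")
rec = apply_rules(rec, {"has_source": lambda r: bool(r.source)})  # placeholder rule
rec = physician_review(rec, note="Relevant to febrile neutropenia triage.", approve=True)
print(rec.approved, rec.scores)
```

Keeping `approved` as `None` until stage 3 makes it easy to tell records that were scored but not yet reviewed apart from records that were rejected.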
The 10 Quality Scores
"10 Scoring Rules That Filter the Noise"
| Rule | What It Measures |
|---|---|
| ER Applicability | Relevance to emergency medicine (0-10 score) |
| Guideline Alignment | Matches current ACLS/ATLS clinical guidelines |
| Statistical Integrity | Verifies adequate cohort size (n ≥ 30) and significance (p < 0.05) |
| Outcome Relevance | Measures patient-centered metrics vs clinical surrogate outcomes |
| Bias Detection | Flags conflicts of interest, industry sponsorship, and single-center limitations |
| Clinical Plausibility | Cross-references outliers against established emergency literature |
| Actionability | Categorizes care priorities as STAT, Routine, or N/A |
| Evidence Grade | Assigns grades A-F based on study design (A=Meta-analysis, F=Expert opinion) |
| Population Fit | Validates that patient age and acuity match typical ER demographics |
| Recency Weight | Weights recent publications (< 5 years old) more heavily |
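Two of the rules in the table state explicit thresholds, so they can be sketched directly. The function names are hypothetical, and the weight values 1.0 and 0.5 are assumptions for illustration only; the table says recent publications get a higher weight but does not say how much higher.

```python
def statistical_integrity(cohort_size: int, p_value: float) -> bool:
    """Pass when n >= 30 and p < 0.05, the thresholds given in the table."""
    return cohort_size >= 30 and p_value < 0.05


def recency_weight(publication_year: int, current_year: int) -> float:
    """Weight publications under 5 years old more heavily.

    The values 1.0 and 0.5 are assumed; only the 5-year cutoff comes from the table.
    """
    age_in_years = current_year - publication_year
    return 1.0 if age_in_years < 5 else 0.5


print(statistical_integrity(cohort_size=120, p_value=0.01))
print(statistical_integrity(cohort_size=12, p_value=0.01))
print(recency_weight(publication_year=2023, current_year=2026))
print(recency_weight(publication_year=2015, current_year=2026))
```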
Our Physician Review Team
"Active Board-Certified Clinicians"
- Each record is reviewed by a practicing ER physician with 30+ years of experience
- All physicians on our team maintain active clinical practice in high-acuity environments
- No names listed publicly to protect proprietary review workflows
Data Specifications & Pricing Tiers
"What You Get"
| Tier | Records | Price | Includes |
|---|---|---|---|
| Mini | 50 | Free | Physician notes, 10 quality scores, requires manual approval |
| Starter | 250 | $2,000 | Full dataset with annotations, dynamic filters |
| Growth | 500 | $3,500 | Extended dataset + CSV/JSON format downloads |
| Enterprise | 1,000+ | $10,000+ | Full API access, custom filtering, quarterly updates |
Frequently Asked Questions
Q: Who validates the data?
A: Our Physician Review Team personally reviews every record. The automated rule checks never substitute for physician review. Each physician has 30+ years of clinical experience.
Q: What sources are used?
A: ClinicalTrials.gov, PubMed, and OpenFDA. We're actively expanding to include global registries (EU Clinical Trials Register, WHO ICTRP) for complete global coverage.
Q: Can I filter by disease/condition area or evidence grade?
A: Yes, use our A La Carte filter tool during checkout to select exactly what you need.
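If you prefer to subset the JSON export yourself after download, the same kind of filter can be sketched in a few lines. The field names `condition` and `evidence_grade` and the three sample records are made-up placeholders, since the export schema is not documented here.

```python
import json

# Placeholder records for illustration only; not real dataset content.
SAMPLE = """[
  {"id": "r1", "condition": "febrile neutropenia", "evidence_grade": "A"},
  {"id": "r2", "condition": "hypercalcemia", "evidence_grade": "D"},
  {"id": "r3", "condition": "febrile neutropenia", "evidence_grade": "C"}
]"""


def filter_records(records, condition=None, min_grade=None):
    """Keep records matching a condition and graded at least min_grade (A is best)."""
    kept = []
    for rec in records:
        if condition is not None and rec["condition"] != condition:
            continue
        # Letter grades sort alphabetically, so a grade "greater" than
        # min_grade is a weaker grade and gets dropped.
        if min_grade is not None and rec["evidence_grade"] > min_grade:
            continue
        kept.append(rec)
    return kept


records = json.loads(SAMPLE)
selected = filter_records(records, condition="febrile neutropenia", min_grade="B")
print([r["id"] for r in selected])  # ['r1']
```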
Q: Is this data suitable for AI training?
A: Yes. We hear the disclaimer "AI makes mistakes" everywhere. Our datasets are designed to reduce those mistakes by providing physician-verified, high-quality training data with noise and low-evidence studies filtered out.
Q: What format is the data in?
A: CSV and JSON, ready for any AI pipeline. We also offer UDS (Universal Document) format for customers requiring cryptographic verification.
Q: How many records are currently available?
A: The ER Oncology Dataset currently contains 790 records. Across multiple ER specialties we hold thousands of records, and that number grows weekly.
Q: Do you provide updates?
A: Enterprise customers (defined as any dataset purchase of $3,500+ or custom volume) receive quarterly updates and priority support.
Q: Why is pre-September 15th important?
A: Major internet platforms are implementing new restrictions on AI crawlers starting September 15, 2026. Our pre-September dataset captures the last comprehensive snapshot of public medical literature. After this date, new records will be significantly harder to obtain.
Q: How do I know the data is accurate?
A: Every record is reviewed by a practicing ER physician. We apply 10 quality scores to filter noise. And we cryptographically seal each dataset so you can verify its integrity.
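One common way to check the integrity of a downloaded file is to compare its SHA-256 digest against a published value. The sketch below assumes that kind of digest comparison; the actual sealing scheme is not specified here, and `verify_dataset` and the sample payload are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_dataset(data: bytes, expected_digest: str) -> bool:
    """Check downloaded bytes against a published digest (assumed to be SHA-256 hex)."""
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(sha256_hex(data), expected_digest.lower())


# Placeholder payload standing in for the bytes of a downloaded dataset file.
payload = b'[{"id": "r1"}]'
published = sha256_hex(payload)  # in practice this value would come from the vendor

print(verify_dataset(payload, published))          # True
print(verify_dataset(payload + b" ", published))   # False
```

Any change to the file, even a single added byte, produces a different digest and fails the check.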
Q: Can I see a sample before buying?
A: Yes. Request a free 50-record Mini dataset. We'll send it within 24 hours.
Q: Do I need a license agreement?
A: Yes. A standard research and AI training license applies. Commercial resale is not permitted.
Q: How does this compare to raw PubMed data?
A: Raw PubMed is uncurated noise. We've applied physician judgment to filter, grade, and annotate every record. You get quality, not volume.
Ready to Train Your AI on Physician-Validated Data?
Deploy models that reflect true emergency department clinical decision-making.