The ER Oncology Dataset: Physician-Validated Records for AI Training

Curated and verified by our **Physician Review Team**. Designed for clinical decision models and health-tech engineering pipelines.

The Problem

"AI is only as good as its training data. But PubMed has 30 million articles."

AI healthcare startups face a recurring problem: they drown in irrelevant, low-quality medical literature while missing the high-impact studies that matter.

Engineers waste weeks filtering noise. Models hallucinate on edge cases. Clinical decision support systems fail because they're trained on broad, nonspecific data.

The Solution

"The ER Oncology Dataset — 790 Records, 10 Quality Scores, Physician-Validated"

- 790 oncology records focused on ER complications: febrile neutropenia, spinal cord compression, cancer-associated thrombosis, hypercalcemia, and cancer pain
- 10 quality scores applied to every record, so you can build custom subsets
- Each record annotated with **physician notes** written by our Physician Review Team
- Delivered as ready-to-ingest JSON and CSV for AI training, academic research, and regulatory submissions

How It Works

"Every Record Goes Through a Rigorous 3-Level Validation Process"

👁️

1. Collect candidate records

Raw records matching oncology emergency profiles are collected from ClinicalTrials.gov, PubMed, and OpenFDA.

✍️

2. Score against 10 rules

Every record is run through 10 hardcoded logic rules that assess study type, data completeness, evidence level, and ER relevance.

✅

3. Physician review: annotate, then approve or reject

Our physicians personally review, annotate, and approve or reject every record on the clinician dashboard.

The 10 Quality Scores

"10 Scoring Rules That Filter the Noise"

Rule	What It Measures
ER Applicability	Relevance to emergency medicine (0-10 score)
Guideline Alignment	Matches current ACLS/ATLS clinical guidelines
Statistical Integrity	Verifies adequate cohort size (n ≥ 30) and significance (p < 0.05)
Outcome Relevance	Measures patient-centered metrics vs clinical surrogate outcomes
Bias Detection	Flags conflicts of interest, industry sponsorship, and single-center limits
Clinical Plausibility	Cross-references outliers against established emergency literature
Actionability	Categorizes care priorities as STAT, Routine, or N/A
Evidence Grade	Assigns grades A-F based on study design (A=Meta-analysis, F=Expert opinion)
Population Fit	Validates age and acuity matches typical ER demographics
Recency Weight	Gives higher weights to modern publications (< 5 years old)
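
To make two of these rules concrete, here is a minimal Python sketch of how Statistical Integrity and Recency Weight could be expressed. The thresholds (n ≥ 30, p < 0.05, under 5 years old) come from the table above. The `StudyRecord` fields, the function names, and the 1.0 / 0.5 weight values are hypothetical, chosen only for illustration, and are not the dataset's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class StudyRecord:
    # Hypothetical fields; the real dataset schema may differ.
    cohort_size: int
    p_value: float
    publication_year: int


def passes_statistical_integrity(record: StudyRecord) -> bool:
    """Apply the table thresholds: cohort size n >= 30 and p < 0.05."""
    return record.cohort_size >= 30 and record.p_value < 0.05


def recency_weight(record: StudyRecord, current_year: int) -> float:
    """Weight publications under 5 years old more heavily.

    The 1.0 and 0.5 values are illustrative assumptions, not published weights.
    """
    age_in_years = current_year - record.publication_year
    return 1.0 if age_in_years < 5 else 0.5


example = StudyRecord(cohort_size=120, p_value=0.01, publication_year=2023)
print(passes_statistical_integrity(example))
print(recency_weight(example, current_year=2025))
```

A record that fails either threshold is flagged by the first function, while the second returns a multiplier that a downstream ranking step could apply.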

Our Physician Review Team

"Active Board-Certified Clinicians"

- Each record is reviewed by a practicing ER physician with 30+ years of experience
- All physicians on our team maintain active clinical practice in high-acuity environments
- No names listed publicly to protect proprietary review workflows

View Sonny Saggar, MD Publications on SSRN ↗

Data Specifications & Pricing Tiers

"What You Get"

Tier	Records	Price	Includes
Mini	50	Free	Physician notes, 10 quality scores; requires manual approval
Starter	250	$2,000	Full dataset with annotations, dynamic filters
Growth	500	$3,500	Extended dataset + CSV/JSON format downloads
Enterprise	1,000+	$10,000+	Full API access, custom filtering, quarterly updates

Frequently Asked Questions

Q: Who validates the data?

A: Our physician review team personally reviews every record. No automated validation shortcut is used. Each physician has 30+ years of clinical experience.

Q: What sources are used?

A: ClinicalTrials.gov, PubMed, and OpenFDA. We're actively expanding to include global registries (EU Clinical Trials Register, WHO ICTRP) for complete global coverage.

Q: Can I filter by disease/condition area or evidence grade?

A: Yes. Use our A La Carte filter tool during checkout to select exactly what you need.
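
Once a dataset is downloaded, similar filtering can also be done in code. The sketch below is a minimal Python example, assuming each record carries a `condition` field and an `evidence_grade` letter from A to F. Those field names, the sample records, and the `filter_records` helper are hypothetical and do not describe the filter tool itself.

```python
# Hypothetical records; field names and values are for illustration only.
records = [
    {"id": "r1", "condition": "febrile neutropenia", "evidence_grade": "A"},
    {"id": "r2", "condition": "hypercalcemia", "evidence_grade": "D"},
    {"id": "r3", "condition": "febrile neutropenia", "evidence_grade": "C"},
]


def filter_records(records, condition=None, min_grade="F"):
    """Keep records for a condition whose grade is min_grade or better (A is best)."""
    kept = []
    for rec in records:
        if condition is not None and rec["condition"] != condition:
            continue
        # Single uppercase letters compare alphabetically, so "A" <= "B" is True.
        if rec["evidence_grade"] <= min_grade:
            kept.append(rec)
    return kept


subset = filter_records(records, condition="febrile neutropenia", min_grade="B")
print([rec["id"] for rec in subset])
```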

Q: Is this data suitable for AI training?

A: Yes. We hear the disclaimer "AI makes mistakes" everywhere. Our datasets are designed to reduce that – by providing physician-verified, high-quality training data that filters out noise and low-evidence studies.

Q: What format is the data in?

A: CSV and JSON, ready for any AI pipeline. We also offer UDS (Universal Document) format for customers requiring cryptographic verification.
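
As a rough illustration of loading either format with the Python standard library, the sketch below parses a one-record CSV and the equivalent JSON. The inline strings stand in for downloaded files, and the column names (`id`, `condition`, `evidence_grade`) are assumptions rather than the dataset's documented schema.

```python
import csv
import io
import json

# Inline stand-ins for downloaded files; the columns shown are hypothetical.
csv_text = "id,condition,evidence_grade\nr1,hypercalcemia,B\n"
json_text = '[{"id": "r1", "condition": "hypercalcemia", "evidence_grade": "B"}]'

# csv.DictReader yields one dictionary per row, keyed by the header line.
csv_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
json_rows = json.loads(json_text)

# With string-only fields, both formats produce the same list of dictionaries.
print(csv_rows == json_rows)
```

Note that CSV values always load as strings, so numeric fields would need explicit conversion before they match their JSON counterparts.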

Q: How many records are currently available?

A: Thousands of records across multiple ER specialties, growing weekly. Our pre-September dataset captures the last comprehensive snapshot of public medical literature before AI crawlers are restricted.

Q: Do you provide updates?

A: Enterprise customers (defined as any dataset purchase of $3,500+ or custom volume) receive quarterly updates and priority support.

Q: Why is pre-September 15th important?

A: Major internet platforms are implementing new restrictions on AI crawlers starting September 15, 2026. Our pre-September dataset captures the last comprehensive snapshot of public medical literature. After this date, new records will be significantly harder to obtain.

Q: How do I know the data is accurate?

A: Every record is reviewed by a practicing ER physician. We apply 10 quality scores to filter noise. And we cryptographically seal each dataset so you can verify its integrity.
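
The sealing mechanism is not described here. As one possible approach, the sketch below assumes the seal is a SHA-256 digest supplied alongside the dataset file and shows how such a digest could be checked in Python. The helper names and sample bytes are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_dataset(data: bytes, expected_digest: str) -> bool:
    """Compare the computed digest with the expected one in constant time."""
    return hmac.compare_digest(sha256_hex(data), expected_digest.lower())


# Stand-in bytes for a downloaded dataset file (assumption for illustration).
dataset_bytes = b'[{"id": "r1"}]'
# Stand-in for a digest that would be supplied with the dataset.
published_digest = sha256_hex(dataset_bytes)

print(verify_dataset(dataset_bytes, published_digest))
print(verify_dataset(dataset_bytes + b" ", published_digest))
```

Any change to the file, even a single added byte, produces a different digest, so the second check fails.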

Q: Can I see a sample before buying?

A: Yes. Request a free 50-record Mini dataset. We'll send it within 24 hours.

Q: Do I need a license agreement?

A: Yes. A standard research and AI training license applies, and commercial resale is not permitted.

Q: How does this compare to raw PubMed data?

A: Raw PubMed is broad and uncurated. We've applied physician judgment to filter, grade, and annotate every record. You get quality, not volume.

Ready to Train Your AI on Physician-Validated Data?

Deploy models that reflect true emergency department clinical decision-making.
