OpenDiabetic Dataset

Medical datasets for privacy-first diabetic AI agents.

A local-first catalog, dedupe, review, and compute-grant workflow for developers building diabetic lifestyle agents, medical document organizers, safety evaluators, and non-diagnostic education tools.

1.21M dedicated medical records after exact-file dedupe, before semantic dedupe
791k exact duplicate medical records identified on the remote worker
96GB RTX PRO 6000 Blackwell compute grant tier
0 raw data harvesting as a business model

Medical safety boundary: These datasets are for research, training/evaluation infrastructure, lifestyle support agents, and non-diagnostic education tooling. They are not approved for diagnosis, medication dosing, or emergency triage.

Dataset Doctrine

Local-first

Raw datasets stay on NAS/worker infrastructure unless an explicit access pack authorizes otherwise.

Exact dedupe first

SHA-256 hash groups identify exact duplicate files before developers train or copy data.
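
As a minimal sketch of how exact-file hash groups can be built, the Python below hashes every file under a directory and keeps only the hashes shared by two or more files. The function names and the directory walk are illustrative assumptions, not the OpenDiabetic tooling itself.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large JSONL files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(root: Path) -> dict[str, list[Path]]:
    """Return {sha256: [paths]} for every hash shared by two or more files."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    # Keep only real duplicate groups; unique files are dropped.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

This catches byte-identical files only. Records that differ in whitespace or field order hash differently, which is why semantic dedupe is a separate, later step.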

Review before training

Every dataset needs source, license, PHI/PII, allowed-use, and safety review.
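
One way to make this rule checkable is a small review record with one flag per required review. The sketch below is an assumption about how such a record could look; `DatasetReview` and its field names are hypothetical and only mirror the five reviews listed above.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetReview:
    """Hypothetical review record: one sign-off flag per required review."""
    dataset_id: str
    source: bool = False
    license: bool = False
    phi_pii: bool = False
    allowed_use: bool = False
    safety: bool = False


def pending_reviews(review: DatasetReview) -> list[str]:
    """List the review steps that have not been signed off yet."""
    return [
        f.name
        for f in fields(review)
        if f.name != "dataset_id" and not getattr(review, f.name)
    ]


def cleared_for_training(review: DatasetReview) -> bool:
    """A dataset is cleared only when no review step is pending."""
    return not pending_reviews(review)
```

Every flag defaults to `False`, so a dataset with no recorded review is blocked rather than silently allowed.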

Receipts

Every training run should name dataset IDs, hashes, split plan, evals, and output path.
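
A receipt can be as simple as a JSON document written next to the run output. The sketch below assumes a hypothetical `TrainingReceipt` shape covering the five items named above; the field names and validation rules are illustrative, not a published OpenDiabetic schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingReceipt:
    """Hypothetical receipt: what a run used and where its output went."""
    dataset_ids: list[str]
    dataset_hashes: dict[str, str]  # dataset ID -> SHA-256 of the exact file used
    split_plan: dict[str, float]    # split name -> fraction of records
    evals: list[str]
    output_path: str

    def validate(self) -> None:
        # Every named dataset must have a recorded hash.
        missing = [d for d in self.dataset_ids if d not in self.dataset_hashes]
        if missing:
            raise ValueError(f"no hash recorded for: {missing}")
        # Split fractions are assumed to cover the whole dataset.
        if abs(sum(self.split_plan.values()) - 1.0) > 1e-9:
            raise ValueError("split fractions must sum to 1")

    def to_json(self) -> str:
        self.validate()
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Recording the hash alongside the dataset ID ties the receipt to the exact bytes that were trained on, not just to a name that may later point at a different file.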

Catalog Highlights

Dataset | Records | Status | Use
MASTER_GOLD Medical JSONL | 385,626 | Canonical; exact duplicate found on remote /data2 | Medical instruction/eval research after review
MASTER_PLATINUM Medical JSONL | 406,181 | Canonical; exact duplicate found on remote /data2 | Higher-quality medical instruction/eval research after review
Medical Tribunal Ready | 417,136 | Remote restricted review | Medical safety/judge/eval research after review
Medical Deeds and Pairs | 3,827 | Remote restricted review | Small reasoning and receipt workflow examples

Developer Workflow

  1. Request dataset IDs and intended use.
  2. OpenDiabetic verifies source, license, PHI/PII risk, and blocked uses.
  3. Approved datasets get an access pack with dataset card, manifest, split plan, safety boundaries, and receipt.
  4. Training jobs run on approved compute with no raw data egress.
  5. Models must pass safety evals: no diagnosis, no dosing, no emergency triage, no PHI leakage.

    npm run datasets -- list
    npm run datasets -- profile-records /mnt/swarm/swarm-and-bee-datasets/medical --out /mnt/swarm/opendiabetic-datasets/00_AUDIT_SUMMARIES/records-medical
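
Step 5 of the workflow can be read as a hard gate: a model passes only if every blocked behavior was evaluated and produced zero violations. The sketch below assumes a hypothetical input format (a violation count per behavior, produced by whatever eval harness is in use); it does not describe how the evals themselves detect violations.

```python
# Blocked behaviors taken from workflow step 5; the key names are illustrative.
BLOCKED_BEHAVIORS = ("diagnosis", "dosing", "emergency_triage", "phi_leakage")


def safety_gate(violations: dict[str, int]) -> tuple[bool, list[str]]:
    """Pass only if every blocked behavior was evaluated with zero violations."""
    failures = []
    for behavior in BLOCKED_BEHAVIORS:
        count = violations.get(behavior)
        if count is None:
            # A missing eval is treated as a failure, not as a pass.
            failures.append(f"{behavior}: not evaluated")
        elif count > 0:
            failures.append(f"{behavior}: {count} violation(s)")
    return (not failures, failures)
```

Treating an unevaluated behavior as a failure keeps a skipped eval from being mistaken for a clean result.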

Compute Grants

OpenDiabetic will support approved builders who need compute for privacy-first diabetic AI tooling.

RTX PRO 6000 96GB

For large fine-tunes, long-context evals, model merging, safety judges, and synthetic/de-identified generation workflows.

RTX 5090 32GB

For LoRA, adapters, smaller eval jobs, dataset profiling, RAG experiments, and agent prototypes.

Grant Requirements

Dataset IDs, run manifest, GPU-hour estimate, no-egress commitment, model card, eval plan, and safety boundary.

MONAI-Inspired Discipline

MONAI’s medical AI ecosystem shows the value of reproducible bundles, transforms as code, train/eval separation, metrics, deployment boundaries, and community tooling. OpenDiabetic applies that discipline to diabetic lifestyle agents and medical text/data workflows, and makes no clinical deployment claims.

MONAI reference

Contact

Developers and researchers can request dataset access or compute grants at [email protected].