OpenDiabetic Dataset — Medical AI Data and Compute Grants

Medical safety boundary: These datasets are for research, training/evaluation infrastructure, lifestyle support agents, and non-diagnostic education tooling. They are not approved for diagnosis, medication dosing, or emergency triage.

Dataset Doctrine

Local-first

Raw datasets stay on NAS/worker infrastructure unless an explicit access pack authorizes otherwise.

Exact dedupe first

SHA-256 hash groups identify exact duplicate files before developers train or copy data.
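
As a minimal sketch of what exact dedupe by hash group could look like, the Python below hashes every file under a directory and keeps only the hashes shared by two or more files. The function names `sha256_of_file` and `exact_duplicate_groups` are hypothetical and are not part of the OpenDiabetic tooling; the sketch only shows the general technique.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large JSONL files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(root: Path) -> dict[str, list[Path]]:
    """Return {sha256: [paths]} for every hash shared by two or more files under root."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    # Keep only hashes that occur more than once: these are exact byte-level duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

This catches byte-identical files only. Two files holding the same records in a different order or with different whitespace hash differently and would need a separate near-duplicate pass.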

Review before training

Every dataset needs source, license, PHI/PII, allowed-use, and safety review.

Receipts

Every training run should name dataset IDs, hashes, split plan, evals, and output path.
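
One way to picture such a receipt is as a small record that is validated before it is written out. The sketch below is an assumption, not the project's actual schema: the class name `TrainingReceipt`, its field names, and the validation rules are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingReceipt:
    """Hypothetical receipt covering the five items a training run should name."""
    dataset_ids: list[str]
    dataset_sha256: dict[str, str]  # dataset ID -> SHA-256 hex digest
    split_plan: dict[str, float]    # split name -> fraction of records
    evals: list[str]
    output_path: str

    def validate(self) -> None:
        # Every named dataset must have a recorded hash.
        missing = [d for d in self.dataset_ids if d not in self.dataset_sha256]
        if missing:
            raise ValueError(f"no hash recorded for: {missing}")
        # Assumed rule: split fractions cover the whole dataset.
        if abs(sum(self.split_plan.values()) - 1.0) > 1e-9:
            raise ValueError("split fractions must sum to 1.0")
        if not self.evals:
            raise ValueError("at least one eval must be named")

    def to_json(self) -> str:
        """Validate, then serialize with stable key order so receipts diff cleanly."""
        self.validate()
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Calling `to_json()` on a receipt with a missing hash or an empty eval list raises `ValueError`, so an incomplete receipt is never written.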

Catalog Highlights

Dataset	Records	Status	Use
MASTER_GOLD Medical JSONL	385,626	Canonical; exact duplicate found on remote /data2	Medical instruction/eval research after review
MASTER_PLATINUM Medical JSONL	406,181	Canonical; exact duplicate found on remote /data2	Higher-quality medical instruction/eval research after review
Medical Tribunal Ready	417,136	Remote restricted review	Medical safety/judge/eval research after review
Medical Deeds and Pairs	3,827	Remote restricted review	Small reasoning and receipt workflow examples

Developer Workflow

Developers request datasets by ID and state the intended use.
OpenDiabetic verifies source, license, PHI/PII risk, and blocked uses.
Approved datasets get an access pack with dataset card, manifest, split plan, safety boundaries, and receipt.
Training jobs run on approved compute with no raw data egress.
Models must pass safety evals: no diagnosis, no dosing, no emergency triage, no PHI leakage.
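
The final step can be read as a gate over four safety evals. The sketch below assumes each eval reports a violation count and that any nonzero count fails the model. The eval names, the result format, and the zero-tolerance threshold are all hypothetical and do not describe OpenDiabetic's actual eval harness.

```python
# Hypothetical identifiers for the four safety evals named in the workflow.
REQUIRED_SAFETY_EVALS = (
    "no_diagnosis",
    "no_dosing",
    "no_emergency_triage",
    "no_phi_leakage",
)


def safety_gate(violations: dict[str, int]) -> tuple[bool, list[str]]:
    """Return (passed, reasons).

    A model passes only if every required eval was run and reported
    zero violations (assumed threshold).
    """
    reasons: list[str] = []
    for name in REQUIRED_SAFETY_EVALS:
        if name not in violations:
            reasons.append(f"{name}: eval not run")
        elif violations[name] > 0:
            reasons.append(f"{name}: {violations[name]} violation(s)")
    return (not reasons, reasons)
```

Treating a missing eval as a failure, rather than skipping it, means a model cannot pass by omitting one of the four checks.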

npm run datasets -- list
npm run datasets -- profile-records /mnt/swarm/swarm-and-bee-datasets/medical --out /mnt/swarm/opendiabetic-datasets/00_AUDIT_SUMMARIES/records-medical

Compute Grants

OpenDiabetic will support approved builders who need compute for privacy-first diabetic AI tooling.

RTX PRO 6000 96GB

For large fine-tunes, long-context evals, model merging, safety judges, and synthetic/de-identified generation workflows.

RTX 5090 32GB

For LoRA, adapters, smaller eval jobs, dataset profiling, RAG experiments, and agent prototypes.

Grant Requirements

Each grant request must include dataset IDs, a run manifest, a GPU-hour estimate, a no-egress commitment, a model card, an eval plan, and a safety boundary.

MONAI-Inspired Discipline

MONAI’s medical AI ecosystem shows the value of reproducible bundles, transforms as code, train/eval separation, metrics, deployment boundaries, and community tooling. OpenDiabetic applies that discipline to diabetic lifestyle agents and medical text/data workflows, and makes no clinical deployment claims.


Contact

Developers and researchers can request dataset access or compute grants at [email protected].