Dataset Doctrine
Local-first
Raw datasets stay on NAS/worker infrastructure unless an explicit access pack authorizes otherwise.
Exact dedupe first
SHA-256 hash groups identify exact duplicate files before developers train or copy data.
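As a rough illustration of what a hash group is, the sketch below hashes every file under a directory and keeps only the digests shared by two or more files. It is not the project's own tooling: the function names `sha256_of_file` and `exact_duplicate_groups` are hypothetical, and it assumes the files are readable from a local mount.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicate_groups(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by digest and keep groups with 2+ files."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each key in the returned mapping is one hash group. Byte-identical files always land in the same group, while near-duplicates (for example, the same records with different whitespace) do not, which is why this step only covers exact dedupe.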
Review before training
Every dataset needs source, license, PHI/PII, allowed-use, and safety review.
Receipts
Every training run should name dataset IDs, hashes, split plan, evals, and output path.
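One way to picture such a receipt is as a small, serializable record. The sketch below is an assumption, not the project's schema: the class name `TrainingReceipt`, its field names, and the choice to express the split plan as fractions are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingReceipt:
    """Hypothetical receipt record; field names are illustrative only."""

    run_id: str
    dataset_ids: List[str]
    dataset_hashes: Dict[str, str]  # dataset ID -> SHA-256 hex digest
    split_plan: Dict[str, float]    # split name -> fraction of records (assumed)
    evals: List[str]
    output_path: str

    def validate(self) -> None:
        """Raise ValueError if a dataset lacks a hash or the splits do not sum to 1."""
        missing = [d for d in self.dataset_ids if d not in self.dataset_hashes]
        if missing:
            raise ValueError(f"no hash recorded for: {missing}")
        if abs(sum(self.split_plan.values()) - 1.0) > 1e-9:
            raise ValueError("split fractions must sum to 1.0")

    def to_json(self) -> str:
        """Validate, then serialize with stable key order for easy diffing."""
        self.validate()
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Writing the JSON next to the run's outputs keeps the dataset IDs, hashes, split plan, evals, and output path together in one reviewable file.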
Catalog Highlights
| Dataset | Records | Status | Use |
|---|---|---|---|
| MASTER_GOLD Medical JSONL | 385,626 | Canonical; exact duplicate found on remote /data2 | Medical instruction/eval research after review |
| MASTER_PLATINUM Medical JSONL | 406,181 | Canonical; exact duplicate found on remote /data2 | Higher-quality medical instruction/eval research after review |
| Medical Tribunal Ready | 417,136 | Remote restricted review | Medical safety/judge/eval research after review |
| Medical Deeds and Pairs | 3,827 | Remote restricted review | Small reasoning and receipt workflow examples |
Developer Workflow
- Developers request datasets by ID and state their intended use.
- OpenDiabetic verifies source, license, PHI/PII risk, and blocked uses.
- Approved datasets get an access pack with dataset card, manifest, split plan, safety boundaries, and receipt.
- Training jobs run on approved compute with no raw data egress.
- Models must pass safety evals: no diagnosis, no dosing, no emergency triage, no PHI leakage.
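The final step of the workflow above can be sketched as a simple gate over eval results. This is a minimal illustration, assuming each safety eval reports a boolean pass/fail; the key names and the functions `failed_safety_evals` and `model_is_releasable` are hypothetical, not part of any OpenDiabetic tool.

```python
from typing import Dict, List

# The four boundaries named in the workflow; the key spellings are assumptions.
REQUIRED_SAFETY_EVALS = (
    "no_diagnosis",
    "no_dosing",
    "no_emergency_triage",
    "no_phi_leakage",
)


def failed_safety_evals(results: Dict[str, bool]) -> List[str]:
    """Return required evals that are missing or did not pass."""
    return [name for name in REQUIRED_SAFETY_EVALS if results.get(name) is not True]


def model_is_releasable(results: Dict[str, bool]) -> bool:
    """A model passes the gate only if every required eval passed."""
    return not failed_safety_evals(results)
```

Treating a missing result as a failure means a model cannot pass simply because an eval was never run.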
npm run datasets -- list
npm run datasets -- profile-records /mnt/swarm/swarm-and-bee-datasets/medical --out /mnt/swarm/opendiabetic-datasets/00_AUDIT_SUMMARIES/records-medical
Compute Grants
OpenDiabetic will support approved builders who need compute for privacy-first diabetic AI tooling.
RTX PRO 6000 96GB
For large fine-tunes, long-context evals, model merging, safety judges, and synthetic/de-identified generation workflows.
RTX 5090 32GB
For LoRA, adapters, smaller eval jobs, dataset profiling, RAG experiments, and agent prototypes.
Grant Requirements
Each grant request must include dataset IDs, a run manifest, a GPU-hour estimate, a no-egress commitment, a model card, an eval plan, and a safety boundary.
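A requester could pre-check a draft against that list before sending it. The sketch below assumes the request is drafted as a plain mapping; the field names and the function `grant_request_problems` are hypothetical and do not reflect an official submission format.

```python
from typing import Any, List, Mapping

# One key per listed requirement; the key spellings are assumptions.
REQUIRED_GRANT_FIELDS = (
    "dataset_ids",
    "run_manifest",
    "gpu_hour_estimate",
    "no_egress_commitment",
    "model_card",
    "eval_plan",
    "safety_boundary",
)


def grant_request_problems(request: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the draft is complete."""
    problems = [
        f"missing: {name}"
        for name in REQUIRED_GRANT_FIELDS
        if request.get(name) in (None, "", [], {})
    ]
    # A present-but-false commitment is not a commitment.
    if request.get("no_egress_commitment") is False:
        problems.append("no_egress_commitment must be true")
    hours = request.get("gpu_hour_estimate")
    if isinstance(hours, (int, float)) and not isinstance(hours, bool) and hours <= 0:
        problems.append("gpu_hour_estimate must be positive")
    return problems
```

An empty result only says the draft is complete; it does not imply the request will be approved.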
MONAI-Inspired Discipline
MONAI’s medical AI ecosystem shows the value of reproducible bundles, transforms as code, train/eval separation, metrics, deployment boundaries, and community tooling. OpenDiabetic applies that discipline to diabetic lifestyle agents and medical text/data workflows, and makes no clinical deployment claims.
Contact
Developers and researchers can request dataset access or compute grants at [email protected].