finetuning

microsoft/azure-skills

Fine-tune models on Azure AI Foundry using SFT (supervised), DPO (preference), or RFT (reinforcement with graders). Covers dataset preparation, training job submission, deployment, and evaluation. USE FOR: fine-tune, SFT, DPO, RFT, training data, grader, distillation, fine-tuned model, training job, large file upload, calibrate grader, deploy fine-tuned model, evaluate fine-tuned model. DO NOT USE FOR: general model deployment without fine-tuning (use deploy-model), agent creation (use agents), prompt optimization without training (use prompt-optimizer).

Install

$ npx skills add https://github.com/microsoft/azure-skills --skill microsoft-foundry

SKILL.md

The instructions this skill teaches your agent.

---
name: finetuning
description: "Fine-tune models on Azure AI Foundry using SFT (supervised), DPO (preference), or RFT (reinforcement with graders). Covers dataset preparation, training job submission, deployment, and evaluation. USE FOR: fine-tune, SFT, DPO, RFT, training data, grader, distillation, fine-tuned model, training job, large file upload, calibrate grader, deploy fine-tuned model, evaluate fine-tuned model. DO NOT USE FOR: general model deployment without fine-tuning (use deploy-model), agent creation (use agents), prompt optimization without training (use prompt-optimizer)."
license: MIT
metadata:
  author: Microsoft
  version: "0.0.0-placeholder"
---

# Fine-Tuning on Azure AI Foundry

Fine-tune models using SFT (supervised), DPO (preference), or RFT (reinforcement with graders). Covers dataset prep, training, deployment, and evaluation.

## When to Use

Use this sub-skill when the user asks about:
- Fine-tuning a model (SFT, DPO, or RFT)
- Preparing, validating, or formatting training data
- Submitting, monitoring, or diagnosing training jobs
- Calibrating graders or pass thresholds for RFT
- Deploying or evaluating a fine-tuned model
- Choosing between training types (SFT vs DPO vs RFT)
- Distillation, synthetic data generation, or dataset quality scoring
- Large file uploads for training data
- Cleaning up fine-tuning resources (files, deployments)

**Do NOT use for:** General model deployment without fine-tuning (use deploy-model), agent creation (use agents), prompt optimization without training (use prompt-optimizer).

## Workflows

| Stage | Guide |
|-------|-------|
| **Quick start** | [workflows/quickstart.md](workflows/quickstart.md) |
| **Full pipeline** | [workflows/full-pipeline.md](workflows/full-pipeline.md) |
| **Create data** | [workflows/dataset-creation.md](workflows/dataset-creation.md) |
| **Iterate** | [workflows/iterative-training.md](workflows/iterative-training.md) |
| **Diagnose** | [workflows/diagnose-poor-results.md](workflows/diagnose-poor-results.md) |

## References

| Topic | File |
|-------|------|
| SFT vs DPO vs RFT | [references/training-types.md](references/training-types.md) |
| Hyperparameters | [references/hyperparameters.md](references/hyperparameters.md) |
| Data formats | [references/dataset-formats.md](references/dataset-formats.md) |
| Grader design (RFT) | [references/grader-design.md](references/grader-design.md) |
| Reward hacking | [references/reward-hacking.md](references/reward-hacking.md) |
| Agentic RFT (tools) | [references/agentic-rft.md](references/agentic-rft.md) |
| Deployment | [references/deployment.md](references/deployment.md) |
| Training curves | [references/training-curves.md](references/training-curves.md) |
| Evaluation | [references/evaluation.md](references/evaluation.md) |
| Vision fine-tuning | [references/vision-fine-tuning.md](references/vision-fine-tuning.md) |
| Large file uploads | [references/large-file-uploads.md](references/large-file-uploads.md) |
| Platform gotchas | [references/platform-gotchas.md](references/platform-gotchas.md) |

## Scripts

| Script | Purpose |
|--------|---------|
| `scripts/submit_training.py` | Submit SFT/DPO/RFT jobs |
| `scripts/monitor_training.py` | Poll job until completion |
| `scripts/calibrate_grader.py` | Find optimal RFT pass_threshold |
| `scripts/check_training.py` | Analyze curves, list checkpoints |
| `scripts/deploy_model.py` | Deploy via ARM REST API |
| `scripts/evaluate_model.py` | LLM judge evaluation |
| `scripts/convert_dataset.py` | Convert between SFT/DPO/RFT formats |
| `scripts/generate_distillation_data.py` | Generate synthetic training data |
| `scripts/score_dataset.py` | Quality scoring on training data |
| `scripts/cleanup.py` | Delete old files and deployments |
| `scripts/validate/` | Data validators (SFT, DPO, RFT) + stats |
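The validators under `scripts/validate/` are not reproduced here. As a rough illustration of the kind of check an SFT validator performs, the sketch below assumes the chat-style JSONL layout (one JSON object per line with a `messages` list of `role`/`content` entries). The function names and the exact rules are hypothetical and are not taken from `validate_sft.py`; see `references/dataset-formats.md` for the authoritative format.

```python
import json

# Assumed set of chat roles; adjust to match references/dataset-formats.md.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_sft_line(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
    # A supervised example needs at least one assistant turn to learn from.
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to train on")
    return problems


def validate_sft_lines(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        problems = validate_sft_line(line)
        if problems:
            report[lineno] = problems
    return report
```

Passing an open file handle to `validate_sft_lines` works because it only iterates over lines; an empty report means no problems were detected by these particular checks.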

## Rules

1. **Always baseline first**: evaluate the base model before fine-tuning
2. **Validate data** before submitting: run the matching validator under `scripts/validate/` (for SFT, `scripts/validate/validate_sft.py`)
3. **Calibrate RFT graders**: target a 25-50% failure rate on the base model
4. **Evaluate checkpoints**: don't blindly deploy the final one
5. **Measure token cost** alongside accuracy when comparing models
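The calibration rule can be made concrete with a small sketch. It assumes you already have grader scores in the range 0 to 1 for base-model outputs, and that a sample fails when its score is strictly below the pass threshold; both are assumptions for illustration. The function names are hypothetical and this is not the logic of `scripts/calibrate_grader.py`.

```python
def failure_rate(scores: list[float], threshold: float) -> float:
    """Fraction of samples whose grader score falls below the threshold."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(score < threshold for score in scores) / len(scores)


def candidate_thresholds(scores, low=0.25, high=0.50, steps=20):
    """Return (threshold, failure_rate) pairs whose rate lies in [low, high].

    Thresholds are scanned on an even grid from 0 to 1 with `steps` intervals.
    """
    candidates = []
    for k in range(steps + 1):
        threshold = k / steps
        rate = failure_rate(scores, threshold)
        if low <= rate <= high:
            candidates.append((threshold, rate))
    return candidates


# Hypothetical base-model grader scores, for illustration only.
base_scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0]
for threshold, rate in candidate_thresholds(base_scores):
    print(f"pass_threshold={threshold:.2f} -> base-model failure rate {rate:.0%}")
```

If no threshold lands in the target band, the grader may be too lenient or too strict for the base model, which is a cue to revisit the grader rather than the threshold.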

## Quick Reference

| Task | Command |
|------|---------|
| Validate SFT data | `python scripts/validate/validate_sft.py data.jsonl` |
| Submit SFT job | `python scripts/submit_training.py --model gpt-4.1-mini --training-file train.jsonl --validation-file val.jsonl --type sft` |
| Monitor job | `python scripts/monitor_training.py --job-id ftjob-xxx` |
| Analyze curves | `python scripts/check_training.py --job-id ftjob-xxx` |
| Deploy model | `python scripts/deploy_model.py --model-id ft:gpt-4.1-mini:... --name my-eval` |
| Evaluate model | `python scripts/evaluate_model.py --deployment-name my-eval --test-file test.jsonl` |
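The first two rows of the table can be chained so that a failed validation blocks submission. The sketch below builds the same command lines shown above and runs them in order. It assumes the validator exits with a non-zero status on invalid data, which this document does not confirm; the function names and the injectable `runner` parameter are hypothetical conveniences.

```python
import subprocess
import sys


def build_commands(train_file: str, val_file: str,
                   model: str = "gpt-4.1-mini") -> list[list[str]]:
    """Build the validate and submit commands from the Quick Reference table."""
    python = sys.executable
    return [
        [python, "scripts/validate/validate_sft.py", train_file],
        [python, "scripts/submit_training.py",
         "--model", model,
         "--training-file", train_file,
         "--validation-file", val_file,
         "--type", "sft"],
    ]


def validate_then_submit(train_file: str, val_file: str,
                         runner=subprocess.run) -> None:
    """Run each command in order; check=True raises on the first non-zero exit."""
    for command in build_commands(train_file, val_file):
        runner(command, check=True)
```

Calling `validate_then_submit("train.jsonl", "val.jsonl")` from the skill's root directory would run the two scripts back to back. Monitoring, deployment, and evaluation are left out because they depend on the job and model IDs produced by the submission.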

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| "API version not supported" | Older `openai` SDK on `/v1/` endpoint | Upgrade to `openai>=1.0` |
| "does not support fine-tuning with Standard TrainingType" | OSS model requires the `globalStandard` training type | Pass `--use-rest`, or let the script fall back automatically |
| Job stuck in post-training eval | Under-provisioned tool endpoint (RFT) | Scale to S2+, enable Always On |
| "DeploymentNotReady" after ARM succeeds | ARM/data-plane race condition | Delete and recreate deployment, wait 5 min |
| Content safety block at deployment | PII-dense training data | Remove problematic document types |
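For the "DeploymentNotReady" row, the suggested wait can be expressed as a bounded polling loop instead of a fixed sleep. This is a generic sketch: `is_ready` is a hypothetical callable that you would supply (for example, one that sends a test request to the deployment), and the 300-second default simply mirrors the 5 minutes in the table.

```python
import time


def wait_until(is_ready, timeout_s: float = 300.0, interval_s: float = 15.0,
               sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the check succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; in normal use the defaults are sufficient.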
