airunway-aks-setup
microsoft/azure-skills
Set up AI Runway on AKS from bare cluster to running model in six steps.
What is airunway-aks-setup?
This skill guides you through installing AI Runway on an existing AKS cluster, including controller setup, GPU assessment, inference provider configuration, and first model deployment. Use it when you need end-to-end AI Runway onboarding or to resume a partially-complete setup.
- Verify cluster connectivity, node inventory, and GPU detection
- Install AI Runway controller and custom resource definitions (CRDs)
- Assess GPU hardware compatibility and flag dtype/attention constraints
- Recommend and install an inference provider (KAITO, Dynamo, or KubeRay)
- Deploy and verify your first AI model on AKS
- Report cluster health status at each step
How to install airunway-aks-setup
npx skills add https://github.com/microsoft/azure-skills --skill airunway-aks-setup- An existing AKS cluster (provision via azure-kubernetes skill if needed)
- kubectl CLI configured with cluster context
- make and curl CLI tools available
- Understanding of GPU costs (A100-80GB: $3–5+/hr)
How to use airunway-aks-setup
- 1.Run step 1 to verify cluster connectivity, node inventory, and GPU detection
- 2.Run step 2 to install the AI Runway controller and CRDs
- 3.Run step 3 to assess GPU hardware compatibility and constraints
- 4.Run step 4 to choose and install an inference provider
- 5.Run step 5 to deploy and verify your first model
- 6.Run step 6 to review summary and next steps
- 7.Use skip-to-step N argument to resume from a specific phase if setup was interrupted
Use cases
- Setting up AI Runway on an existing AKS cluster from scratch
- Installing the AI Runway controller and CRDs for model serving
- Assessing GPU hardware compatibility before deploying inference workloads
- Choosing and installing an inference provider for your cluster
- Deploying a first AI model to AKS via AI Runway
- Platform engineers setting up AKS for AI workloads
- DevOps teams onboarding model serving infrastructure
- ML engineers deploying inference on Kubernetes
- Azure administrators managing GPU clusters
airunway-aks-setup FAQ
No, but GPU is required for efficient model inference. CPU-only inference is acceptable for testing. GPU node pools incur significant compute charges.
Use the azure-kubernetes skill first to provision an AKS cluster (optionally with a GPU node pool), then return to this skill.
Yes. Provide skip-to-step N to start at a specific phase; prior steps are assumed complete.
Step 4 recommends and installs from KAITO, Dynamo, or KubeRay based on your cluster configuration.
Check controller logs with kubectl logs -n airunway-system -l control-plane=controller-manager --previous to diagnose config or RBAC issues.
Full instructions (SKILL.md)
Source of truth, from microsoft/azure-skills.
name: airunway-aks-setup description: "Set up AI Runway on AKS — from bare cluster to running model. Covers cluster verification, controller install, GPU assessment, provider setup, and first deployment. WHEN: "setup AI Runway", "onboard AKS cluster", "install AI Runway", "airunway setup", "deploy model to AKS", "GPU inference on AKS", "KAITO setup on AKS", "run LLM on AKS", "vLLM on AKS", "set up model serving on AKS", "AI Runway controller"." license: MIT metadata: author: Microsoft version: "1.0.1" argument-hint: "[skip-to-step N]"
AI Runway AKS Setup
This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides skip-to-step N to resume from a specific phase.
Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
Prerequisites
This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the azure-kubernetes skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.
Quick Reference
| Property | Value |
|---|---|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | kubectl, make, curl |
| MCP tools | None |
| Related skills | azure-kubernetes (cluster setup), azure-diagnostics (troubleshooting) |
When to Use This Skill
Use this skill when the user wants to:
- Set up AI Runway on an existing AKS cluster from scratch
- Install the AI Runway controller and CRDs
- Assess GPU hardware compatibility for model deployment
- Choose and install an inference provider (KAITO, Dynamo, KubeRay)
- Deploy their first AI model to AKS via AI Runway
- Resume a partially-complete AI Runway setup from a specific step
MCP Tools
This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.
Rules
- Execute steps in sequence — load the reference for each step as you reach it
- Report cluster state at each step: ✓ healthy, ✗ missing/failed
- Ask for user confirmation before any install or deployment action
- If a step is already complete, report status and skip to the next step
- If the user provides
skip-to-step N, start at step N; assume prior steps are complete
Steps
| # | Step | Reference |
|---|---|---|
| 1 | Cluster Verification — context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation — CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment — detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup — recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment — pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary — recap, smoke test, next steps | step-6-summary.md |
Error Handling
| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| No kubeconfig context | Not connected to a cluster | Run az aks get-credentials or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | kubectl logs -n airunway-system -l control-plane=controller-manager --previous |
| Provider not ready | Image pull or RBAC issue | kubectl logs <pod-name> -n <namespace> for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | kubectl describe modeldeployment <name> -n <namespace> events |
bfloat16 errors at inference | T4 or V100 lacks bfloat16 support | Add --dtype float16 to serving args |
For full error handling and rollback procedures, see troubleshooting.md.
Related skills
More from microsoft/azure-skills and the wider catalog.
finetuning
Fine-tune models on Azure AI Foundry with SFT, DPO, or RFT training methods.
azure-ai
Azure AI services skill for Search, Speech, OpenAI, and Document Intelligence in coding agents
azure-deploy
Execute Azure deployments for prepared applications with built-in error recovery and validation.
azure-diagnostics
Debug Azure production issues using AppLens, Azure Monitor, resource health, and systematic triage.
azure-prepare
Generate Azure deployment infrastructure (Bicep/Terraform, azure.yaml, Dockerfiles) for new or existing apps
azure-storage
Azure Storage skill: Blob, File Shares, Queue, Table, and Data Lake with access tier guidance and lifecycle management