feature: implement replica groups in service configurations
```yaml
type: service
port: 8000
commands: ["python app.py"]

replica_groups:
  - name: l40s-gpu
    replicas: 1..3  # autoscalable
    resources:
      gpu: L40S

  - name: h100-gpu
    replicas: 2  # fixed
    regions: [us-east]
    resources:
      gpu: H100
```
- Added the ability to define multiple replica groups with distinct configurations, including resource requirements and autoscaling behavior.
- Updated relevant documentation to reflect the new replica groups feature.
- Enhanced CLI output to display job plans with group names for better clarity.
- Ensured backward compatibility by excluding `replica_groups` from the serialized JSON when not set.
- Added tests to validate the functionality and backward compatibility of replica groups.
This change allows for more flexible service configurations, enabling users to manage different types of resources and scaling strategies within a single service.
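The two compatibility behaviors above (mutual exclusivity and JSON exclusion) can be sketched as follows. This is an illustrative model only — `ReplicaGroup`, `ServiceConfig`, and `parse_replicas` are hypothetical names, not dstack's actual classes:

```python
# Hypothetical sketch -- names and shapes are illustrative, not dstack's
# actual models. It shows: `replicas` and `replica_groups` are mutually
# exclusive, and `replica_groups` is omitted from the serialized JSON
# when not set (backward compatibility).
import json
import re
from dataclasses import dataclass, field
from typing import Optional, Union


def parse_replicas(spec: Union[int, str]) -> tuple:
    """An int means a fixed count; 'min..max' means autoscalable."""
    if isinstance(spec, int):
        return spec, spec
    m = re.fullmatch(r"(\d+)\.\.(\d+)", str(spec).strip())
    if m is None:
        raise ValueError(f"invalid replicas spec: {spec!r}")
    lo, hi = int(m.group(1)), int(m.group(2))
    if lo > hi:
        raise ValueError("min replicas must not exceed max replicas")
    return lo, hi


@dataclass
class ReplicaGroup:
    name: str
    replicas: Union[int, str]
    resources: dict = field(default_factory=dict)


@dataclass
class ServiceConfig:
    port: int
    replicas: Optional[int] = None
    replica_groups: Optional[list] = None

    def __post_init__(self) -> None:
        # The two fields are mutually exclusive.
        if self.replicas is not None and self.replica_groups is not None:
            raise ValueError("replicas and replica_groups are mutually exclusive")

    def to_json(self) -> str:
        # Exclude replica_groups from the JSON when not set, so older
        # clients see exactly the shape they saw before.
        data = {"port": self.port}
        if self.replicas is not None:
            data["replicas"] = self.replicas
        if self.replica_groups is not None:
            data["replica_groups"] = [vars(g) for g in self.replica_groups]
        return json.dumps(data)
```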
`contributing/RUNS-AND-JOBS.md` (1 addition, 1 deletion)
```diff
@@ -13,7 +13,7 @@ Runs are created from run configurations. There are three types of run configurations:
 2. `task` — runs the user's bash script until completion.
 3. `service` — runs the user's bash script and exposes a port through [dstack-proxy](PROXY.md).
 
-A run can spawn one or multiple jobs, depending on the configuration. A task that specifies multiple `nodes` spawns a job for every node (a multi-node task). A service that specifies multiple `replicas` spawns a job for every replica. A job submission is always assigned to one particular instance. If a job fails and the configuration allows retrying, the server creates a new job submission for the job.
+A run can spawn one or multiple jobs, depending on the configuration. A task that specifies multiple `nodes` spawns a job for every node (a multi-node task). A service that specifies multiple `replicas` or `replica_groups` spawns a job for every replica. Each job in a replica group is tagged with `replica_group_name` to track which group it belongs to. A job submission is always assigned to one particular instance. If a job fails and the configuration allows retrying, the server creates a new job submission for the job.
```
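The run-to-jobs expansion described in that paragraph can be sketched as below. This is a hypothetical helper, not dstack's actual planner; the dict keys are assumptions made for illustration:

```python
# Hypothetical sketch of the run-to-jobs expansion -- not dstack's actual
# planner. A service spawns one job per replica; jobs that come from a
# replica group carry `replica_group_name`.
def plan_service_jobs(config: dict) -> list:
    jobs = []
    groups = config.get("replica_groups")
    if groups:
        for group in groups:
            # Start at the group's minimum; autoscaling adds replicas later.
            for replica in range(group["min_replicas"]):
                jobs.append({
                    "job_num": len(jobs),
                    "replica_num": replica,
                    "replica_group_name": group["name"],
                })
    else:
        # Plain `replicas`: no group tag on the jobs.
        for replica in range(config.get("replicas", 1)):
            jobs.append({
                "job_num": len(jobs),
                "replica_num": replica,
                "replica_group_name": None,
            })
    return jobs
```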
`docs/docs/concepts/services.md` (60 additions, 0 deletions)
@@ -160,6 +160,66 @@ Setting the minimum number of replicas to `0` allows the service to scale down t

> The `scaling` property requires creating a [gateway](gateways.md).

### Replica Groups (Advanced)

For advanced use cases, you can define multiple **replica groups** with different instance types, resources, and configurations within a single service. This is useful when you want to:

- Run different GPU types in the same service (e.g., H100 for primary, RTX5090 for overflow)
- Configure different backends or regions per replica type

[…]

```yaml
# Define multiple replica groups with different configurations
replica_groups:
  - name: primary
    replicas: 1  # Always 1 H100 (fixed)
    resources:
      gpu: H100:1
    backends: [aws]
    regions: [us-west-2]

  - name: overflow
    replicas: 0..5  # Autoscales 0-5 RTX5090s
    resources:
      gpu: RTX5090:1
    backends: [runpod]

scaling:
  metric: rps
  target: 10
```
In this example:

- The `primary` group always runs 1 H100 replica on AWS (fixed, never scaled)
- The `overflow` group scales 0-5 RTX5090 replicas on RunPod based on load
- Scale operations only affect groups with autoscaling ranges (min != max)

Each replica group can override any [profile parameter](../reference/profiles.yml.md) including `backends`, `regions`, `instance_types`, `spot_policy`, etc. Group-level settings override service-level settings.

> **Note:** When using `replica_groups`, you cannot use the simple `replicas` field. They are mutually exclusive.
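The override precedence described above (group-level settings win, unset group fields inherit from the service) can be illustrated with a small merge. `effective_profile` is a hypothetical name, not a dstack API:

```python
# Illustrative only -- `effective_profile` is a hypothetical helper, not a
# dstack function. Group-level profile settings override service-level
# ones; fields the group leaves unset (None) are inherited.
def effective_profile(service_params: dict, group_params: dict) -> dict:
    merged = dict(service_params)
    merged.update({k: v for k, v in group_params.items() if v is not None})
    return merged
```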
**When to use replica groups:**

- You need different GPU types in the same service
- Different replicas should run in different regions or clouds
- Some replicas should be fixed while others autoscale
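The rule that scale operations skip fixed groups (min == max) can be sketched as below. This is an assumption-laden illustration — `apply_scale` and the dict keys are hypothetical, not dstack internals:

```python
# Hypothetical sketch of the scaling rule: groups with a fixed count
# (min == max) are never resized; autoscalable groups absorb the change,
# clamped to their configured range.
def apply_scale(groups: list, delta: int) -> list:
    result = []
    for group in groups:
        group = dict(group)  # copy so the caller's groups stay untouched
        if group["min_replicas"] != group["max_replicas"]:  # autoscalable
            desired = group["current_replicas"] + delta
            group["current_replicas"] = max(
                group["min_replicas"], min(group["max_replicas"], desired)
            )
        result.append(group)
    return result
```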
### Model

If the service is running a chat model with an OpenAI-compatible interface,