research/research_home.html
1 addition, 1 deletion
@@ -89,7 +89,7 @@ <h3> Thematic Questions </h3>
            </ul>
        </li>
-       <li><b> Are the existing resource managers for clusters (such as SLURM, Kubernetes, or general cloud-infra) efficient, portable, and friendly enough? </b>
+       <li><b> Are the existing resource managers for clusters (such as SLURM, Kubernetes, or general cloud-infra) efficient, portable, and friendly enough to <em> nicely </em> support AI workloads? </b>
            <ul>
                <!-- <li> How should this higher-level resource manager interact with collective programming frameworks, such as Nvidia's NCCL, AMD's RCCL, or Intel's oneCCL? Is this as efficient and scalable as it could be? <em> What about building a system which supports vendor-agnostic collective programming? </em> </li> -->
                <!-- <li> <em> Given the explosion in architectures and accelerators, we would ideally like a system that is compatible with hardware from various vendors. </em> There is current support for CUDA devices, but this support is a second-class priority and the configuration does not appear to be user-friendly or scalable. SLURM interfaces with Nvidia's Multi-Process Service (MPS) and Multi-Instance GPU (MIG) so multiple jobs can share an individual device's resources; however, there are limitations and this current structure will not be compatible with the advanced GPUs being developed by other vendors. I believe there is room for improved system design. </li>
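The commented-out question about vendor-agnostic collective programming could be served by a thin dispatch layer that hides which vendor library (NCCL, RCCL, oneCCL) actually runs the collective. A minimal Python sketch of that idea, where every name is hypothetical (not any real library's API) and a local CPU backend stands in for a vendor implementation:

```python
# Hypothetical sketch: a registry of collective backends behind one interface.
# Real backends would wrap NCCL / RCCL / oneCCL; here a CPU stand-in mimics
# an all-reduce by summing each rank's vector element-wise.
from abc import ABC, abstractmethod


class CollectiveBackend(ABC):
    """Common interface a vendor collective library would implement."""

    @abstractmethod
    def all_reduce(self, per_rank_values):
        """Combine one vector per rank into a single reduced vector."""


class CPUBackend(CollectiveBackend):
    """Stand-in backend: element-wise sum across ranks, computed locally."""

    def all_reduce(self, per_rank_values):
        return [sum(column) for column in zip(*per_rank_values)]


_REGISTRY = {}


def register_backend(name, backend):
    """Make a backend selectable by name (e.g. chosen per hardware vendor)."""
    _REGISTRY[name] = backend


def get_backend(name):
    return _REGISTRY[name]


register_backend("cpu", CPUBackend())

# Three "ranks" each contribute a vector; all_reduce sums them element-wise.
ranks = [[1, 2], [3, 4], [5, 6]]
print(get_backend("cpu").all_reduce(ranks))  # → [9, 12]
```

The point of the sketch is only the shape of the abstraction: job code calls `all_reduce` against the interface, and the resource manager (not the application) decides which registered vendor backend satisfies it.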