|
BTW, @6erun Is it expected? I see only the following offers in |
|
Yes, we have more types now and we will update |
|
@peterschmidt85 I have added RTX 5090 and RTX PRO 6000 here dstackai/gpuhunt#158 We will start adding PRO 6000 nodes next week. These are all GPUs we have for on-demand rental at the moment; others we're offering on a month-to-month basis and cannot be exposed for on-demand for now. Is there anything else we need to look into? |
No problem. There is also a small PR in gpuhunt to enable RTX 5090 and RTX PRO 6000: dstackai/gpuhunt#158 |
@6erun, thanks again for the PR. A couple of things didn't work for me at first, but I managed to make them work with some minor tweaks. Please see my review comments for details. I've provided suggestions for most of them, so I hope they will be easy to address.
Additionally, I've merged dstackai/gpuhunt#158. However, I couldn't get RTX 5090 to work with dstack:
- I've tried a few times to run a dev environment on rtx59-16c-nr.2. The dev environment started successfully, but I got this error in the container shell:
    (workflow) root@riftvm:~# nvidia-smi
    -bash: nvidia-smi: command not found
  And this one on the host:
    riftuser@riftvm:~$ nvidia-smi
    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
- I've tried running a dev environment on rtx59-16c-nr.1 once, but it didn't start. It was stuck in provisioning for 10 minutes. The instance_info variable in update_provisioning_data() contained this:
  instance_info
{'id': '89a57132-5072-11f0-8f1e-db636327bfe3', 'status': 'Inactive', 'node_id': 'e381ba8a-1d41-11f0-aa9a-cfdc5b0bc398', 'node_mode': 'Container', 'node_status': 'Ready', 'cpu': {'vendor': 'AMD', 'vendor_logo_url': None, 'brand': 'AMD EPYC 7B13 64-Core Processor', 'brand_short': 'EPYC 7B13', 'physical_core_count': 128, 'logical_core_count': 256}, 'cpu_mask': 'ffff00000000', 'cpu_limit': 16, 'dram': 1081953382400, 'dram_limit': 107374182400, 'disk_limit': 0, 'gpus': [{'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:01:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:24:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:41:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:61:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:81:00.0'}, 
{'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:a1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:c1:00.0'}, {'vendor': 'NVIDIA Corporation', 'vendor_logo_url': 'https://storage.googleapis.com/cloudrift-resources/images/logo/nvidia-compound-white.svg', 'brand': 'NVIDIA GeForce RTX 5090', 'brand_short': 'RTX 5090', 'vram': 34190917632, 'pci_vendor_id': 4318, 'pci_device_id': 11141, 'pci_slot': '0000:e1:00.0'}], 'gpu_mask': '4', 'gpu_limit': 1, 'host_address': '142.214.185.236', 'resource_info': {'provider_name': 'NeuralRack', 'instance_type': 'rtx59-16c-nr.1', 'cost_per_hour': 65.0}, 'usage_info': {'usage': {'secs': 1, 'nanos': 705464000}, 'accounted_usage': {'secs': 0, 'nanos': 0}, 'user_email': '*** redacted ***'}, 'virtual_machines': [], 'containers': [], 'instructions': {'instructions_template': '*** redacted ***', 'placeholder_values': [['NODE_IP', '142.214.185.236'], ['EXECUTOR_SHORT_ID', '89a57132']]}, 'reservation_data': None}
  After 10 minutes, dstack tried to terminate the instance, but that also failed:
    ComputeError('Failed to terminate instance 89a57132-5072-11f0-8f1e-db636327bfe3 in region us-east-nc-nr-1.')
  We couldn't find the instance in the console after that, so I assume it wasn't created correctly.
rtx49-8c-nr.1 and rtx49-8c-nr.2 worked as expected.
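
For anyone triaging a similar stuck instance, here is a minimal sketch of how the instance_info dump above could be summarized. It assumes that cpu_mask and gpu_mask are hexadecimal bitmasks and that bit positions in gpu_mask index into the gpus list; neither is confirmed anywhere in this thread, although the 16 set bits in cpu_mask do match cpu_limit. The helpers mask_to_indices and summarize_instance are hypothetical names and are not part of dstack or the CloudRift API.

```python
from typing import Any, Dict, List


def mask_to_indices(mask_hex: str) -> List[int]:
    """Return the positions of set bits in a hexadecimal bitmask string."""
    value = int(mask_hex, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def summarize_instance(info: Dict[str, Any]) -> Dict[str, Any]:
    """Pull out the fields that matter when an instance never leaves provisioning."""
    # Assumption: bit i of gpu_mask corresponds to info["gpus"][i].
    gpu_indices = mask_to_indices(info["gpu_mask"])
    return {
        "status": info["status"],
        "node_status": info["node_status"],
        "assigned_gpu_slots": [info["gpus"][i]["pci_slot"] for i in gpu_indices],
        "assigned_cpu_count": len(mask_to_indices(info["cpu_mask"])),
    }


# Trimmed copy of the dump above, keeping only the fields used here.
instance_info = {
    "status": "Inactive",
    "node_status": "Ready",
    "cpu_mask": "ffff00000000",
    "cpu_limit": 16,
    "gpu_mask": "4",
    "gpu_limit": 1,
    "gpus": [
        {"pci_slot": slot}
        for slot in [
            "0000:01:00.0", "0000:24:00.0", "0000:41:00.0", "0000:61:00.0",
            "0000:81:00.0", "0000:a1:00.0", "0000:c1:00.0", "0000:e1:00.0",
        ]
    ],
}

summary = summarize_instance(instance_info)
print(summary)
```

Under those assumptions, the node reports Ready while the instance itself stays Inactive, which is consistent with the instance never having been created properly.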
    return response_data.get("instance_types", [])
    return []

    def list_recipies(self) -> List[Dict]:
(nit) Typo here and in a few more places below
Suggested change:
    - def list_recipies(self) -> List[Dict]:
    + def list_recipes(self) -> List[Dict]:
|
Thank you @jvstme for the feedback! I think I addressed all your comments. |
|
@jvstme regarding the issues with 5090, we've had some driver issues on those machines lately that require us to reset them manually. I think it's better to test with 4090 for now.
jvstme left a comment
Looks good to me! I'll merge the PR now, so the integration will be part of our next release this week. Thank you for the contribution.
> regarding the issues with 5090, we've had some driver issues on those machines lately that require us to reset them manually.
Okay, no worries. However, if you expect this problem to persist for some time, I'd recommend temporarily excluding 5090s from gpuhunt so that they are not suggested to users who may want to try the integration once we announce it. We can easily remove or add offers in gpuhunt without a release.
Thanks for the tip and the help with the integration! We've made some changes to the virtualization stack and will test to see if that helps with 5090 instability. If it doesn't, we'll either remove 5090 from the gpuhunt CloudRift manifest generation logic or disable it in the backend. |
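
A rough sketch of what excluding a GPU model from the generated offers could look like. The Offer dataclass, the exclude_gpus helper, and the sample GPU names are all hypothetical and only illustrate the idea; this is not the actual gpuhunt catalog schema or CloudRift manifest code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Offer:
    # Hypothetical, simplified stand-in for a catalog entry.
    instance_name: str
    gpu_name: str


def exclude_gpus(offers: Iterable[Offer], excluded: Set[str]) -> List[Offer]:
    """Drop offers whose GPU name is in the excluded set (case-insensitive)."""
    blocked = {name.lower() for name in excluded}
    return [offer for offer in offers if offer.gpu_name.lower() not in blocked]


# Instance names come from this thread; the GPU name strings are illustrative.
offers = [
    Offer("rtx49-8c-nr.1", "RTX 4090"),
    Offer("rtx49-8c-nr.2", "RTX 4090"),
    Offer("rtx59-16c-nr.1", "RTX 5090"),
    Offer("rtx59-16c-nr.2", "RTX 5090"),
]

remaining = exclude_gpus(offers, {"RTX 5090"})
print([offer.instance_name for offer in remaining])
```

Keeping the exclusion as a set of names in one place would make it easy to re-enable the 5090 offers once the driver issue is resolved.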
Added CloudRift backend (cloudrift.ai).
The gpuhunt part was done in dstackai/gpuhunt#133