Merged
363 changes: 71 additions & 292 deletions README.md

Large diffs are not rendered by default.

514 changes: 514 additions & 0 deletions docs/iam_setup.md

Large diffs are not rendered by default.

545 changes: 545 additions & 0 deletions docs/job_notifications.md

Large diffs are not rendered by default.

536 changes: 528 additions & 8 deletions docs/spec.md

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions docs/troubleshooting.md
@@ -0,0 +1,26 @@
# Troubleshooting (FAQ)

## Overview

This document covers common problems you may encounter when using the Nova Forge SDK and how to work around them.

## Permissions-Based Issues
### Unable to Deploy a Custom Model to Bedrock
If permission issues prevent you from using the SDK's built-in `deploy()` function, you can call the Bedrock APIs directly to import and deploy your model.
This approach still requires some IAM permissions to be set up.
The steps are outlined below.

#### Step 1: Locate Your Training Artifacts and Extract Your `checkpoint_s3_path`
* First, find the S3 location where your training job output is saved (`output_s3_path`).
* For SMTJ jobs, follow the steps [here](https://docs.aws.amazon.com/nova/latest/nova2-userguide/nova-iterative-training.html#nova-iterative-how-it-works) to get the S3 escrow location where your model is saved.
* For SMHP jobs, navigate to your `output_s3_path` folder in S3 and open the `manifest.json` file, which contains only the `checkpoint_s3_path` value.
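
For SMHP jobs, reading the manifest can also be scripted. The following is a minimal sketch, not part of the SDK: the helper names are hypothetical, and it assumes `manifest.json` sits directly under `output_s3_path` and is a JSON object with a `checkpoint_s3_path` key.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/prefix' into (bucket, 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def checkpoint_path_from_manifest(manifest_text: str) -> str:
    """Return the checkpoint_s3_path value stored in a manifest.json document."""
    manifest = json.loads(manifest_text)
    return manifest["checkpoint_s3_path"]


def read_manifest(output_s3_path: str) -> str:
    """Download manifest.json from output_s3_path (assumed location) as text."""
    import boto3  # imported lazily so the pure helpers above work without AWS access

    bucket, prefix = split_s3_uri(output_s3_path)
    # Join prefix and file name without producing a leading or doubled slash.
    key = "/".join(part for part in (prefix.rstrip("/"), "manifest.json") if part)
    s3 = boto3.client("s3")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")


# Example usage (requires AWS credentials with read access to the bucket):
# checkpoint_s3_path = checkpoint_path_from_manifest(
#     read_manifest("s3://my-bucket/my-training-output/")  # placeholder path
# )
```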

#### Step 2: Import Your Custom Model from S3 Escrow
* Follow the steps in [Create a Custom Model](https://docs.aws.amazon.com/bedrock/latest/userguide/create-custom-model-sdks.html).
* For the `s3Uri` value under `modelSourceConfig`, provide the `checkpoint_s3_path` value from Step 1.
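
As a rough illustration of this step with boto3, the sketch below wraps the Bedrock `create_custom_model` call. The function name, model name, and role ARN are placeholders, and the request shape reflects my reading of the linked AWS guide; confirm the required parameters there before relying on it.

```python
def import_custom_model(bedrock, model_name, checkpoint_s3_path, role_arn):
    """Import a checkpoint from S3 escrow as a Bedrock custom model.

    `bedrock` is expected to be a boto3 "bedrock" client. The argument names
    are illustrative; only the keyword arguments passed to
    create_custom_model come from the Bedrock API.
    """
    response = bedrock.create_custom_model(
        modelName=model_name,
        # The checkpoint_s3_path from Step 1 goes in s3Uri.
        modelSourceConfig={"s3DataSource": {"s3Uri": checkpoint_s3_path}},
        roleArn=role_arn,
    )
    return response["modelArn"]


# Example usage (requires AWS credentials and a suitable service role):
# import boto3
# model_arn = import_custom_model(
#     boto3.client("bedrock"),
#     "my-nova-model",                                   # placeholder name
#     checkpoint_s3_path,                                # value from Step 1
#     "arn:aws:iam::123456789012:role/MyImportRole",     # placeholder role
# )
```

Passing the client in as an argument keeps the function easy to exercise with a stub before running it against a real account.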

#### Step 3: Deploy Your Custom Model in Bedrock
* After your custom model is imported from escrow, follow the steps [here](https://docs.aws.amazon.com/bedrock/latest/userguide/deploy-custom-model-on-demand.html#deploy-custom-model) to deploy it to Bedrock using the Console, the AWS CLI, or the Bedrock APIs.
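
If you choose the Bedrock APIs, the deployment step might look like the sketch below. It assumes the boto3 `bedrock` client's `create_custom_model_deployment` and `get_custom_model_deployment` operations and the status strings `Active` and `Failed`; the function name, deployment name, and polling defaults are placeholders, so check them against the linked AWS guide.

```python
import time


def deploy_custom_model(bedrock, model_arn, deployment_name,
                        poll_seconds=30.0, max_polls=60):
    """Create an on-demand deployment and wait until it is usable.

    `bedrock` is expected to be a boto3 "bedrock" client. Returns the
    deployment ARN once the deployment reports an Active status.
    """
    created = bedrock.create_custom_model_deployment(
        modelDeploymentName=deployment_name,
        modelArn=model_arn,
    )
    deployment_arn = created["customModelDeploymentArn"]

    # Poll a bounded number of times so the caller never waits forever.
    for _ in range(max_polls):
        status = bedrock.get_custom_model_deployment(
            customModelDeploymentIdentifier=deployment_arn
        )["status"]
        if status == "Active":
            return deployment_arn
        if status == "Failed":
            raise RuntimeError(f"Deployment failed: {deployment_arn}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Deployment not active after {max_polls} polls")


# Example usage (requires AWS credentials):
# import boto3
# deployment_arn = deploy_custom_model(
#     boto3.client("bedrock"), model_arn, "my-nova-deployment"  # placeholder name
# )
```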

#### Notes
* If you run into permission issues when importing or deploying your custom Nova model, review the AWS documentation: [Create a service role for importing pre-trained models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-import-iam-role.html).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -19,7 +19,7 @@ readme = "README.md"
dependencies = [
"boto3",
"matplotlib",
"sagemaker==3.5.0",
"sagemaker>=3.5.0",
]

[project.optional-dependencies]
33 changes: 32 additions & 1 deletion samples/nova_quickstart.ipynb
@@ -652,6 +652,37 @@
"monitor.plot_metrics(training_method=TrainingMethod.SFT_LORA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### C) Enable Job Notifications (SMTJ, SMHP)\n",
"* The Forge SDK provides job notification support via email. \n",
"* For more information, refer to [`docs/job_notifications.md`](../docs/job_notifications.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Enable job notifications for SMTJ (given a TrainingResult object)\n",
"# Note: This will attempt to set up a CloudFormation stack if the infrastructure isn't deployed yet.\n",
"training_result.enable_job_notifications(\n",
" emails=[\"user@example.com\"],\n",
" # kms_key_id=\"1234abcd-12ab-34cd-56ef-1234567890ab\", # Optional customer KMS key\n",
")\n",
"\n",
"# Enable job notifications for SMHP (given a TrainingResult object)\n",
"training_result.enable_job_notifications(\n",
" emails=[\"user@example.com\"],\n",
" namespace=\"kubeflow\", # REQUIRED: Kubernetes namespace where job runs\n",
" kubectl_layer_arn=\"arn:aws:lambda:<region>:123456789012:layer:kubectl:1\", # REQUIRED\n",
" # kms_key_id=\"1234abcd-12ab-34cd-56ef-1234567890ab\" # Optional customer KMS key\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -1139,7 +1170,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
"version": "3.12.4"
}
},
"nbformat": 4,
9 changes: 5 additions & 4 deletions samples/rft_multiturn_quickstart.ipynb
@@ -403,9 +403,7 @@
"source": [
"## Step 5: Evaluation Setup and Execution (Optional)\n",
"\n",
"You can run evaluation before training to test your base model, or after training to test your fine-tuned model.\n",
"\n",
"**Important**: RFT multiturn evaluation uses the environment's built-in examples, NOT a separate dataset. The environment generates examples based on the `num_examples` parameter."
"You can run evaluation before training to test your base model, or after training to test your fine-tuned model.\n"
]
},
{
@@ -659,7 +657,6 @@
")\n",
"\n",
"print(f\"\\n✅ Evaluation started: {eval_result.job_id}\")\n",
"print(f\" Environment will generate {50} examples\")\n",
"eval_result.dump(file_name=\"eval_result.json\")"
]
},
@@ -669,6 +666,8 @@
"metadata": {},
"outputs": [],
"source": [
"# from amzn_nova_forge.model.result import EvaluationResult\n",
"\n",
"# Load evaluation result from file (e.g., after notebook restart)\n",
"# loaded_eval_result = EvaluationResult.load(\"eval_result.json\")\n",
"# print(\"✅ Evaluation result loaded from file\")\n",
@@ -1018,6 +1017,8 @@
"metadata": {},
"outputs": [],
"source": [
"# from amzn_nova_forge.model.result import TrainingResult\n",
"\n",
"# Load training result from file (e.g., after notebook restart)\n",
"# loaded_training_result = TrainingResult.load(\"training_result.json\")\n",
"# print(\"✅ Training result loaded from file\")\n",