Skip to content

Latest commit

 

History

History
355 lines (275 loc) · 9.65 KB

File metadata and controls

355 lines (275 loc) · 9.65 KB

Data Preparation Guide

This guide walks through the data preparation workflow using the Nova Forge SDK's dataset loaders.

Note: If you are upgrading from an earlier version of the SDK, some method names and parameters have changed. The old names still work but will emit deprecation warnings. See Migrating from Earlier Versions at the end of this guide.

Overview

The SDK provides three dataset loaders for different file formats, all sharing the same chainable API:

from amzn_nova_forge import (
    CSVDatasetLoader,
    JSONDatasetLoader,
    JSONLDatasetLoader,
    Model,
    TrainingMethod,
)
from amzn_nova_forge.dataset.operations.transform_operation import TransformMethod
from amzn_nova_forge.dataset.operations.validate_operation import ValidateMethod

The typical workflow is: load → transform → validate → save. Each method returns the loader instance, so you can chain them:

loader = JSONLDatasetLoader()
loader.load("data.jsonl").transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={"question": "q", "answer": "a"},
).validate(
    method=ValidateMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
).save("output.jsonl")

Loading Data

JSONL

loader = JSONLDatasetLoader()
loader.load("training_data.jsonl")

# Also supports S3 paths
loader.load("s3://my-bucket/data/training_data.jsonl")

JSON

loader = JSONDatasetLoader()
loader.load("training_data.json")

CSV

loader = CSVDatasetLoader()
loader.load("training_data.csv")

Inspecting Data

Use show() to preview the current state of the dataset at any point in the pipeline:

loader.load("data.jsonl")
loader.show()        # Shows first 10 rows
loader.show(n=3)     # Shows first 3 rows

Transforming Data

transform() applies a transformation to the dataset. The method parameter selects which transformation to run.

Method Description
TransformMethod.SCHEMA (default) Convert between data format schemas (e.g., generic Q/A → Converse, OpenAI → Converse)

Schema Transforms

Schema transforms convert your data into the format required by a specific training method and model combination. If the data is already in the correct format, it's a no-op.

Column Mappings

When your source data uses different column names than what the SDK expects, provide a column_mappings dict to transform(). The format is {"standard_name": "your_column_name"}.

SFT (Supervised Fine-Tuning)

# From plain Q/A format to Converse format
loader = JSONLDatasetLoader()
loader.load("qa_data.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={"question": "q", "answer": "a"},
)

The SDK also handles OpenAI Messages format automatically — no column mappings needed:

loader = JSONLDatasetLoader()
loader.load("openai_format.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
)

RFT (Reinforcement Fine-Tuning)

loader = JSONLDatasetLoader()
loader.load("rft_data.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.RFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={"question": "prompt", "reference_answer": "answer"},
)

CPT (Continued Pre-Training)

loader = JSONLDatasetLoader()
loader.load("documents.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.CPT,
    model=Model.NOVA_LITE_2,
    column_mappings={"text": "content"},
)

Evaluation

loader = JSONLDatasetLoader()
loader.load("eval_data.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.EVALUATION,
    model=Model.NOVA_LITE_2,
    column_mappings={"query": "question", "response": "answer"},
)

Column Mapping Reference

Training Method Required Mappings Optional Mappings
SFT question, answer system, image_format/video_format, s3_uri, bucket_owner
RFT question, reference_answer system, id, tools
Evaluation query, response images, metadata
CPT text

Validating Data

validate() checks that the dataset conforms to requirements. The method parameter selects which validation to run.

Method Description
ValidateMethod.SCHEMA (default) Validate dataset structure against the requirements for a training method and model

Schema Validation

Run schema validation after transforming to catch issues before submitting a training job.

loader.validate(
    method=ValidateMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
)

Checks include:

  • Required fields are present
  • Field types and formats are correct
  • Role alternation in conversations (user/assistant)
  • Optional field consistency across samples
  • Forbidden keywords in content

For evaluation datasets, pass the eval_task parameter:

from amzn_nova_forge import EvaluationTask

loader.validate(
    method=ValidateMethod.SCHEMA,
    training_method=TrainingMethod.EVALUATION,
    model=Model.NOVA_LITE_2,
    eval_task=EvaluationTask.GEN_QA,
)

Splitting Data

split() divides the dataset into train, validation, and test sets.

train, val, test = loader.split()  # Default: 80/10/10 split

# Custom ratios
train, val, test = loader.split(
    train_ratio=0.7,
    val_ratio=0.2,
    test_ratio=0.1,
    seed=42,
)

# Save each split
train.save("train.jsonl")
val.save("val.jsonl")
test.save("test.jsonl")

Each returned loader is fully functional — you can call show(), transform(), validate(), or save() on it.


Saving Data

save() writes the current dataset to a local file or S3 path. Supports .json and .jsonl formats.

# Save locally
loader.save("output/training_data.jsonl")

# Save to S3
loader.save("s3://my-bucket/data/training_data.jsonl")

End-to-End Examples

SFT from CSV

loader = CSVDatasetLoader()
loader.load("raw_data.csv")
loader.show(n=2)

loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={"question": "input", "answer": "output", "system": "system_prompt"},
)
loader.validate(
    method=ValidateMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
)
loader.save("s3://my-bucket/sft_training_data.jsonl")

RFT with train/test split

loader = JSONLDatasetLoader()
loader.load("rft_data.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.RFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={"question": "prompt", "reference_answer": "answer"},
)

train, val, test = loader.split(train_ratio=0.9, val_ratio=0.05, test_ratio=0.05)
train.save("s3://my-bucket/rft_train.jsonl")
val.save("s3://my-bucket/rft_val.jsonl")

OpenAI format with tool calls (Nova 2.0)

loader = JSONLDatasetLoader()
loader.load("openai_with_tools.jsonl")

# Tool calls are only supported on Nova 2.0
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
)
loader.validate(
    method=ValidateMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
)
loader.save("converse_with_tools.jsonl")

Multimodal data with images

loader = JSONLDatasetLoader()
loader.load("image_captions.jsonl")
loader.transform(
    method=TransformMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
    column_mappings={
        "question": "caption_prompt",
        "answer": "caption",
        "image_format": "image_format",
        "s3_uri": "s3_uri",
        "bucket_owner": "bucket_owner",
    },
)
loader.validate(
    method=ValidateMethod.SCHEMA,
    training_method=TrainingMethod.SFT_LORA,
    model=Model.NOVA_LITE_2,
)
loader.save("s3://my-bucket/multimodal_training.jsonl")

Migrating from Earlier Versions

If you are upgrading from an earlier version of the SDK, the following changes apply. The old syntax continues to work but will emit deprecation warnings.

Old Syntax What Changed New Syntax
JSONLDatasetLoader(question="q", answer="a") Column mappings moved to transform() JSONLDatasetLoader() — pass column_mappings to transform() instead
loader.transform(TrainingMethod.SFT_LORA, Model.NOVA_LITE) method param renamed to training_method; new method param selects transform type loader.transform(method=TransformMethod.SCHEMA, training_method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE)
loader.transform(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE) Same as above Same as above
loader.validate(TrainingMethod.SFT_LORA, Model.NOVA_LITE) method param renamed to training_method; new method param selects validation type loader.validate(method=ValidateMethod.SCHEMA, training_method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE)
loader.validate(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE) Same as above Same as above
loader.save_data("output.jsonl") Method renamed loader.save("output.jsonl")
loader.split_data(0.8, 0.1, 0.1) Method renamed loader.split(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1)