Conversation
Pull request overview
This PR fixes the handling of top-level sample data in evaluation output conversion. The changes address how generated sample data is parsed and stored, and ensure token counts are properly normalized to integers instead of floats.
Changes:
- Added parsing logic for the `sample.generated_sample_data` field to extract and populate top-level sample information
- Modified token count fields (`prompt_tokens`, `completion_tokens`, `total_tokens`) to convert float values to integers
- Updated test data files to include new sample data fields and reflect integer token counts
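For orientation, here is a minimal sketch of the parsing step described above, assuming the names that appear in the review snippets below (`input_data`, `top_sample`, `top_sample_str`); the sample payload itself is hypothetical:

```python
import json

# Hypothetical flattened row from the old evaluation output format,
# where the generated sample arrives as a JSON string under a dotted key.
input_data = {
    "sample.generated_sample_data": '{"query": "hi", "response": "hello"}',
}

top_sample = {}
top_sample_str = input_data.get("sample.generated_sample_data")
if top_sample_str:
    try:
        # Populate the top-level sample from the serialized JSON payload.
        top_sample = json.loads(top_sample_str)
        input_data.pop("sample.generated_sample_data", None)
    except Exception:
        # Parse failures are logged; the review comments below discuss how
        # (and whether) the raw string should be preserved.
        pass

print(top_sample)  # {'query': 'hi', 'response': 'hello'}
```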
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py | Added logic to parse and extract generated_sample_data from input, and modified token count handling to convert values to integers |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_old_output_test.jsonl | Updated test input data to include sample.generated_sample_data and output_status fields |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_expected_output.json | Updated expected output to reflect integer token counts and new top-level sample data structure |
```python
except Exception as e:
    logger.error(
        f"Failed to parse generated_sample_data as JSON for row {row_idx}, eval_id: {eval_id}, eval_run_id: {eval_run_id}. Storing as string. Error: {e}"
```
The error log message may expose sensitive user content from generated_sample_data. According to the repository's security practices for log redaction, user-provided content should be redacted from log messages to prevent credentials or sensitive information from appearing in telemetry. The generated_sample_data can contain queries, responses, and other user inputs that should not be logged. Consider using logger.debug instead of logger.error, or ensure the exception message doesn't contain the actual data content.
Suggested change:

```diff
-except Exception as e:
-    logger.error(
-        f"Failed to parse generated_sample_data as JSON for row {row_idx}, eval_id: {eval_id}, eval_run_id: {eval_run_id}. Storing as string. Error: {e}"
+except Exception:
+    logger.error(
+        "Failed to parse generated_sample_data as JSON for row %s, eval_id: %s, eval_run_id: %s. "
+        "Storing as string.",
+        row_idx,
+        eval_id,
+        eval_run_id,
+        exc_info=True,
```
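Beyond redaction, the `%s` placeholders defer string formatting to the logging framework, and dropping `{e}` from the message keeps the exception text (which may echo user content) out of the log line itself; `exc_info=True` still records the traceback, which telemetry formatters can redact or drop independently.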
```python
metric_dict["sample"]["usage"]["total_tokens"] = (
    None if _is_none_or_nan(metric_value) else int(float(metric_value))
)
```
The `int(float(metric_value))` conversion can raise `ValueError` if `metric_value` is a string that cannot be converted to float. Consider adding exception handling or validation before conversion to handle edge cases gracefully. For example, if `metric_value` is an empty string or contains non-numeric characters, this will crash.
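One way to harden all three call sites (a hedged sketch; `_coerce_token_count` is a hypothetical helper name, and the `_is_none_or_nan` stand-in below approximates the SDK helper referenced in the snippets):

```python
import math

def _is_none_or_nan(value):
    # Stand-in for the SDK helper referenced in the snippets above.
    return value is None or (isinstance(value, float) and math.isnan(value))

def _coerce_token_count(metric_value):
    """Best-effort conversion of a token count to int; None on failure."""
    if _is_none_or_nan(metric_value):
        return None
    try:
        return int(float(metric_value))
    except (TypeError, ValueError):
        # Empty strings, non-numeric strings, and unexpected types
        # degrade to None instead of raising.
        return None

assert _coerce_token_count("42.0") == 42
assert _coerce_token_count("") is None
assert _coerce_token_count(None) is None
```

Each assignment would then reduce to, e.g., `metric_dict["sample"]["usage"]["total_tokens"] = _coerce_token_count(metric_value)`.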
```python
metric_dict["sample"]["usage"]["prompt_tokens"] = (
    None if _is_none_or_nan(metric_value) else int(float(metric_value))
)
```
The `int(float(metric_value))` conversion can raise `ValueError` if `metric_value` is a string that cannot be converted to float. Consider adding exception handling or validation before conversion to handle edge cases gracefully. For example, if `metric_value` is an empty string or contains non-numeric characters, this will crash.
```python
metric_dict["sample"]["usage"]["completion_tokens"] = (
    None if _is_none_or_nan(metric_value) else int(float(metric_value))
)
```
The `int(float(metric_value))` conversion can raise `ValueError` if `metric_value` is a string that cannot be converted to float. Consider adding exception handling or validation before conversion to handle edge cases gracefully. For example, if `metric_value` is an empty string or contains non-numeric characters, this will crash.
```python
except Exception as e:
    logger.error(
        f"Failed to parse generated_sample_data as JSON for row {row_idx}, eval_id: {eval_id}, eval_run_id: {eval_run_id}. Storing as string. Error: {e}"
    )
```
The error log message states "Storing as string" but the code doesn't actually store the `generated_sample_data` when JSON parsing fails. The `top_sample` remains an empty dict, and the `generated_sample_data` is not preserved in `input_data` either. This means the data is lost rather than stored as a string. Either remove the misleading "Storing as string" text from the error message, or actually implement fallback logic to store the raw string value.
Suggested change:

```diff
             )
         )
+        # Fallback: store the raw string value in the sample when JSON parsing fails
+        top_sample = {"generated_sample_data": top_sample_str}
+        # Remove the raw string from input_data to keep behavior consistent
+        input_data.pop("sample.generated_sample_data", None)
```
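Note that this fallback changes the output shape on parse failure: downstream consumers would see a sample dict whose only key is the raw `generated_sample_data` string, so anything reading the converted output should tolerate both shapes.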