fix(PIPL): replace mocked static float metrics with dynamic runtime e…#404
fix(PIPL): replace mocked static float metrics with dynamic runtime e…#404ARYANPATEL-BIT wants to merge 1 commit into
Conversation
…valuation Signed-off-by: Aryan Patel <aryan.patel7291@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ARYANPATEL-BIT The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/assign @MooreZheng |
There was a problem hiding this comment.
Code Review
This pull request transitions the privacy-preserving LLM implementation from simulated metrics to dynamic calculations based on actual inference data and execution timing. It introduces new tracking for edge and cloud processing durations and implements utility scoring using standard metrics. Feedback indicates that the privacy budget and compliance violation metrics are currently initialized but not updated during prediction, which results in inaccurate evaluations. Furthermore, the network_overhead calculation is identified as representing framework processing time rather than pure network latency, suggesting a need for more precise naming.
| self.metrics['inference_count'] += 1 | ||
| self.metrics['total_inference_time'] += inference_time | ||
| self.metrics['total_edge_time'] += edge_duration | ||
| self.metrics['total_cloud_time'] += cloud_duration |
There was a problem hiding this comment.
The privacy_budget_consumed and compliance_violations metrics are initialized in the constructor but are never updated during the predict method. This results in _evaluate_privacy and _evaluate_compliance using zeroed values, which leads to incorrect evaluation results (e.g., a perfect compliance score regardless of actual violations). These should be updated alongside the other performance metrics.
self.metrics['total_cloud_time'] += cloud_duration
self.metrics['privacy_budget_consumed'] += protected_data.get('budget_consumed', 0.0)
self.metrics['compliance_violations'] = len(self.compliance_monitor.compliance_violations)| 'throughput': float(inference_count / total_time), | ||
| 'edge_processing_time': float(edge_time / inference_count), | ||
| 'cloud_processing_time': float(cloud_time / inference_count), | ||
| 'network_overhead': float(max(0, (total_time - edge_time - cloud_time)) / inference_count) |
There was a problem hiding this comment.
The calculation for network_overhead is conceptually incorrect. Since cloud_duration (which populates cloud_time) is measured around the _cloud_inference call, it already includes the actual network latency of the API request to the cloud provider. The value calculated here (total_time - edge_time - cloud_time) actually represents the local processing overhead introduced by the privacy-preserving framework (privacy analysis, encryption, and compliance auditing). Consider renaming this metric to framework_overhead or privacy_processing_time to avoid misinterpretation.
Review Summary for PR #404Recommendation: IntroductionThis PR replaces mocked static evaluation blocks in However, branch-backed reproduction shows that the PR does not restore the PIPL benchmark end-to-end, because the example still fails earlier in the Ianvs pipeline. Changed modules / files
Intended fix scope
Reproduction environment
Commands executedcd /home/aditya/work/review-pr404/examples/PIPL/edge-cloud_collaborative_learning_bench
python -m pip install -r requirements.txt
cd /home/aditya/work/review-pr404
ianvs -f examples/PIPL/edge-cloud_collaborative_learning_bench/benchmarkingjob.yaml
cd /home/aditya/work/review-pr404/examples/PIPL/edge-cloud_collaborative_learning_bench
ianvs -f benchmarkingjob.yaml
# synthetic dataset probe copied into worktree
ianvs -f benchmarkingjob.yamlObserved behaviorBefore the modified evaluator is reachable, the branch still shows:
Before vs after assessment
So the PR improves one layer, but the benchmark still cannot be restored by merging this PR alone. Hidden remaining issues
Architectural concernsThis is the most important observation from this review:
Therefore this PR improves an internal evaluator layer, but it does not fully reconnect PIPL to the Ianvs metric execution contract. CI / reproducibility implications
Final maintainer noteI would not merge this PR as the restoration PR for PIPL in its current form. The right framing is:
|
Code Review: fix(PIPL) — Replace Mocked Static Metrics with Dynamic Runtime EvaluationRecommendation: Request ChangesThis PR targets the right problem (issue #403) and the structural direction is correct, but a fundamental gap makes utility metrics still scientifically invalid after the patch. Module detailsOne file changed: What this PR fixes correctly
Critical remaining issue:
|
Description
Fixes #403
What this PR does / why we need it:
The
evaluate()pipeline for thePrivacyPreservingLLMalgorithm was previously populated entirely with stub methods returning hardcoded float values (e.g.,{'accuracy': 0.92}). This effectively bypassed the entire Ianvs benchmarking orchestration by evaluating zero ground labels and forcing the pipeline to output mathematically void "perfect" results regardless of what model/dataset was run.This PR implements genuine, dynamic evaluation mechanics:
_evaluate_utilitynow structurally parses the array of predictions versus actual true labels via Scikit-Learnmetricshandlers to properly rank the algorithm's actual usefulnesstime.time()latency trackers (edge, cloud, network) intopredict()that accumulate intoself.metrics2.3seconds response formatSpecial notes for reviewer
sklearn.metricsimport fails gracefully (returning0.0) in the event the returneddataarray does not conform to expected structural prediction schema dictionaries so that the larger test suite run does not immediately panicChecklist