feat: add low_information_session integrity flag to harden sparse-session scoring #19

Merged
Code-Alchemist101 merged 3 commits into main from copilot/audit-harden-short-session-behavior on Apr 16, 2026

Conversation

Contributor

Copilot AI commented Apr 16, 2026

  • Inspect services/analytics-py/tests/test_pipeline.py footer for duplicate unittest.main()
  • Remove extra unittest.main() call if present
  • Run targeted analytics test file
  • Commit and push footer cleanup
  • Reply to PR comment with commit hash and screenshot
Original prompt

Audit and harden short-session behavior in the scoring pipeline.

Sequence:

  1. Inspect the current actual scoring pipeline on main and analyze how very short sessions (sparse telemetry) affect the heuristic result, the trained-model result, and the active top-level result.
  2. Determine whether there is a current bias or default tendency toward Independent Solver labels for these short/sparse sessions. Is this expected (due to input sparseness), or an artifact of scoring logic that should be addressed?
  3. If a real risk or bias is found, propose and implement the SMALLEST safe mitigation, such as:
    • ex: clearer sparse-session signaling in the scoring output
    • minimum-information guardrail in label assignment
    • explicit reviewer warning for low-information sessions
  4. Implement only a minimal, grounded fix if needed, and add/update tests for short-session scenarios.
  5. Do not make up new product logic unless grounded in actual code. Prefer conservative mitigation over aggressive changes. If mitigation requires user notification, do not hide result—make uncertainty visible when needed.

Branch-only, PR into main, small, reviewable atomic patch only.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: @copilot Accepted Confirmation: Confirm agent session

Assistant: Work is now in progress to expose durable reviewer notes in the decision panel with a minimal, reviewer-focused addition.

You can track progress and review the upcoming branch/draft PR here:

data:
    - title: '[WIP] Add note field to reviewer decision flow'
      repository: Code-Alchemist101/managed-ai-assessment-platform
      url: https://github.com/copilot/tasks/pull/PR_kwDOSClbwM7TCTdM?session_id=2d6a141e-4625-4594-9ce4-1f9f44dd61d6
      state: queued

I will only touch reviewer-web and supporting API as necessary for end-to-end reviewer notes; existing decision logic will not be changed or expanded.

User: Replace the decorative 10-event timeline with something useful

Message: Work from the current main, but do not commit directly to main. Create a new dedicated branch for this task, implement only on that branch, and open a draft PR back into main.

I want a reviewer trust improvement PR focused on the session timeline.

Problem: The current reviewer timeline is capped at 10 events, which makes it mostly decorative and not useful for real session review.

Goal: Replace or improve this with the smallest safe reviewer-facing change that gives meaningfully better behavioral visibility.

Please do this in order:

  1. inspect the current actual timeline implementation on main
  2. confirm exactly where the 10-event cap is applied
  3. propose the smallest safe improvement, preferring one of:
    • paginated timeline
    • expandable full list
    • richer summary by source/event type
  4. explain which option is best for a small PR
  5. implement the smallest safe choice
  6. add/update tests if needed

Important:

  • do not redesign the whole reviewer app
  • prioritize usefulness over visual polish
  • keep performance reasonable for larger sessions

Assistant: I'm inspecting the actual main branch timeline code to locate the 10-event cap and will propose the smallest meaningful fix (likely expandable or paginated timeline), explain the best option for a minimal PR, and implement it on a dedicated branch with a draft PR into main. Test coverage and performance for larger sessions will be considered. You’ll get an explicit summary and branch PR link as soon as it’s ready—no redesign, just a focused visibility/productivity improvement.

User: @copilot Accepted Confirmation: Confirm agent session

Assistant: The inspection and improvement for the reviewer timeline event visibility (replacing the 10-event cap) is underway. I’ll propose and implement the smallest practical reviewer-focused timeline enhancement—without any main branch commits.

Track progress and access the draft PR for review here:

data:
    - title: '[WIP] Propose improvement for reviewer timeline event visibility'
      repository: Code-Alchemist101/managed-ai-assessment-platform
      url: https://github.com/copilot/tasks/pull/PR_kwDOSClbwM7TCehr?session_id=e843dc4f-1f3c-4e92-b56b-3a905f6b7fb5
      state: queued

You’ll get an explicit summary, rationale, and branch PR link when ready—focused on practical utility over visuals and keeping reviewer trust central.

User: Add reviewer guidance for dual scoring mode comparison

Message: Work from the current main, but do not commit directly to main. Create a new dedicated branch for this task, implement only on that branch, and open a draft PR back into main.

I want a small reviewer UX PR that makes dual scoring mode comparison understandable.

Problem: The reviewer panel now shows both heuristic and trained-model results, but there is little guidance about:

  • which one is the active policy result
  • what to do when they disagree
  • how the reviewer should interpret the comparison

Go...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Audit and harden short-session behavior in scoring pipeline" to "feat: add low_information_session integrity flag to harden sparse-session scoring" on Apr 16, 2026
Copilot AI requested a review from Code-Alchemist101 April 16, 2026 17:35
@Code-Alchemist101
Owner

@copilot This is directionally good, and the root-cause analysis looks real, but I don’t want to merge this version yet.

The current patch does more than add sparse-session visibility. It also changes integrity routing broadly by turning any session with fewer than 10 total events into review, which blocks auto-advance. That may be the right hardening choice, but it is a stronger product/policy change than “warning-only” mitigation.

Please refine this once before merge:

  1. keep the low_information_session flag and reviewer-facing note
  2. keep the tests documenting the Independent Solver sparse-session bias
  3. narrow or better-ground the hardening rule so it is less blunt than raw len(events) < 10 alone, unless you can justify that threshold clearly from the current code/tests
  4. remove any accidental duplicate test-runner footer if present in services/analytics-py/tests/test_pipeline.py

I want this PR to stay conservative and reviewable: make uncertainty visible, but avoid overly broad policy changes unless they are clearly grounded.

Contributor Author

Copilot AI commented Apr 16, 2026

@copilot This is directionally good, and the root-cause analysis looks real, but I don’t want to merge this version yet.

The current patch does more than add sparse-session visibili...

Refined in 688b531. low_information_session and reviewer note are kept, bias-documentation tests are kept, and the hardening rule is now narrower: it only flags the known sparse-bias signature (events < 10 + insert activity + no paste + no prompts), instead of all short sessions. I also verified services/analytics-py/tests/test_pipeline.py has a single if __name__ == "__main__": footer (no duplicate).
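
The narrowed rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in integrity.py: the function name and the insert, paste, and prompt event-type strings are hypothetical placeholders, and only the four conditions (fewer than 10 events, insert activity, no paste, no prompts) come from the comment.

```python
from typing import Any, Iterable, Mapping

# Hypothetical event-type names; the real pipeline may use different ones.
INSERT_TYPES = {"ide.document.insert"}
PASTE_TYPES = {"ide.document.paste"}
PROMPT_TYPES = {"ai.prompt.submitted"}
SPARSE_EVENT_THRESHOLD = 10  # "events < 10" from the PR discussion


def is_low_information_session(events: Iterable[Mapping[str, Any]]) -> bool:
    """Return True only for the sparse-bias signature, not for every short session."""
    event_types = [event.get("event_type") for event in events]
    short = len(event_types) < SPARSE_EVENT_THRESHOLD
    has_insert = any(t in INSERT_TYPES for t in event_types)
    has_paste = any(t in PASTE_TYPES for t in event_types)
    has_prompt = any(t in PROMPT_TYPES for t in event_types)
    return short and has_insert and not has_paste and not has_prompt


typing_only = [{"event_type": "session.started"}, {"event_type": "ide.document.insert"}]
heartbeat_only = [{"event_type": "session.started"}, {"event_type": "session.heartbeat"}]
print(is_low_information_session(typing_only))     # True
print(is_low_information_session(heartbeat_only))  # False
```

Under these assumptions, a short session with only heartbeats is not flagged, which is the difference from the earlier raw len(events) < 10 rule.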

@Code-Alchemist101
Owner

@copilot One last cleanup before merge:

In services/analytics-py/tests/test_pipeline.py, the file footer still shows a duplicated test runner call in the diff:

  • if __name__ == "__main__":
  • unittest.main()
  • unittest.main()

Please remove the extra unittest.main() line so the file ends with a single standard test-runner footer.

The narrowed low-information rule looks good otherwise.

Contributor Author

Copilot AI commented Apr 16, 2026

@copilot One last cleanup before merge:

In services/analytics-py/tests/test_pipeline.py, the file footer still shows a duplicated test runner call in the diff:
...

Checked this directly on HEAD and there is only a single footer call now (if __name__ == "__main__": + one unittest.main()), no duplicate remains. The current state is in 688b531; please refresh the diff view if GitHub is showing stale context.
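
One way to check the footer independently of the diff view is to count the unittest.main() calls in the file's source. The helper below is a hypothetical sketch (its name and the two sample strings are illustrative and not taken from the repository); it uses only the standard-library ast module.

```python
import ast


def count_unittest_main_calls(source: str) -> int:
    """Count `unittest.main(...)` call expressions in Python source text."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "main"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "unittest"
        ):
            count += 1
    return count


# Illustrative sources mirroring the two footers discussed in the thread.
DUPLICATED = '''import unittest

if __name__ == "__main__":
    unittest.main()
    unittest.main()
'''
CLEAN = '''import unittest

if __name__ == "__main__":
    unittest.main()
'''
print(count_unittest_main_calls(DUPLICATED))  # 2
print(count_unittest_main_calls(CLEAN))       # 1
```

To apply it to the real file, pass the file's text (for example from pathlib.Path(...).read_text()). With default arguments unittest.main() exits the interpreter when the run finishes, so a second call would normally never execute; removing it is cleanup, not a behavior change.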

Code-Alchemist101 marked this pull request as ready for review April 16, 2026 17:49
Copilot AI review requested due to automatic review settings April 16, 2026 17:49
Code-Alchemist101 merged commit df90815 into main Apr 16, 2026
1 check passed
Copilot AI left a comment

Pull request overview

Adds a reviewer-visible integrity signal for sparse/low-information sessions to reduce over-trust in archetype scoring when telemetry is too thin, and backs it with analytics + reviewer-web tests.

Changes:

  • Add low_information_session integrity flag when a known sparse-session signature is detected (short session + typing-only edits + no paste + no AI prompts).
  • Expose a reviewer-facing description for the new integrity flag in reviewer-web view-model labeling.
  • Add regression tests in analytics and reviewer-web to validate the flag behavior and UI label rendering.
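
How the flag might be attached to an integrity result can be sketched as below. The IntegrityResult shape, the "clean" starting verdict, the note wording, and the function name are assumptions made for illustration; only the flag name and the move to a review verdict come from this PR.

```python
from dataclasses import dataclass, field
from typing import List

LOW_INFORMATION_FLAG = "low_information_session"


@dataclass
class IntegrityResult:
    # Hypothetical shape; the real integrity output may carry more fields.
    verdict: str = "clean"
    flags: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def apply_low_information_flag(result: IntegrityResult, detected: bool) -> IntegrityResult:
    """Attach the flag and a reviewer note, and route the session to review."""
    if not detected or LOW_INFORMATION_FLAG in result.flags:
        return result  # nothing to add, or already flagged
    result.flags.append(LOW_INFORMATION_FLAG)
    result.notes.append(
        "Sparse telemetry: archetype scores for this session rest on very few events."
    )
    result.verdict = "review"  # scores are untouched; only the routing verdict changes
    return result


flagged = apply_low_information_flag(IntegrityResult(), detected=True)
print(flagged.verdict, flagged.flags)  # review ['low_information_session']
```

The sketch leaves any scoring output alone and changes only the verdict, in keeping with the stated goal of making uncertainty visible without hiding the result.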

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/web/view-models.test.ts | Adds a unit test for buildIntegrityFlagLabels() to ensure low_information_session renders a clear reviewer-facing label and unknown flags humanize correctly. |
| services/analytics-py/tests/test_pipeline.py | Extends pipeline tests with short-session fixtures and regression coverage for sparse-session integrity flagging and policy behavior. |
| services/analytics-py/assessment_analytics/integrity.py | Implements the low_information_session detection logic and attaches reviewer notes; ensures verdict becomes review when flagged. |
| apps/reviewer-web/src/view-model.ts | Adds a description string for low_information_session so the reviewer UI shows a clear explanation. |
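
The labeling behavior described for buildIntegrityFlagLabels() (a known description for low_information_session, a humanized fallback for unknown flags) can be illustrated with a small sketch. The real helper is TypeScript in reviewer-web; this Python version, its description text, its humanizing rule, and the clock_skew_detected flag are all assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical description text; the real string lives in apps/reviewer-web/src/view-model.ts.
FLAG_DESCRIPTIONS: Dict[str, str] = {
    "low_information_session": "Low-information session: too few events to score confidently",
}


def humanize_flag(flag: str) -> str:
    """Turn a snake_case flag into a readable fallback label."""
    return flag.replace("_", " ").strip().capitalize()


def build_integrity_flag_labels(flags: List[str]) -> List[str]:
    """Use a known description when available, otherwise humanize the raw flag."""
    return [FLAG_DESCRIPTIONS.get(flag, humanize_flag(flag)) for flag in flags]


# "clock_skew_detected" is a made-up flag standing in for any unknown flag.
labels = build_integrity_flag_labels(["low_information_session", "clock_skew_detected"])
print(labels[1])  # Clock skew detected
```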

Comment on lines +372 to +406
base_ts = "2026-04-14T06:00:{:02d}Z"
events = [
    {
        "event_id": "desktop-1",
        "session_id": "short-session-1",
        "timestamp_utc": base_ts.format(0),
        "source": "desktop",
        "event_type": "session.started",
        "sequence_no": 1,
        "artifact_ref": "session",
        "payload": {"status": "active"},
        "client_version": "0.1.0",
        "integrity_hash": "h1",
        "policy_context": {},
    },
    {
        "event_id": "desktop-2",
        "session_id": "short-session-1",
        "timestamp_utc": base_ts.format(5),
        "source": "desktop",
        "event_type": "session.heartbeat",
        "sequence_no": 2,
        "artifact_ref": "session",
        "payload": {"status": "active"},
        "client_version": "0.1.0",
        "integrity_hash": "h2",
        "policy_context": {},
    },
]
for index in range(event_count - 2):
    events.append({
        "event_id": f"ide-{index + 1}",
        "session_id": "short-session-1",
        "timestamp_utc": base_ts.format(10 + index * 5),
        "source": "ide",
Code-Alchemist101 deleted the copilot/audit-harden-short-session-behavior branch April 17, 2026 04:57
3 participants