[PULL REQUEST] New actual/implied hhp balancing methodology#208

Open
Eric-Liu-SANDAG wants to merge 2 commits into main from actual-implied-hhp-balancing

Conversation

@Eric-Liu-SANDAG
Contributor

Describe this pull request. What changes are being made?

New actual/implied hhp balancing methodology. This change was made mostly to improve runtime.

What issues does this pull request address?

Additional context

See the linked issue for old and new timings.

Contributor

Copilot AI left a comment

Pull request overview

Introduces a new methodology for balancing actual vs. implied household population (HHP) in the Household Characteristics module, aiming to improve runtime performance while keeping MGRA household-size distributions consistent with MGRA-level HHP controls.

Changes:

  • Refactors MGRA HHP alignment from a deterministic stepwise shifting loop to a weighted-random adjustment routine applied per MGRA row.
  • Adds post-adjustment validation to ensure implied min/max HHP aligns with MGRA hhp_total, raising an error on failure.
  • Reshapes the adjusted wide household-size table back into the long format output via melt.
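
The validation and reshape steps listed above can be sketched in plain Python. This is a minimal illustration, not the PR's code: the names `implied_hhp_bounds`, `validate_row`, and `melt_wide` are hypothetical, the nested-dict layout stands in for the wide table, the list comprehension stands in for the `melt` call, and the open-ended "7+" category is assumed to have no upper cap (the real module may cap it).

```python
import math

SIZES = [1, 2, 3, 4, 5, 6, 7]  # 7 stands for the open-ended "7+" category


def implied_hhp_bounds(counts):
    """Return the (min, max) household population implied by counts by size."""
    # Every household of size s holds at least s people
    implied_min = sum(size * counts.get(size, 0) for size in SIZES)
    # Assumption: "7+" is uncapped, so any 7+ household makes the maximum unbounded
    implied_max = math.inf if counts.get(7, 0) > 0 else implied_min
    return implied_min, implied_max


def validate_row(mgra, counts, hhp_total):
    """Raise if hhp_total falls outside the implied min/max for this MGRA."""
    low, high = implied_hhp_bounds(counts)
    if not low <= hhp_total <= high:
        raise ValueError(
            f"MGRA {mgra}: hhp_total={hhp_total} outside implied range [{low}, {high}]"
        )


def melt_wide(wide):
    """Turn {mgra: {size: households}} into long rows, one per MGRA and size."""
    return [
        {
            "mgra": mgra,
            "household_size": "7+" if size == 7 else str(size),
            "value": counts.get(size, 0),
        }
        for mgra, counts in sorted(wide.items())
        for size in SIZES
    ]


wide = {101: {1: 2, 2: 1, 7: 1}, 102: {1: 3}}
validate_row(101, wide[101], 12)  # passes: 12 lies within [11, inf]
long_rows = melt_wide(wide)       # 14 rows: 2 MGRAs x 7 size categories
```

Running the check after the adjustment, and before the reshape, means a failing MGRA is reported with its own identifier rather than surfacing later as a mismatch in the long output.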

@Eric-Liu-SANDAG
Contributor Author

Runtime is now approximately 14 minutes per year, even including the employment module.

SELECT *, [end_date] - [start_date]
FROM [EstimatesProgram].[metadata].[run]
WHERE [run_id] = 187

@Eric-Liu-SANDAG
Contributor Author

Just need to do some output comparisons between the old and new methodologies before this PR is ready.

@Eric-Liu-SANDAG
Contributor Author

The following dynamic SQL query (dynamic sql my beloved 😍) compares 2024 Estimates ([run_id]=82) and the new methodology test run ([run_id]=187):

DECLARE @base_run_id NVARCHAR(MAX) = '82';
DECLARE @other_run_id NVARCHAR(MAX) = '187';
DECLARE @year NVARCHAR(MAX) = '2020';
DECLARE @group_geo NVARCHAR(MAX) = 'jurisdiction';

DECLARE @query NVARCHAR(MAX) = '
WITH [base] AS (
    SELECT 
        [run_id],
        [year],
        [' + @group_geo + '],
        [metric],
        SUM([value]) AS [' + @base_run_id + '_value]
    FROM [EstimatesProgram].[outputs].[hh_characteristics]
    INNER JOIN [demographic_warehouse].[dim].[mgra]
        ON [hh_characteristics].[mgra] = [mgra].[mgra]
        AND [series] = 15
    INNER JOIN [demographic_warehouse].[dim].[mgra_xref]
        ON [mgra].[mgra_id] = [mgra_xref].[mgra_id]
        AND [xref_year] = 9999
    WHERE [run_id] = ' + @base_run_id + '
        AND [year] = ' + @year + '
        AND [metric] LIKE ''%Household Size%''
    GROUP BY [run_id], [year], [' + @group_geo + '], [metric]
),
[other] AS (
    SELECT 
        [run_id],
        [year],
        [' + @group_geo + '],
        [metric],
        SUM([value]) AS [' + @other_run_id + '_value]
    FROM [EstimatesProgram].[outputs].[hh_characteristics]
    INNER JOIN [demographic_warehouse].[dim].[mgra]
        ON [hh_characteristics].[mgra] = [mgra].[mgra]
        AND [series] = 15
    INNER JOIN [demographic_warehouse].[dim].[mgra_xref]
        ON [mgra].[mgra_id] = [mgra_xref].[mgra_id]
        AND [xref_year] = 9999
    WHERE [run_id] = ' + @other_run_id + '
        AND [year] = ' + @year + '
        AND [metric] LIKE ''%Household Size%''
    GROUP BY [run_id], [year], [' + @group_geo + '], [metric]
)

SELECT 
    [base].[year],
    [base].[' + @group_geo + '],
    [base].[metric],
    [' + @base_run_id + '_value],
    [' + @other_run_id + '_value]
FROM [base]
INNER JOIN [other]
    ON [base].[year] = [other].[year]
    AND [base].[' + @group_geo + '] = [other].[' + @group_geo + ']
    AND [base].[metric] = [other].[metric]
ORDER BY [base].[year], [base].[' + @group_geo + '], [base].[metric]
'
EXEC sp_executesql @query;

@Eric-Liu-SANDAG
Contributor Author

I think the changes are for the better, but I still need to compare with the ACS. I think they are better because the old methodology always shifted households in the same order, starting at 1-->7+ or 7+-->1. For the most part the adjustments were increases, which is why in run 82 the data is much lower in HHS1 and mostly higher in HHS2+, especially in 7+.

The new methodology uses the same technique as the 1D integerizer, a weighted random shift, which I think makes the output of run 187 less skewed. But again, I think the ACS will be the final determining factor here: whether we match it better or worse with the new methodology.
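
To make the idea concrete, here is a minimal sketch of what a weighted random shift could look like for a single MGRA. It is an illustration under stated assumptions, not the code in this PR or in the 1D integerizer: it assumes one household moves to an adjacent size per step, that the size category to move from is drawn with probability proportional to its household count, and that "7+" is uncapped. The names `balance_row` and `implied_min` are hypothetical.

```python
import random

SIZES = [1, 2, 3, 4, 5, 6, 7]  # 7 stands for the open-ended "7+" category


def implied_min(counts):
    """Smallest household population consistent with the counts by size."""
    return sum(size * counts[size] for size in SIZES)


def balance_row(counts, hhp_total, rng):
    """Shift households between adjacent sizes until hhp_total is attainable."""
    counts = {size: counts.get(size, 0) for size in SIZES}
    households = sum(counts.values())
    # Every household holds at least one person; no households means no people
    if hhp_total < households or (households == 0 and hhp_total > 0):
        raise ValueError("hhp_total cannot be matched by any size distribution")

    # Too many implied people: move one household down a size per step
    while implied_min(counts) > hhp_total:
        donors = [s for s in SIZES if s > 1 and counts[s] > 0]
        size = rng.choices(donors, weights=[counts[d] for d in donors])[0]
        counts[size] -= 1
        counts[size - 1] += 1

    # Too few implied people and no uncapped 7+ household to absorb the rest:
    # move one household up a size per step
    while counts[7] == 0 and implied_min(counts) < hhp_total:
        donors = [s for s in SIZES if s < 7 and counts[s] > 0]
        size = rng.choices(donors, weights=[counts[d] for d in donors])[0]
        counts[size] -= 1
        counts[size + 1] += 1

    return counts


rng = random.Random(0)  # seeded so a run is repeatable
adjusted = balance_row({1: 10, 2: 5, 3: 2}, hhp_total=30, rng=rng)
```

Because the category is drawn in proportion to its household count, the adjustments are spread across the distribution instead of always draining the same end of it, and the total number of households stays fixed.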

@Eric-Liu-SANDAG
Contributor Author

Actually, I'm not even sure if the ACS is the best final check, as all this processing in the first place is to correct a known error in ACS data... But we'll see

Development

Successfully merging this pull request may close these issues.

[FEATURE] Speed up the Household Characteristics module

2 participants