[PULL REQUEST] New actual/implied hhp balancing methodology #208
Eric-Liu-SANDAG wants to merge 2 commits into main
Pull request overview
Introduces a new methodology for balancing actual vs. implied household population (HHP) in the Household Characteristics module, aiming to improve runtime performance while keeping MGRA household-size distributions consistent with MGRA-level HHP controls.
Changes:
- Refactors MGRA HHP alignment from a deterministic stepwise shifting loop to a weighted-random adjustment routine applied per MGRA row.
- Adds post-adjustment validation to ensure implied min/max HHP aligns with MGRA `hhp_total`, raising an error on failure.
- Reshapes the adjusted wide household-size table back into the long format output via `melt`.
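As a rough illustration of the weighted-random adjustment and the post-adjustment check described above, here is a minimal Python sketch for a single MGRA row. It is not the code in this PR: the function names, the cap of 11 persons for the 7+ category (`MAX_7PLUS`), and the choice to weight each category by its current household count are all assumptions made for the example.

```python
import random

SIZES = [1, 2, 3, 4, 5, 6, 7]  # index 6 stands for the open-ended "7+" category
MAX_7PLUS = 11  # ASSUMPTION: hypothetical cap on persons in a 7+ household


def implied_bounds(counts):
    """Return the (min, max) household population implied by household counts."""
    low = sum(size * n for size, n in zip(SIZES, counts))
    high = low + (MAX_7PLUS - SIZES[-1]) * counts[-1]
    return low, high


def balance_row(counts, hhp_total, rng):
    """Shift households one size category at a time until hhp_total fits."""
    counts = list(counts)
    while True:
        low, high = implied_bounds(counts)
        if low <= hhp_total <= high:
            return counts
        # Implied minimum above hhp_total: move a household down a size;
        # implied maximum below hhp_total: move a household up a size.
        step = -1 if hhp_total < low else 1
        donors = [
            i for i, n in enumerate(counts)
            if n > 0 and 0 <= i + step < len(counts)
        ]
        if not donors:
            raise ValueError("hhp_total cannot be reached by shifting households")
        # ASSUMPTION: weight each category by how many households it holds
        i = rng.choices(donors, weights=[counts[d] for d in donors])[0]
        counts[i] -= 1
        counts[i + step] += 1


def validate_row(counts, hhp_total):
    """Post-adjustment check: raise if hhp_total is outside the implied range."""
    low, high = implied_bounds(counts)
    if not low <= hhp_total <= high:
        raise ValueError(
            f"hhp_total {hhp_total} outside implied range [{low}, {high}]"
        )


rng = random.Random(0)  # fixed seed so the example is reproducible
adjusted = balance_row([10, 0, 0, 0, 0, 0, 0], hhp_total=25, rng=rng)
validate_row(adjusted, 25)
```

Each shift changes the implied minimum by exactly one person, so in this sketch the loop stops at the first state where `hhp_total` falls inside the implied range, and the total number of households in the row never changes.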
|
Runtime is now approximately 14 minutes per year, even including the employment module.
SELECT *, [end_date] - [start_date]
FROM [EstimatesProgram].[metadata].[run]
WHERE [run_id] = 187
|
I just need to do some output comparisons between the old and new methodologies before this PR is ready.
|
The following dynamic SQL query (dynamic sql my beloved 😍) compares 2024 Estimates (
-- Run IDs, year, and grouping geography are spliced into the query text below
DECLARE @base_run_id NVARCHAR(MAX) = '82';
DECLARE @other_run_id NVARCHAR(MAX) = '187';
DECLARE @year NVARCHAR(MAX) = '2020';
DECLARE @group_geo NVARCHAR(MAX) = 'jurisdiction';
DECLARE @query NVARCHAR(MAX) = '
    WITH [base] AS (
        SELECT
            [run_id],
            [year],
            [' + @group_geo + '],
            [metric],
            SUM([value]) AS [' + @base_run_id + '_value]
        FROM [EstimatesProgram].[outputs].[hh_characteristics]
        INNER JOIN [demographic_warehouse].[dim].[mgra]
            ON [hh_characteristics].[mgra] = [mgra].[mgra]
            AND [series] = 15
        INNER JOIN [demographic_warehouse].[dim].[mgra_xref]
            ON [mgra].[mgra_id] = [mgra_xref].[mgra_id]
            AND [xref_year] = 9999
        WHERE [run_id] = ' + @base_run_id + '
            AND [year] = ' + @year + '
            AND [metric] LIKE ''%Household Size%''
        GROUP BY [run_id], [year], [' + @group_geo + '], [metric]
    ),
    [other] AS (
        SELECT
            [run_id],
            [year],
            [' + @group_geo + '],
            [metric],
            SUM([value]) AS [' + @other_run_id + '_value]
        FROM [EstimatesProgram].[outputs].[hh_characteristics]
        INNER JOIN [demographic_warehouse].[dim].[mgra]
            ON [hh_characteristics].[mgra] = [mgra].[mgra]
            AND [series] = 15
        INNER JOIN [demographic_warehouse].[dim].[mgra_xref]
            ON [mgra].[mgra_id] = [mgra_xref].[mgra_id]
            AND [xref_year] = 9999
        WHERE [run_id] = ' + @other_run_id + '
            AND [year] = ' + @year + '
            AND [metric] LIKE ''%Household Size%''
        GROUP BY [run_id], [year], [' + @group_geo + '], [metric]
    )
    SELECT
        [base].[year],
        [base].[' + @group_geo + '],
        [base].[metric],
        [' + @base_run_id + '_value],
        [' + @other_run_id + '_value]
    FROM [base]
    INNER JOIN [other]
        ON [base].[year] = [other].[year]
        AND [base].[' + @group_geo + '] = [other].[' + @group_geo + ']
        AND [base].[metric] = [other].[metric]
    ORDER BY [base].[year], [base].[' + @group_geo + '], [base].[metric]
'
EXEC sp_executesql @query;
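
For anyone who wants to check the comparison outside SQL Server, the same aggregate-and-join logic can be sketched in plain Python. This is only an illustration: the row layout `(geography, metric, value)`, the function names, the metric labels, the sample values, and the extra difference column are assumptions and are not part of the query above.

```python
from collections import defaultdict


def aggregate(rows):
    """Sum values by (geography, metric), mirroring the GROUP BY in each CTE."""
    totals = defaultdict(int)
    for geo, metric, value in rows:
        totals[(geo, metric)] += value
    return totals


def compare_runs(base_rows, other_rows):
    """Inner-join two aggregated runs; the last column (other - base) is an addition."""
    base, other = aggregate(base_rows), aggregate(other_rows)
    shared = sorted(base.keys() & other.keys())  # keys present in both runs
    return [
        (geo, metric, base[geo, metric], other[geo, metric],
         other[geo, metric] - base[geo, metric])
        for geo, metric in shared
    ]


# Made-up rows for illustration only; these are not values from either run
base_rows = [
    ("A", "Household Size - 1", 10),
    ("A", "Household Size - 1", 5),
    ("B", "Household Size - 2", 7),
]
other_rows = [
    ("A", "Household Size - 1", 12),
    ("C", "Household Size - 2", 3),
]
comparison = compare_runs(base_rows, other_rows)
```

As with the `INNER JOIN` in the query, a geography and metric pair that appears in only one run is dropped from the result.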
|
I think the changes are for the better, but I still need to compare with the ACS. I think they are better because the old methodology always shifted households starting at 1-->7+ or 7+-->1. For the most part, the changes were increases, which is why in
The new methodology uses the same technique as the 1D integerizer, a weighted random shifting, which I think makes the output of
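
To make the ordering concern concrete, here is a small Python sketch of one possible reading of "always shift households starting at 1-->7+". It is an assumption about the old behavior, not code from the repository: the function name and the toy counts are hypothetical, and the 7+ category is treated as exactly 7 persons to keep the arithmetic simple.

```python
SIZES = [1, 2, 3, 4, 5, 6, 7]  # index 6 stands for "7+", treated here as exactly 7


def shift_up_in_order(counts, target_hhp):
    """Raise implied HHP by always moving a household out of the smallest occupied size."""
    counts = list(counts)
    implied = sum(size * n for size, n in zip(SIZES, counts))
    while implied < target_hhp:
        # Deterministic order: scan sizes 1 through 6 and take the first occupied one
        donor = next((i for i in range(len(counts) - 1) if counts[i] > 0), None)
        if donor is None:
            raise ValueError("no household can be shifted to a larger size")
        counts[donor] -= 1
        counts[donor + 1] += 1
        implied += 1  # moving up one size adds exactly one implied person
    return counts


# Toy row: 5 households each of sizes 1, 2, and 3 imply 30 persons; target is 33
result = shift_up_in_order([5, 5, 5, 0, 0, 0, 0], target_hhp=33)
print(result)  # [2, 8, 5, 0, 0, 0, 0]
```

In this toy row all three shifts come out of the 1-person category, whereas a weighted-random choice would spread them across the occupied sizes according to the weights.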
|
Actually, I'm not even sure if the ACS is the best final check, as all this processing in the first place is to correct a known error in ACS data... But we'll see |
Describe this pull request. What changes are being made?
New actual/implied HHP balancing methodology. This change was made mostly to improve runtime.
What issues does this pull request address?
Additional context
See the issue for old and new timing