br_inep_educacao_especial#1179
Conversation
@laribritto Is this PR going to be merged, or will it be cancelled? I'm cleaning up the open PRs.
It will be merged; I'll ask @aspeddro to review it.
@laribritto this pull request has conflicts 😩
aspeddro
left a comment
This PR doesn't include the SQL files to push to prod.
You should add them to the PR.
aspeddro
left a comment
All good!!
Before merging, update the temporal coverage in the backend for the tables you updated: https://backend.basedosdados.org/admin/v1/dataset/f8ab4a9d-7457-4f5f-8a50-9eec334e9abe/change/?_changelist_filters=q%3Despecial#general-tab
📝 Walkthrough
This PR refactors the INEP special education data pipelines by converting two Jupyter notebooks to Python scripts, adding two new Python ETL scripts, and making minor formatting changes (blank lines).
Changes: Data Model & ETL Pipeline Refactoring
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
The PR involves substantial new code (four Python ETL scripts totaling ~800 lines) with parallel logic patterns across scripts, making repetitive validation easier, offset by the need to verify data filtering logic, schema transformations, and BigQuery integration steps across multiple files.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 3 failed
❌ Failed checks (3 warnings)
✅ Passed checks (2)
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (4)
models/br_inep_educacao_especial/code/educacao_especial_uf_taxa_rendimento.py (1)

16-23: 💤 Low value — Function `read_sheet` is defined but never used.
This function is defined but never called in the script. The script uses `excel_data.parse()` directly instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/br_inep_educacao_especial/code/educacao_especial_uf_taxa_rendimento.py` around lines 16-23: the helper function read_sheet is defined but unused; replace direct calls to excel_data.parse(...) with this helper (or remove the helper if you prefer direct use). Locate the read_sheet definition and callers that currently use excel_data.parse (search for excel_data.parse or pd.ExcelFile.parse) and update those call sites to call read_sheet(excel_data, sheet_name=<name>, skiprows=<n>) so the utility is used consistently, or alternatively delete the read_sheet function and its import if you decide to keep using excel_data.parse everywhere.

models/br_inep_educacao_especial/code/educacao_especial_brasil_distorcao_idade_serie.py (1)

16-21: 💤 Low value — Function `read_sheet` is defined but never used.
This function is defined but never called. The script uses `excel_data.parse()` directly instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/br_inep_educacao_especial/code/educacao_especial_brasil_distorcao_idade_serie.py` around lines 16-21: the helper function read_sheet is defined but never used; replace direct calls to excel_data.parse(...) with read_sheet(sheet_name, skiprows) (or remove read_sheet if you prefer to keep using excel_data.parse) so the helper is utilized. Search for usages of excel_data.parse and update them to call read_sheet(sheet_name, skiprows=...), ensuring the same file path and skiprows behavior is preserved.

models/br_inep_educacao_especial/code/educacao_especial_uf_distorcao_idade_serie.py (1)

16-21: 💤 Low value — Function `read_sheet` is defined but never used.
This function is defined but never called in the script. The script uses `excel_data.parse()` directly instead. Consider removing the unused function or utilizing it for consistency with other scripts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/br_inep_educacao_especial/code/educacao_especial_uf_distorcao_idade_serie.py` around lines 16-21: the helper function read_sheet(sheet_name: str, skiprows: int = 3) is defined but never used; either delete this dead function or switch the existing excel_data.parse(...) calls to use read_sheet for consistency. Locate usages of excel_data.parse in this script and replace them with calls to read_sheet(sheet_name, skiprows) (or adjust read_sheet to accept a file/path parameter if you prefer calling it with a dynamic path), or if you choose removal simply delete the read_sheet definition and any related imports to avoid unused-code warnings.

models/br_inep_educacao_especial/code/educacao_especial_brasil_taxa_rendimento.py (1)

16-23: 💤 Low value — Function `read_sheet` is defined but never used.
This function is defined but never called in the script. The script uses `excel_data.parse()` directly instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/br_inep_educacao_especial/code/educacao_especial_brasil_taxa_rendimento.py` around lines 16-23: the helper function read_sheet(df: pd.ExcelFile, sheet_name: str, skiprows: int) is defined but never used; either remove this unused function or update the code to use it instead of direct excel_data.parse() calls. If you choose to use it, replace occurrences of excel_data.parse(sheet_name=..., skiprows=...) with read_sheet(excel_data, sheet_name=..., skiprows=...), making sure the argument types match (pd.ExcelFile for the first param) and adjusting any call sites accordingly; if you delete it, remove the read_sheet definition to avoid dead code.
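All four nitpicks describe the same pattern. Below is a minimal sketch of the suggested refactor; it uses a stand-in object instead of a real `pd.ExcelFile` so it runs without the INEP spreadsheets, and the sheet name and columns are illustrative assumptions, not values from the repo.

```python
import pandas as pd

def read_sheet(excel_data, sheet_name: str, skiprows: int = 3) -> pd.DataFrame:
    # Thin wrapper so every script skips the header rows the same way,
    # instead of each call site invoking excel_data.parse(...) directly.
    return excel_data.parse(sheet_name=sheet_name, skiprows=skiprows)

class FakeExcel:
    # Stand-in for pd.ExcelFile; returns a tiny illustrative frame.
    def parse(self, sheet_name, skiprows=0):
        return pd.DataFrame({"ano": [2022, 2023], "tdi": [12.3, 11.8]})

# Before: df = excel_data.parse(sheet_name="BRASIL", skiprows=3)
# After, routed through the helper:
df = read_sheet(FakeExcel(), sheet_name="BRASIL", skiprows=3)
```

The alternative the prompts mention, simply deleting `read_sheet`, is equally valid; the point is to avoid keeping dead code.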
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8e2c176f-313e-47eb-b5e6-41d989d60481
📒 Files selected for processing (12)
- models/br_inep_educacao_especial/br_inep_educacao_especial__brasil_distorcao_idade_serie.sql
- models/br_inep_educacao_especial/br_inep_educacao_especial__brasil_taxa_rendimento.sql
- models/br_inep_educacao_especial/br_inep_educacao_especial__uf_distorcao_idade_serie.sql
- models/br_inep_educacao_especial/br_inep_educacao_especial__uf_taxa_rendimento.sql
- models/br_inep_educacao_especial/code/educacao_especial_brasil_distorcao_idade_serie.ipynb
- models/br_inep_educacao_especial/code/educacao_especial_brasil_distorcao_idade_serie.py
- models/br_inep_educacao_especial/code/educacao_especial_brasil_taxa_rendimento.ipynb
- models/br_inep_educacao_especial/code/educacao_especial_brasil_taxa_rendimento.py
- models/br_inep_educacao_especial/code/educacao_especial_uf_distorcao_idade_serie.ipynb
- models/br_inep_educacao_especial/code/educacao_especial_uf_distorcao_idade_serie.py
- models/br_inep_educacao_especial/code/educacao_especial_uf_taxa_rendimento.ipynb
- models/br_inep_educacao_especial/code/educacao_especial_uf_taxa_rendimento.py
💤 Files with no reviewable changes (2)
- models/br_inep_educacao_especial/code/educacao_especial_brasil_taxa_rendimento.ipynb
- models/br_inep_educacao_especial/code/educacao_especial_uf_taxa_rendimento.ipynb
```python
melted_dataframe["etapa_ensino"] = melted_dataframe["metrica"].apply(
    lambda v: v.split("_")[-1]
)  # Extracts 'anosiniciais', 'anosfinais', or 'ensinomedio'
melted_dataframe["tipo_metrica"] = melted_dataframe["metrica"].apply(
    lambda v: v.split("_")[0]
)  # Extracts 'tdi'
melted_dataframe["tdi"] = pd.to_numeric(
    melted_dataframe["tdi"], errors="coerce"
)
```
The etapa_ensino extraction logic will not work as intended.
Same issue as in educacao_especial_uf_distorcao_idade_serie.py: After renaming, the metric values are full Portuguese labels like "Ensino Fundamental – Anos Iniciais" without underscores. The split("_") operations will return the entire string for both etapa_ensino and tipo_metrica, causing the pivot to produce unexpected results.
🔧 Suggested fix
Since the metric column already contains the education stage name, assign it directly:

```diff
-melted_dataframe["etapa_ensino"] = melted_dataframe["metrica"].apply(
-    lambda v: v.split("_")[-1]
-)  # Extracts 'anosiniciais', 'anosfinais', or 'ensinomedio'
-melted_dataframe["tipo_metrica"] = melted_dataframe["metrica"].apply(
-    lambda v: v.split("_")[0]
-)  # Extracts 'tdi'
+melted_dataframe["etapa_ensino"] = melted_dataframe["metrica"]
```

Then remove or adjust the pivot_table operation, since the data structure no longer requires pivoting by tipo_metrica.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@models/br_inep_educacao_especial/code/educacao_especial_brasil_distorcao_idade_serie.py`
around lines 111 - 119, The extraction using split("_") on
melted_dataframe["metrica"] is wrong because metrica now holds full Portuguese
labels (e.g., "Ensino Fundamental – Anos Iniciais"); instead set
melted_dataframe["etapa_ensino"] directly from melted_dataframe["metrica"] (no
split) and stop deriving tipo_metrica from underscores—either drop tipo_metrica
or set it to a fixed identifier (e.g., "tdi") as appropriate, then update or
remove the pivot_table call that expected tipo_metrica as a separate key so the
pivot operates on the actual tdi numeric column (melted_dataframe["tdi"]) and
produces the correct shape.
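The failure mode is easy to reproduce in isolation. A small sketch, with the metric labels copied from the review and made-up numeric values, showing that `split("_")` is a no-op on the renamed labels and that direct assignment is the fix:

```python
import pandas as pd

melted = pd.DataFrame({
    "metrica": ["Ensino Fundamental – Anos Iniciais", "Ensino Médio Regular"],
    "tdi": [10.5, 7.2],
})

# The renamed labels contain no underscores, so both split("_") variants
# return the whole label unchanged — tipo_metrica never becomes "tdi".
broken_tipo = melted["metrica"].apply(lambda v: v.split("_")[0])
assert (broken_tipo == melted["metrica"]).all()

# Suggested fix: the label already is the education stage.
melted["etapa_ensino"] = melted["metrica"]
```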
```python
melted_dataframe["etapa_ensino"] = melted_dataframe["metrica"].apply(
    lambda v: v.split("_")[-1]
)  # Extracts 'anosiniciais', 'anosfinais', or 'ensinomedio'
melted_dataframe["tipo_metrica"] = melted_dataframe["metrica"].apply(
    lambda v: v.split("_")[0]
)  # Extracts 'tdi'
melted_dataframe["tdi"] = pd.to_numeric(
    melted_dataframe["tdi"], errors="coerce"
)
```
The etapa_ensino extraction logic will not work as intended.
After the RENAME_COLUMNS mapping, the metric column values are full Portuguese labels like "Ensino Fundamental – Anos Iniciais", "Ensino Fundamental – Anos Finais", and "Ensino Médio Regular". These strings do not contain underscores, so v.split("_")[-1] will return the entire string unchanged, and v.split("_")[0] will also return the entire string.
This means etapa_ensino will contain the full label (which may be acceptable) but tipo_metrica will also contain the full label rather than just "tdi", causing the pivot to produce unexpected column names.
🔧 Suggested fix: Use the original column names in melt, or adjust the extraction logic
Either melt before renaming columns, or directly assign the metrica values to etapa_ensino, since they already represent the education stage:

```diff
-melted_dataframe["etapa_ensino"] = melted_dataframe["metrica"].apply(
-    lambda v: v.split("_")[-1]
-)  # Extracts 'anosiniciais', 'anosfinais', or 'ensinomedio'
-melted_dataframe["tipo_metrica"] = melted_dataframe["metrica"].apply(
-    lambda v: v.split("_")[0]
-)  # Extracts 'tdi'
+melted_dataframe["etapa_ensino"] = melted_dataframe["metrica"]
```

Then remove the pivot_table operation, since the data is already in the correct format with tdi as the value column.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@models/br_inep_educacao_especial/code/educacao_especial_uf_distorcao_idade_serie.py`
around lines 106 - 114, The current extraction of etapa_ensino and tipo_metrica
from melted_dataframe["metrica"] uses underscore splitting but metrica has been
renamed to full Portuguese labels (via RENAME_COLUMNS), so split("_") returns
the whole label and tipo_metrica will be wrong; fix by either performing the
melt operation before applying RENAME_COLUMNS so the original metric keys (that
contain "tdi_*") are available for splitting, or change the extraction to map
the full labels to etapa_ensino directly and set tipo_metrica = "tdi"
explicitly; update any downstream use (e.g., the pivot_table call that expects
tipo_metrica == "tdi") to rely on the corrected fields.
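The prompt's first alternative, melting before RENAME_COLUMNS is applied, can be sketched as follows. The wide column names such as `tdi_anosiniciais` are inferred from the review text, not taken from the repo:

```python
import pandas as pd

# Wide frame as it would look BEFORE applying RENAME_COLUMNS, while the
# underscore-delimited metric keys are still present.
raw = pd.DataFrame({
    "ano": [2023],
    "tdi_anosiniciais": [10.5],
    "tdi_anosfinais": [9.1],
    "tdi_ensinomedio": [7.2],
})

# Melt first, so the split-based extraction works as originally intended.
melted = raw.melt(id_vars=["ano"], var_name="metrica", value_name="tdi")
melted["etapa_ensino"] = melted["metrica"].str.split("_").str[-1]
melted["tipo_metrica"] = melted["metrica"].str.split("_").str[0]
```

With this ordering, tipo_metrica is uniformly "tdi" and the downstream pivot behaves as the original author expected.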
```python
path = os.path.join(OUTPUT, "educacao_especial_brasil_distorcao_idade_serie")
# Create the directory if it does not already exist
os.makedirs(path, exist_ok=True)
```
Output path incorrectly references "brasil" instead of "uf".
This script processes UF-level data, but the output directory is named "educacao_especial_brasil_distorcao_idade_serie". This should be "educacao_especial_uf_distorcao_idade_serie" for consistency with the script's purpose and the target table name.
🐛 Proposed fix

```diff
-path = os.path.join(OUTPUT, "educacao_especial_brasil_distorcao_idade_serie")
+path = os.path.join(OUTPUT, "educacao_especial_uf_distorcao_idade_serie")
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
path = os.path.join(OUTPUT, "educacao_especial_uf_distorcao_idade_serie")
# Create the directory if it does not already exist
os.makedirs(path, exist_ok=True)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@models/br_inep_educacao_especial/code/educacao_especial_uf_distorcao_idade_serie.py`
around lines 141 - 143, The output directory string currently uses
"educacao_especial_brasil_distorcao_idade_serie" which is incorrect for this
UF-level script; update the directory name used when building path (the line
assigning path using os.path.join(OUTPUT, ...)) to
"educacao_especial_uf_distorcao_idade_serie" and keep os.makedirs(path,
exist_ok=True) as-is so the correct UF directory is created; verify any other
references in this module that reference the old "brasil" name and update them
to the "uf" variant to remain consistent with the target table.
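The corrected path logic itself is a one-line change. A runnable sketch, with a temporary directory standing in for the script's real `OUTPUT` constant:

```python
import os
import tempfile

OUTPUT = tempfile.mkdtemp()  # stand-in for the script's real output root

path = os.path.join(OUTPUT, "educacao_especial_uf_distorcao_idade_serie")
# Create the directory if it does not already exist
os.makedirs(path, exist_ok=True)
```

`exist_ok=True` keeps re-runs of the ETL script idempotent, so the fix only needs to touch the directory-name literal.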
```python
path = os.path.join(OUTPUT, "educacao_especial_brasil_taxa_rendimento")
# Create the directory if it does not already exist
os.makedirs(path, exist_ok=True)
```
Output path incorrectly references "brasil" instead of "uf".
This script processes UF-level data, but the output directory is named "educacao_especial_brasil_taxa_rendimento". This should be "educacao_especial_uf_taxa_rendimento" for consistency with the script's purpose and the target table name.
🐛 Proposed fix

```diff
-path = os.path.join(OUTPUT, "educacao_especial_brasil_taxa_rendimento")
+path = os.path.join(OUTPUT, "educacao_especial_uf_taxa_rendimento")
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
path = os.path.join(OUTPUT, "educacao_especial_uf_taxa_rendimento")
# Create the directory if it does not already exist
os.makedirs(path, exist_ok=True)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@models/br_inep_educacao_especial/code/educacao_especial_uf_taxa_rendimento.py`
around lines 172 - 174, The output directory name is incorrect: the path
variable is set to os.path.join(OUTPUT,
"educacao_especial_brasil_taxa_rendimento") and then created with os.makedirs;
change that string to "educacao_especial_uf_taxa_rendimento" so the path
reflects UF-level processing (update the literal in the assignment to path and
keep the os.makedirs(path, exist_ok=True) call unchanged).
Summary by CodeRabbit
- Refactor
- Style