- Model Overview
- What Gets Flagged
- Outlier Reasons
- Model Run Modes
- Pipeline and DVC Integration
- Developing the Sales Val Pipeline and Querying Your Development Flags in Athena
- Structure of the Output Tables
- Other Key Views
- Exporting Flags to iasWorld
This repository contains code to identify and flag sales that may be non-arms-length transactions. A non-arms-length sale occurs when the buyer and seller have a relationship that influences the transaction price, leading to a sale that doesn't reflect the true market value of the property.
The sales validation model (hereafter referred to as "the model") uses simple statistics and heuristics to identify such sales. Each sale is assigned to a statistical group defined by its geographic area, property type, and a rolling time window. The model calculates the mean and standard deviation of log sale price and price per square foot within each group, then flags any sale that falls beyond a configured number of standard deviations from its group's mean. It also uses a variety of common heuristics, such as matching last names, foreclosure information, etc.
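The group-statistics step above can be sketched in a few lines. This is an illustrative sketch only, not the model's actual implementation: the field names (`id`, `group`, `price`) and the threshold value are assumptions.

```python
import math
import statistics

THRESHOLD = 2.0  # assumed number of standard deviations; the real value is configured


def flag_price_outliers(sales, threshold=THRESHOLD):
    """sales: list of dicts with 'id', 'group', and 'price' keys.
    Returns a mapping of sale id -> 'High price', 'Low price', or None."""
    # Collect log sale prices per statistical group
    by_group = {}
    for s in sales:
        by_group.setdefault(s["group"], []).append(math.log(s["price"]))

    # Mean and sample standard deviation of log price per group
    stats = {
        g: (statistics.mean(v), statistics.stdev(v))
        for g, v in by_group.items()
        if len(v) > 1
    }

    flags = {}
    for s in sales:
        mean, std = stats.get(s["group"], (None, None))
        if mean is None or std == 0:
            flags[s["id"]] = None  # group too small to compute a deviation
            continue
        z = (math.log(s["price"]) - mean) / std
        if z > threshold:
            flags[s["id"]] = "High price"
        elif z < -threshold:
            flags[s["id"]] = "Low price"
        else:
            flags[s["id"]] = None
    return flags
```

The same computation is repeated for price per square foot; the real model also restricts each group to a rolling time window.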
Non-arms-length transactions can affect any process that uses sales data. As such, we currently use the output of this model, along with human analyst review information, to exclude flagged transactions from:
- The training data of our valuation models
- Sales ratio statistics reports produced by our valuation models
In the future, it is likely the flagging outputs from this model will be used further in public-facing ratio studies and reporting, as well as used internally by the office.
Sales are flagged based on the defined training data window, consisting of the most recent X years used for model training. When required, earlier historical periods outside this window are also flagged to support feature engineering.
Ongoing sales are flagged on an ad-hoc basis as they are collected by the Department of Revenue and made available to the Data Department. See Model run modes for more information.
Commercial, industrial, and land-only property sales are not flagged by this model. Residential and condominium sales are flagged with the following specifications:
Residential
- Includes classes:
- 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 218, 219, 234, 278, 295
- Excludes:
- Multi-card sales
Condos
- Includes classes:
- 297, 299, 399
- Excludes:
- Parking spaces
- Building common areas
Common exclusions for both residential and condos
- Excludes any sales with a duplicate price within 365 days
- Excludes sales less than $10,000
- Excludes multi-PIN sales
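The common exclusions can be expressed as a simple predicate. This is a hedged sketch: the field names (`price`, `sale_date`, `is_multi_pin`) are placeholders, not the pipeline's actual column names.

```python
from datetime import date


def passes_common_exclusions(sale, all_sales):
    """Return True if a sale survives the common residential/condo exclusions."""
    # Exclude sales under $10,000
    if sale["price"] < 10_000:
        return False
    # Exclude multi-PIN sales
    if sale.get("is_multi_pin"):
        return False
    # Exclude sales with a duplicate price within 365 days of another sale
    for other in all_sales:
        if other is sale:
            continue
        if (
            other["price"] == sale["price"]
            and abs((other["sale_date"] - sale["sale_date"]).days) <= 365
        ):
            return False
    return True
```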
We generate up to three outlier reasons for any given sale, stored in the columns `sv_outlier_reason1`, `sv_outlier_reason2`, and `sv_outlier_reason3`.

A sale can have a non-null outlier reason column and still not be classified as an outlier. The `sv_is_outlier` column is what determines whether a sale is an outlier: it is `True` only if one of the price outlier reasons is assigned to the sale. These price reasons are generated by determining the number of standard deviations a property's sale price falls from the mean of similar properties. The other outlier reasons are purely supplementary information.
The following is a list of all current outlier reasons:
| Indicator | Description |
|---|---|
| High price | Sale price is a certain number of standard deviations above the mean of the sales in its group |
| Low price | Sale price is a certain number of standard deviations below the mean of the sales in its group |
| High price per square foot | Sale price per sqft is a certain number of standard deviations above the mean of the sales in its group |
| Low price per square foot | Sale price per sqft is a certain number of standard deviations below the mean of the sales in its group |
| Raw price threshold | Sale price is over a manually set threshold. Implemented to catch very expensive, non-representative homes |
| PTAX-203 exclusion | The PTAX-203 form is required by the Illinois Department of Revenue for most property transfers. Certain fields on this form are highly indicative of a non-arms-length transaction, e.g. Question 10 indicating a short sale. |
| Short-term owner | The sale does not meet a given threshold for days since prior transaction |
| Family sale | Last name match between buyer and seller |
| Non-person sale | Flagged keyword suggests the sale involves a non-person legal entity (industrial buyer, bank, real estate firm, construction, etc.). |
| Statistical Anomaly | Flagged via an unsupervised machine learning model (isolation forest). |
| Price swing / Home flip | Large swing away from mean + short-term owner |
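The rule that only price reasons set `sv_is_outlier` can be sketched as below. The exact reason strings, and the assumption that only the four standard-deviation reasons count as "price" reasons, are illustrative rather than taken from the pipeline's code.

```python
# Assumed set of price-based reasons; the real model's strings may differ.
PRICE_REASONS = {
    "High price",
    "Low price",
    "High price per square foot",
    "Low price per square foot",
}


def is_outlier(reasons):
    """reasons: up to three sv_outlier_reason values (entries may be None)."""
    return any(r in PRICE_REASONS for r in reasons if r is not None)
```

A sale flagged only with, say, `Family sale` would therefore carry a reason but have `sv_is_outlier = False`.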
The model can be executed in three distinct run modes, depending on the state of the sales data and the specific requirements for flagging. These modes are configurable through the `manual_update` and `manual_update_only_new_sales` variables in our config file, `src/inputs.yaml`.
- Initial Flagging: This mode is triggered when no sales have been flagged. It's the first step in the model to instantiate tables and flag sales. It's also useful to use this mode for quick development testing, as it has the least overhead.
- Manual Update: This mode is used when sales need to be re-flagged, either due to errors or methodology updates. This allows for the selective re-flagging of sales. It also assigns flags to unflagged sales.
- Manual Update (New Sales Only): This mode borrows much of the same logic as the normal 'Manual Update' mode, but is used only to flag sales that do not have a current sales-val model determination. It will not re-flag any sales like the normal 'Manual Update' would.
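A minimal sketch of how the two config flags could map to the three run modes; the precedence shown (the only-new-sales flag checked first) is an assumption, not necessarily the pipeline's actual logic.

```python
def resolve_run_mode(manual_update: bool, manual_update_only_new_sales: bool) -> str:
    """Map the two src/inputs.yaml flags to one of the three run modes."""
    if manual_update_only_new_sales:
        return "manual_update_new_sales_only"
    if manual_update:
        return "manual_update"
    return "initial_flagging"
```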
```mermaid
graph TB
subgraph initial["Initial Run Mode"]
direction LR
A1{{"No sales are flagged"}}
B1[Run the pipeline]
C1[Flag sales as outliers or non-outliers<br>with Version = 1]
D1[Save results to S3 with <br>unique run ID]
E1[Join flags to<br>default.vw_pin_sale]
A1 -->|Initial setup| B1
B1 -->|Flag sales| C1
C1 -->|Store flags| D1
D1 -->|Persist results| E1
end
subgraph manual["Manual Update Mode"]
direction LR
A3{{"Sales must be re-flagged"}}
B3{{"Set manual_update=True<br> in src/inputs.yaml"}}
C3[Run pipeline]
D3[Increment version if <br>sale already flagged]
E3[Assign Version = 1 <br>if sale unflagged]
F3[Save results to S3 with new run ID]
G3[Update flags in <br>vw_pin_sale]
A3 -->|Manual selection| B3
B3 -->|Run update| C3
C3 -->|Version check| D3
D3 -->|Update process| F3
C3 -->|New flag| E3
E3 -->|Update process| F3
F3 -->|Persist results| G3
end
subgraph new_only["Man. Update Only New Sales"]
direction LR
A4{{"Flag only new sales"}}
B4[Set manual_update_only_new_sales<br> = True]
C4[Run pipeline]
D4[Identify sales with no current model determination]
E4[Assign Version = 1<br> if sale unflagged]
F4[Save results to S3 with new run ID]
G4[Update flags in <br>default.vw_pin_sale]
A4 -->|Filter new sales| B4
B4 -->|Run pipeline| C4
C4 -->|Run update| D4
D4 -->|New flag only| E4
E4 -->|Update process| F4
F4 -->|Persist results| G4
end
initial ~~~ manual
manual ~~~ new_only
style A1 fill:#bbf,stroke:#333,stroke-width:2px,color:#000
style A3 fill:#bbf,stroke:#333,stroke-width:2px,color:#000
style A4 fill:#bbf,stroke:#333,stroke-width:2px,color:#000
```
The pipeline is split up into 3 stages:
- `src/00_ingest.py` - Queries the input data needed to run the pipeline
- `src/01_flag.py` - Runs the flagging model and assigns the flags
- `src/02_upload.py` - Grabs the outputs from `01_flag.py` and uploads them to S3, making the flags available through Athena
This repository uses DVC in 2 ways:
- The input data is versioned, tracked, and stored using DVC. Previous input data sets are stored on S3 starting after the DVC PR landed in Nov 2025.
- DVC pipelines are used to sequentially run pipeline scripts and track/cache inputs and outputs.
To pull all the necessary input data based on the information in
dvc.lock, run:
```bash
dvc pull
```

To run the entire pipeline (excluding the export stage), run:

```bash
dvc repro
```

Note that each stage will run only if necessary, i.e. the ingest stage will not run if no parameters (`ins`, `outs`, `deps`) related to that stage have changed. To force a stage to re-run, run:

```bash
# Change ingest to any stage name
dvc repro -f ingest
```

To force the entire pipeline to re-run, run:

```bash
dvc repro -f
```

The `output_environment` setting in `src/inputs.yaml` controls where outputs are written:

```yaml
output_environment: "dev"
```

- `"prod"`: writes to production tables & S3 paths
- `"dev"`: writes to user-scoped dev tables & S3 paths
Your setup process depends on which environment you selected in src/inputs.yaml.
No additional setup is required.
Once 02_upload.py writes outputs to S3, the data will automatically be available in Athena.
If you've already completed the one-time setup for your development tables, the data will be queryable at `z_dev_${USER}_sale`.
Otherwise, one-time setup is required to make your data queryable in Athena.
In the development environment, Athena tables are not automatically created when data is written to S3. Instead, this is handled through an AWS Glue Crawler.
After running `02_upload.py`, your outputs should be written to your user-scoped development path.
Because of how AWS Glue manages resources:
- Each user needs their own crawler
- Crawlers populate tables in the Athena database `z_dev_${USER}_sale`
You will need a dedicated Glue Crawler to register your dev tables in Athena.
Reach out to a senior staff member for assistance in the creation of this
crawler. They can copy existing crawler configs from an existing crawler with
a name like z_dev_*-ccao-data-warehouse-dev-sale-crawler.
Once created, run the crawler to:
- Scan your dev S3 output
- Create/update tables in Athena
After the crawler completes, your development data will be queryable in Athena.
All flagging runs populate 3 Athena tables with metadata, flag results, and other information. These tables can be used to determine why an individual sale was flagged as an outlier. The structure of the tables is:
```mermaid
erDiagram
flag }|--|| metadata : describes
flag }|--|{ parameter : describes
flag }|--|{ group_mean : describes
flag {
string meta_sale_document_num PK
bigint meta_sale_price_original
date rolling_window
boolean sv_is_outlier
boolean sv_is_ptax_outlier
boolean ptax_flag_original
boolean sv_is_heuristic_outlier
string sv_outlier_reason1
string sv_outlier_reason2
string sv_outlier_reason3
string group
string run_id FK
bigint version PK
double sv_price_deviation
double sv_price_per_sqft_deviation
}
metadata {
string run_id PK
string long_commit_sha
string short_commit_sha
string run_timestamp
string run_type
string run_note
string dvc_md5_sales_ingest
}
parameter {
string run_id PK
bigint sales_flagged
timestamp earliest_data_ingest
timestamp latest_data_ingest
string run_filter
string iso_forest_cols
string stat_groups
string sales_to_write_filter
bigint rolling_window
string time_frame
bigint short_term_owner_threshold
bigint min_group_thresh
string standard_deviation_bounds
string housing_market_class_codes
bigint raw_price_threshold
}
group_mean {
string group PK
double group_mean
double group_std
double group_sqft_std
double group_sqft_mean
bigint group_size
string run_id PK
}
```
In addition to the Athena output tables, two warehouse views are commonly used when debugging/exploring the sales validation flags:
`sale.vw_flag`
Provides the current flag status of each sale.
The underlying sale.flag table stores historical versions of flags. This view resolves those into the most recent determination per sale (doc_no). This is the version that is held in default.vw_pin_sale.
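The resolution performed by the view can be sketched as "keep the highest-version row per document number." This is a hedged illustration of the behavior described above, not the view's actual SQL; the field names mirror the `flag` table.

```python
def latest_flags(rows):
    """rows: dicts with at least 'meta_sale_document_num' and 'version'.
    Returns doc_no -> most recent flag row."""
    latest = {}
    for row in rows:
        doc = row["meta_sale_document_num"]
        if doc not in latest or row["version"] > latest[doc]["version"]:
            latest[doc] = row
    return latest
```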
`sale.vw_flag_group`
Provides context about the statistical group used in flagging.
Includes whether a sale’s group met the minimum observation threshold (meets_group_threshold) required for standard deviation-based outlier detection.
In some cases, a sale may still be flagged even if the threshold was not met, typically due to PTAX-203 indicators combined with extreme price deviation.
Use the scripts/export.py script to generate a CSV that can be uploaded to iasWorld to save new flags.
Example use:
```bash
python3 scripts/export.py > sales_val_flags.csv
```
The sales_val_flags.csv file can then be sent over for upload to iasWorld.