ccao-data/model-sales-val

Model Overview

This repository contains code to identify and flag sales that may be non-arms-length transactions. A non-arms-length sale occurs when the buyer and seller have a relationship that influences the transaction price, leading to a sale that doesn't reflect the true market value of the property.

The sales validation model (hereafter referred to as "the model") uses simple statistics and heuristics to identify such sales. Each sale is assigned to a statistical group defined by its geographic area, property type, and a rolling time window. The model calculates the mean and standard deviation of log sale price and price per square foot within each group, then flags any sale that falls beyond a configured number of standard deviations from its group's mean. It also applies a variety of common heuristics, such as matching buyer and seller last names and checking foreclosure information.
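The group-level standard deviation check can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the column names (`sale_price`, `group`) are assumed, and the real model also checks price per square foot, uses rolling time windows, and enforces a minimum group size.

```python
import numpy as np
import pandas as pd


def flag_price_outliers(sales: pd.DataFrame, n_std: float = 2.0) -> pd.DataFrame:
    """Flag sales whose log price deviates more than `n_std` standard
    deviations from the mean of their statistical group."""
    out = sales.copy()
    out["log_price"] = np.log(out["sale_price"])
    grouped = out.groupby("group")["log_price"]
    # Per-group mean and (sample) standard deviation, broadcast back to rows
    group_mean = grouped.transform("mean")
    group_std = grouped.transform("std")
    # Signed z-score of each sale within its group
    out["price_deviation"] = (out["log_price"] - group_mean) / group_std
    out["sv_is_outlier"] = out["price_deviation"].abs() > n_std
    return out
```

Note that groups with a single sale produce a `NaN` standard deviation, so their sales are never flagged — the real model handles small groups explicitly via a minimum group threshold.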

Non-arms-length transactions can affect any process that uses sales data. As such, we currently use the output of this model, along with human analyst review information, to exclude flagged transactions from several downstream processes.

In the future, it is likely the flagging outputs from this model will be used further in public-facing ratio studies and reporting, as well as used internally by the office.

What Gets Flagged

Sales are flagged based on the defined training data window, consisting of the most recent X years used for model training. When required, earlier historical periods outside this window are also flagged to support feature engineering.

Ongoing sales are flagged on an ad-hoc basis as they are collected by the Department of Revenue and made available to the Data Department. See Model run modes for more information.

Commercial, industrial, and land-only property sales are not flagged by this model. Residential and condominium sales are flagged with the following specifications:

Residential

  • Includes classes:
    • 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 218, 219, 234, 278, 295
  • Excludes:
    • Multi-card sales

Condos

  • Includes classes:
    • 297, 299, 399
  • Excludes:
    • Parking spaces
    • Building common areas

Common exclusions for both residential and condos

  • Excludes any sales with a duplicate price within 365 days
  • Excludes sales less than $10,000
  • Excludes multi-PIN sales
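The shared exclusions above can be roughly sketched in pandas. The column names (`sale_price`, `sale_date`, `is_multisale`) are illustrative, and the duplicate-price logic here is a simplification of whatever the pipeline actually does:

```python
import pandas as pd


def apply_common_exclusions(sales: pd.DataFrame) -> pd.DataFrame:
    """Drop sales excluded from both residential and condo flagging."""
    out = sales[sales["sale_price"] >= 10_000]  # drop sales under $10,000
    out = out[~out["is_multisale"]]             # drop multi-PIN sales
    # Drop sales whose exact price repeats within 365 days: sort by price
    # and date, then compare each sale to its same-price neighbor.
    out = out.sort_values(["sale_price", "sale_date"])
    same_price = out["sale_price"].eq(out["sale_price"].shift())
    close_date = out["sale_date"].diff().dt.days.le(365)
    dup_next = same_price & close_date
    # Mark both members of each duplicate pair, not just the later one
    dup = dup_next | dup_next.shift(-1, fill_value=False)
    return out[~dup].sort_index()
```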

Outlier Reasons

We generate up to three outlier reasons for any given sale, stored in the columns sv_outlier_reason1, sv_outlier_reason2, and sv_outlier_reason3. A sale can have a non-null outlier reason and still not be classified as an outlier.

sv_is_outlier is the column which tells us whether or not a sale is an outlier. The sv_is_outlier column is True only if one of the price outlier reasons is assigned to the sale. These price reasons are generated by determining the number of standard deviations a property's sale price is away from the mean of similar properties. The other outlier reasons are purely supplementary information.
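In other words, sv_is_outlier reduces to a membership check against the price-based reasons. A minimal sketch — the exact set of reasons that count as "price reasons" here is an assumption based on the table below, not taken from the pipeline's code:

```python
# Assumed set of price-based reasons; only these can make a sale an outlier.
# All other reasons are supplementary context.
PRICE_REASONS = {
    "High price",
    "Low price",
    "High price per square foot",
    "Low price per square foot",
    "Raw price threshold",
}


def is_outlier(reasons: list[str]) -> bool:
    """A sale is an outlier only if at least one price reason was assigned."""
    return any(r in PRICE_REASONS for r in reasons)
```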

The following is a list of all current outlier reasons:

| Indicator | Description |
| --- | --- |
| High price | Sale price is a certain number of standard deviations above the mean of the sales in its group |
| Low price | Sale price is a certain number of standard deviations below the mean of the sales in its group |
| High price per square foot | Sale price per sqft is a certain number of standard deviations above the mean of the sales in its group |
| Low price per square foot | Sale price per sqft is a certain number of standard deviations below the mean of the sales in its group |
| Raw price threshold | Sale price is over a manually set threshold. Implemented to catch very expensive, non-representative homes |
| PTAX-203 exclusion | The PTAX-203 form is required by the Illinois Department of Revenue for most property transfers. Certain fields on this form are highly indicative of a non-arms-length transaction, e.g. Question 10 indicating a short sale |
| Short-term owner | The sale does not meet a given threshold for days since the prior transaction |
| Family sale | Last name match between buyer and seller |
| Non-person sale | A flagged keyword suggests the sale involves a non-person legal entity (industrial buyer, bank, real estate firm, construction company, etc.) |
| Statistical anomaly | Flagged by an unsupervised machine learning model (isolation forest) |
| Price swing / Home flip | Large swing away from the group mean combined with a short-term owner |

Model run modes

The model can be executed in three distinct run modes, depending on the state of the sales data and the specific requirements for flagging. These modes are configurable via the manual_update and manual_update_only_new_sales variables in our config file, src/inputs.yaml.

  1. Initial Flagging: This mode is triggered when no sales have been flagged. It's the first step in the model to instantiate tables and flag sales. It's also useful to use this mode for quick development testing, as it has the least overhead.
  2. Manual Update: This mode is used when sales need to be re-flagged, either due to errors or methodology updates. This allows for the selective re-flagging of sales. It also assigns flags to unflagged sales.
  3. Manual Update (New Sales Only): This mode borrows much of the same logic as the normal 'Manual Update' mode, but is used only to flag sales that do not have a current sales-val model determination. It will not re-flag any sales like the normal 'Manual Update' would.
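The two switches might look like this in src/inputs.yaml. This is a minimal sketch of just the relevant keys; the file's actual structure and surrounding keys may differ:

```yaml
# src/inputs.yaml (excerpt, illustrative)
manual_update: false                 # true: re-flag existing sales (Manual Update)
manual_update_only_new_sales: false  # true: flag only sales with no determination
```

With both set to false and no sales yet flagged, a run behaves as Initial Flagging.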
```mermaid
graph TB
    subgraph initial["Initial Run Mode"]
        direction LR
        A1{{"No sales are flagged"}}
        B1[Run the pipeline]
        C1[Flag sales as outliers or non-outliers<br>with Version = 1]
        D1[Save results to S3 with <br>unique run ID]
        E1[Join flags to<br>default.vw_pin_sale]

        A1 -->|Initial setup| B1
        B1 -->|Flag sales| C1
        C1 -->|Store flags| D1
        D1 -->|Persist results| E1
    end

    subgraph manual["Manual Update Mode"]
        direction LR
        A3{{"Sales must be re-flagged"}}
        B3{{"Set manual_update=True<br> in src/inputs.yaml"}}
        C3[Run pipeline]
        D3[Increment version if <br>sale already flagged]
        E3[Assign Version = 1 <br>if sale unflagged]
        F3[Save results to S3 with new run ID]
        G3[Update flags in <br>vw_pin_sale]

        A3 -->|Manual selection| B3
        B3 -->|Run update| C3
        C3 -->|Version check| D3
        D3 -->|Update process| F3
        C3 -->|New flag| E3
        E3 -->|Update process| F3
        F3 -->|Persist results| G3
    end

    subgraph new_only["Man. Update Only New Sales"]
        direction LR
        A4{{"Flag only new sales"}}
        B4[Set manual_update_only_new_sales<br> = True]
        C4[Run pipeline]
        D4[Identify sales with no current model determination]
        E4[Assign Version = 1<br> if sale unflagged]
        F4[Save results to S3 with new run ID]
        G4[Update flags in <br>default.vw_pin_sale]

        A4 -->|Filter new sales| B4
        B4 -->|Run pipeline| C4
        C4 -->|Run update| D4
        D4 -->|New flag only| E4
        E4 -->|Update process| F4
        F4 -->|Persist results| G4
    end

    initial ~~~ manual
    manual ~~~ new_only

    style A1 fill:#bbf,stroke:#333,stroke-width:2px,color:#000
    style A3 fill:#bbf,stroke:#333,stroke-width:2px,color:#000
    style A4 fill:#bbf,stroke:#333,stroke-width:2px,color:#000
```

Pipeline and DVC integration

Pipeline stages

The pipeline is split into three stages:

  • src/00_ingest.py - Queries the input data needed to run the pipeline
  • src/01_flag.py - Runs the flagging model and assigns the flags
  • src/02_upload.py - Grabs the outputs from 01_flag.py and uploads them to S3, making the flags available through Athena
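A dvc.yaml tying these stages together might look roughly like the following. This is an illustrative sketch only — the stage names besides `ingest` and all data paths are assumptions, not the repository's actual file:

```yaml
# dvc.yaml (illustrative sketch; real deps/outs will differ)
stages:
  ingest:
    cmd: python src/00_ingest.py
    deps:
      - src/00_ingest.py
    outs:
      - data/sales.parquet
  flag:
    cmd: python src/01_flag.py
    deps:
      - src/01_flag.py
      - data/sales.parquet
    outs:
      - data/flags.parquet
  upload:
    cmd: python src/02_upload.py
    deps:
      - src/02_upload.py
      - data/flags.parquet
```

Each stage declares its inputs (deps) and outputs (outs), which is what lets DVC skip stages whose inputs have not changed.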

DVC integration

This repository uses DVC in two ways:

  1. The input data is versioned, tracked, and stored using DVC. Previous input data sets are stored on S3 starting after the DVC PR landed in Nov 2025.
  2. DVC pipelines are used to sequentially run pipeline scripts and track/cache inputs and outputs.

To pull all the necessary input data based on the information in dvc.lock, run:

dvc pull

To run the entire pipeline (excluding the export stage), run:

dvc repro

Note that each stage runs only if necessary; e.g. the ingest stage will not run if none of its parameters (ins, outs, deps) have changed. To force a stage to re-run, run:

# Change ingest to any stage name
dvc repro -f ingest

To force the entire pipeline to re-run, run:

dvc repro -f

Developing the sales val pipeline and querying your development flags in Athena

Choose output target (in src/inputs.yaml)

output_environment: "dev" 
  • "prod": writes to production tables & S3 paths
  • "dev": writes to user-scoped dev tables & S3 paths

Your setup process depends on which environment you selected in src/inputs.yaml.

If using "prod"

No additional setup is required.

Once 02_upload.py writes outputs to S3, the data will automatically be available in Athena.


If using "dev"

If you've already completed the one-time setup for your development tables, the data will be queryable at z_dev_${USER}_sale.

Otherwise, one-time setup is required to make your data queryable in Athena.

In the development environment, Athena tables are not automatically created when data is written to S3. Instead, this is handled through an AWS Glue Crawler.

After running 02_upload.py:

Step 1 — Confirm data exists in the dev S3 bucket

Your outputs should now be written to your user-scoped development path.

Step 2 — Request a Glue Crawler

Because of how AWS Glue manages resources:

  • Each user needs their own crawler
  • Crawlers populate tables in the Athena database z_dev_${USER}_sale

You will need a dedicated Glue Crawler to register your dev tables in Athena. Reach out to a senior staff member for help creating this crawler; they can copy the configuration from an existing crawler with a name like z_dev_*-ccao-data-warehouse-dev-sale-crawler.

Step 3 — Run the Crawler

Once created, run the crawler to:

  • Scan your dev S3 output
  • Create/update tables in Athena

After the crawler completes, your development data will be queryable in Athena.
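If you prefer the command line to the console, the crawler can also be started and monitored with the AWS CLI. The crawler name below is illustrative — substitute the name your crawler was actually created with:

```shell
# Start the dev crawler and check its state (name is hypothetical)
aws glue start-crawler --name "z_dev_${USER}-ccao-data-warehouse-dev-sale-crawler"
aws glue get-crawler --name "z_dev_${USER}-ccao-data-warehouse-dev-sale-crawler" \
  --query "Crawler.State"
```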

Structure of the output tables

All flagging runs populate four Athena tables (flag, metadata, parameter, and group_mean) containing flag results, run metadata, and other information. These tables can be used to determine why an individual sale was flagged as an outlier. The structure of the tables is:

```mermaid
erDiagram
    flag }|--|| metadata : describes
    flag }|--|{ parameter : describes
    flag }|--|{ group_mean : describes

    flag {
        string meta_sale_document_num PK
        bigint meta_sale_price_original
        date rolling_window
        boolean sv_is_outlier
        boolean sv_is_ptax_outlier
        boolean ptax_flag_original
        boolean sv_is_heuristic_outlier
        string sv_outlier_reason1
        string sv_outlier_reason2
        string sv_outlier_reason3
        string group
        string run_id FK
        bigint version PK
        double sv_price_deviation
        double sv_price_per_sqft_deviation
    }

    metadata {
        string run_id PK
        string long_commit_sha
        string short_commit_sha
        string run_timestamp
        string run_type
        string run_note
        string dvc_md5_sales_ingest
    }

    parameter {
        string run_id PK
        bigint sales_flagged
        timestamp earliest_data_ingest
        timestamp latest_data_ingest
        string run_filter
        string iso_forest_cols
        string stat_groups
        string sales_to_write_filter
        bigint rolling_window
        string time_frame
        bigint short_term_owner_threshold
        bigint min_group_thresh
        string standard_deviation_bounds
        string housing_market_class_codes
        bigint raw_price_threshold
    }

    group_mean {
        string group PK
        double group_mean
        double group_std
        double group_sqft_std
        double group_sqft_mean
        bigint group_size
        string run_id PK
    }
```

Other key views

In addition to the Athena output tables, two warehouse views are commonly used when debugging/exploring the sales validation flags:

sale.vw_flag

Provides the current flag status of each sale.

The underlying sale.flag table stores historical versions of flags. This view resolves those into the most recent determination per sale (doc_no), which is the version joined into default.vw_pin_sale.
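The "most recent determination per sale" logic can be sketched in pandas using the flag table's columns, shown as an illustration of what the view does rather than its actual SQL:

```python
import pandas as pd


def latest_flags(flag: pd.DataFrame) -> pd.DataFrame:
    """Reduce a history of flag versions to the newest row per document
    number, mirroring what sale.vw_flag does on top of sale.flag."""
    return (
        flag.sort_values("version")
        .drop_duplicates("meta_sale_document_num", keep="last")
        .reset_index(drop=True)
    )
```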

sale.vw_flag_group

Provides context about the statistical group used in flagging.

Includes whether a sale’s group met the minimum observation threshold (meets_group_threshold) required for standard deviation-based outlier detection.

In some cases, a sale may still be flagged even if the threshold was not met — typically due to PTAX-203 indicators combined with extreme price deviation.

Exporting Flags to iasWorld

Use the scripts/export.py script to generate a CSV that can be uploaded to iasWorld to save new flags.

Example use:

python3 scripts/export.py > sales_val_flags.csv

The sales_val_flags.csv file can then be sent over for upload to iasWorld.

About

Heuristics for detecting outlier and non-arms-length sales
