- Prior Models
- Model Overview
- Ongoing Issues
- FAQs
- Usage
- License
- Contributing
This repository contains code, data, and documentation for the Cook County Assessor’s residential reassessment model. Information about prior year models can be found at the following links:
Year(s) | Triad(s) | Method | Language / Framework | Link |
---|---|---|---|---|
2009 - 2017 | All | Linear regression per township | SPSS | Link |
2018 | City | Linear regression per township | N/A | Not available. Values provided by vendor |
2019 | North | Linear regression or GBM model per township | R (Base) | Link |
2020 | South | Linear regression or GBM model per township | R (Base) | Link |
2021 | City | County-wide LightGBM model | R (Tidyverse / Tidymodels) | Link |
2022 | North | County-wide LightGBM model | R (Tidyverse / Tidymodels) | Link |
2023 | South | County-wide LightGBM model | R (Tidyverse / Tidymodels) | Link |
2024 | City | County-wide LightGBM model | R (Tidyverse / Tidymodels) | Link |
The duty of the Cook County Assessor’s Office is to value property in a fair, accurate, and transparent way. The Assessor is committed to transparency throughout the assessment process. As such, this document contains:
- A brief walkthrough of the overall process used to generate residential assessed values
- Rationale for different modeling, feature, and code decisions that affect assessed values
- An outline of ongoing data quality issues that affect assessed values
- Instructions to replicate our valuation process and results
The repository itself contains the code for the Automated Valuation Model (AVM) used to generate initial assessed values for single- and multi-family residential properties in Cook County. This system is effectively an advanced machine learning model (hereafter referred to as “the model”). It uses previous sales to generate estimated sale values (assessments) for all properties.
The ultimate goal of the model is to answer the question, “What would the sale price of every Cook County home be if it had sold last year?”
To answer this question, the model estimates the sale price (fair market value) of unsold properties using the known sale price of similar and nearby properties. Training the model involves iteratively updating a mathematical function to recognize patterns in sales data, which includes both property characteristics (such as square footage, number of bedrooms, etc.) and additional factors such as location, environmental variables (flood risk, noise), and market trends.
The full residential modeling pipeline - from raw data to final values - consists of 8 stages. Visually, the pipeline looks approximately like the flowchart below.
graph LR
aws[("AWS")]
ingest("Ingest")
train("Train")
assess("Assess")
evaluate("Evaluate")
interpret("Interpret")
finalize("Finalize")
upload("Upload")
export("Export")
ingest --> train
train --> assess
train --> interpret
assess --> evaluate
evaluate --> finalize
interpret --> finalize
finalize --> upload
finalize --> export
upload --> aws
aws --> ingest
aws --> export
All inputs and outputs are stored on AWS S3 using a unique run identifier. Each stage in the modeling pipeline corresponds to an individual R script. These scripts can be run independently (as a stand-alone script) or as part of the overall pipeline (with DVC) as long as the dependencies for the stage exist.
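As a rough illustration of what "as long as the dependencies for the stage exist" means, the stage dependencies in the flowchart can be written down as a small graph. This is a Python sketch only (the pipeline itself is R scripts orchestrated by DVC); the `STAGE_DEPS` mapping is transcribed from the flowchart with the AWS storage node left out so the graph is acyclic, and the helper names `run_order` and `can_run` are hypothetical.

```python
from graphlib import TopologicalSorter

# Stage -> set of stages it directly depends on, transcribed from the
# flowchart. The AWS storage node is omitted (assumption) to keep the
# graph acyclic.
STAGE_DEPS = {
    "ingest": set(),
    "train": {"ingest"},
    "assess": {"train"},
    "interpret": {"train"},
    "evaluate": {"assess"},
    "finalize": {"evaluate", "interpret"},
    "upload": {"finalize"},
    "export": {"finalize"},
}


def run_order(deps):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def can_run(stage, completed, deps=STAGE_DEPS):
    """True if every direct dependency of `stage` has already produced output."""
    return deps[stage] <= set(completed)


order = run_order(STAGE_DEPS)
```

Here `can_run("finalize", ["evaluate", "interpret"])` is true, while `can_run("assess", ["ingest"])` is false because the train stage has not produced a model yet.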
⚠️ NOTE: For a full technical breakdown of each stage, including dependencies, outputs, parameters, and more, see dvc.yaml
- Ingest: Pull prepared data from the CCAO’s Athena database. This data is divided into 2 primary datasets, one for training and one for assessment. NOTE: This stage is only run as-needed, since the input data does not change for each model run.
- Train: Train the model using sales data. This involves splitting the input data into train/test sets and performing cross-validation to determine the optimal set of hyperparameters. The primary output of this stage is a trained model object.
- Assess: Use the trained model to estimate values for all residential properties. Values are adjusted if necessary and then aggregated to the PIN level. The primary output of this stage is a data frame of PIN-level assessed values.
- Evaluate: Measure the performance of the model using the held-out test set and an assessor-specific ratio study method. Performance statistics include standard machine learning metrics (RMSE, MAE, MAPE) as well as assessor-specific metrics (COD, PRD, PRB, MKI). This stage calculates metrics for different levels of geography with (and without) property class breakouts. The primary output of this stage is a data frame of aggregate performance statistics.
- Interpret: Calculate three major explanatory outputs:
  - SHAP values for all the estimated values from the assess stage. These are the per-feature contributions to the predicted value for an individual observation (usually a single PIN)
  - Aggregate feature importance for the entire model, using the built-in LightGBM method
  - An experimental set of comparable property sales, based loosely on the method described in this vignette
- Finalize: Save run timings and metadata. Render the following Quarto documents:
  - An overall model report detailing model performance, effects, and quality control tests
  - For PINs of interest, individual PIN-level reports detailing the characteristics, SHAP values, and results for a given PIN
- Upload: Upload all output objects to AWS (S3). All model outputs for every model run are stored in perpetuity in S3. Each run’s performance can be visualized using the CCAO’s internal Tableau dashboards. NOTE: This stage is only run internally, since it requires access to the CCAO Data AWS account.
- Export: Export assessed values to Desk Review spreadsheets for Valuations, as well as a delimited text format for upload to the system of record (iasWorld). NOTE: This stage is only run when a final model is selected. It is not run automatically or as part of the main pipeline.
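To make two of the assessor-specific metrics from the evaluate stage concrete, here is a minimal Python sketch of COD and PRD. It assumes the standard ratio study definitions (COD as the average absolute deviation from the median ratio, expressed as a percentage of the median; PRD as the mean ratio divided by the sale-weighted mean ratio). The pipeline's own R implementation may differ, for example in how outliers are handled, and the function names here are illustrative.

```python
from statistics import mean, median


def sales_ratios(assessed, sale_prices):
    """Ratio of estimated value to sale price for each sold property."""
    return [a / s for a, s in zip(assessed, sale_prices)]


def cod(ratios):
    """Coefficient of dispersion: average absolute deviation from the
    median ratio, as a percentage of the median ratio."""
    med = median(ratios)
    return 100 * mean(abs(r - med) for r in ratios) / med


def prd(assessed, sale_prices):
    """Price-related differential: mean ratio divided by the
    sale-weighted mean ratio (total assessed / total sale price)."""
    ratios = sales_ratios(assessed, sale_prices)
    weighted_mean = sum(assessed) / sum(sale_prices)
    return mean(ratios) / weighted_mean


# Made-up values for illustration only
example_assessed = [90_000, 100_000, 110_000]
example_sales = [100_000, 100_000, 100_000]
example_cod = cod(sales_ratios(example_assessed, example_sales))
example_prd = prd(example_assessed, example_sales)
```

Lower COD indicates more uniform ratios, and a PRD above 1 suggests that lower-priced properties are assessed at higher ratios than higher-priced ones.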
Despite its reputation as an easy-to-use panacea, machine learning actually involves a number of choices and trade-offs which are not always transparent or well-justified. Seemingly minor decisions by algorithm creators and data scientists can introduce systemic bias into results.
To counter this, we’ve listed the major choices we’ve made about our modeling process below, as well as the rationale behind each decision. We feel strongly that these choices lead to optimal results given the trade-offs involved, but we’re absolutely open to suggestions and criticism.
We use LightGBM for our primary valuation model. LightGBM is a GBDT (gradient-boosting decision tree) framework created and maintained by Microsoft. It has an excellent R API and has been around since 2016.
We tried a number of other model types and frameworks, including regularized linear models, XGBoost, CatBoost, random forest, shallow neural networks, and support vector machines. We even tried ensemble methods such as model stacking. We chose LightGBM because it has the right mix of trade-offs for our needs. Specifically, LightGBM is:
- Well-documented. The docs contain good explanations of LightGBM’s features and useful troubleshooting sections.
- Highly accurate. It consistently beat other methods in accuracy, as measured by RMSE (root mean squared error) using a test set.
- Extremely fast. It trained faster than other model types by a nearly 2:1 margin using our data (CPU training only).
- Capable of natively handling categorical features. This is extremely important as a large amount of our property data is categorical (type of roof, neighborhood, etc.). Other methods, such as XGBoost, require feature transformation such as one-hot encoding to use categorical data.
- Widely used in housing-specific machine learning models and competitions.
- Simpler to use and implement than ensemble methods or neural networks, which can involve lots of fiddling and configuration.
- Easy to diagnose problems with, as it has built-in feature importance and contribution methods.
The downsides of LightGBM are that it is:
- Relatively difficult to explain compared to simpler models such as linear regression.
- Not particularly well-integrated into Tidymodels, the R framework we use for machine learning. See Framework Selection.
- Painful to train, since it has a large number of hyperparameters.
- Prone to over-fitting if not trained carefully, unlike other methods such as random forest.
For a more in-depth report on the performance and accuracy trade-offs between LightGBM and XGBoost specific to our use case, please see our Model Benchmark repository.
We use Tidymodels as our primary machine-learning framework. Tidymodels is a set of R packages that work well together and with the Tidyverse. These packages abstract away complicated machine-learning logic and allow us to focus on improving our data and models.
Additionally, Tidymodels is:
- Well-documented. There are resources for quickly learning the Tidymodels approach as well as complete documentation for each Tidymodels package.
- Under very active development. Developers are quick to respond to issues and feature requests.
- Quick to teach, since a lot of complicated code is abstracted away.
- Extensible. The API allows for easy integration of additional model types. See Lightsnip.
- Verbose. It tends to warn you about common machine-learning footguns and has excellent error handling and messages.
Some downsides to Tidymodels are that it is:
- Relatively new. While its API is mature, there are still bugs in core packages.
- Under active development. Packages and features change fairly quickly, so we need to constantly update code to stay current.
We’ve created a custom R package called Lightsnip to better integrate LightGBM with Tidymodels and unlock some of its more advanced features, including:
- Early stopping, which reduces training time by stopping based on a holdout validation set
- Additional hyperparameters, particularly those related to categorical features
- The ability to link certain hyperparameters which typically move in tandem, such as `num_leaves` and `max_depth`
Lightsnip also ensures that the link between the model engine (LightGBM) and the model framework (Tidymodels) is stable. It lets us quickly respond to any upstream changes while maintaining the set of features we need.
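The idea behind linking `num_leaves` and `max_depth` can be sketched as follows. This Python snippet is illustrative only and assumes the link takes the form `max_depth = floor(log2(num_leaves)) + add_to_linked_depth`; the exact rule used by Lightsnip should be checked against its documentation. The function name is hypothetical.

```python
def linked_max_depth(num_leaves, add_to_linked_depth):
    """Derive max_depth from num_leaves (assumed linking rule).

    A binary tree of depth d holds at most 2**d leaves, so
    floor(log2(num_leaves)) is the shallowest depth that can still hold
    roughly that many leaves. The offset adds extra room for
    unbalanced, leaf-wise tree growth.
    """
    if num_leaves < 2:
        raise ValueError("num_leaves must be at least 2")
    # bit_length() - 1 is an exact integer floor(log2(n)) for n >= 1
    floor_log2 = num_leaves.bit_length() - 1
    return floor_log2 + add_to_linked_depth


shallow = linked_max_depth(32, 1)
deep = linked_max_depth(2048, 7)
```

Tuning a single offset instead of a free `max_depth` keeps the search away from combinations where the depth limit makes the requested number of leaves unreachable.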
Hyperparameters define the structure and trade-offs of models. They must be well-specified in order for a model to be accurate and useful. LightGBM has a large number of tunable parameters, but we tune only a small proportion, including:
LightGBM Parameter | CV Search Range | Parameter Description |
---|---|---|
num_iterations | 100 - 2500 | Total number of trees/iterations. Final value is dependent on CV and early stopping. |
learning_rate | 0.001 - 0.398 | Speed of training per iteration. Higher usually means faster convergence, but possibly higher overall error. |
max_bin | 50 - 512 | Maximum number of bins used to bucket continuous features |
num_leaves | 32 - 2048 | Maximum number of leaves in each tree. Main parameter to control model complexity. |
add_to_linked_depth | 1 - 7 | Amount to add to max_depth if linked to num_leaves. See max_depth. |
feature_fraction | 0.3 - 0.7 | The random subset of features selected for a tree, as a percentage. |
min_gain_to_split | 0.001 - 10000 | The minimum gain needed to create a split. |
min_data_in_leaf | 2 - 400 | The minimum data in a single tree leaf. Important to prevent over-fitting. |
max_cat_threshold | 10 - 250 | Maximum number of split points for categorical features |
min_data_per_group | 2 - 400 | Minimum number of observations per categorical group |
cat_smooth | 10 - 200 | Categorical smoothing. Used to reduce noise. |
cat_l2 | 0.001 - 100 | Categorical-specific L2 regularization |
lambda_l1 | 0.001 - 100 | L1 regularization |
lambda_l2 | 0.001 - 100 | L2 regularization |
These parameters are tuned using Bayesian hyperparameter optimization, which iteratively searches the parameter space based on the previous parameter tuning results. We use Bayesian tuning instead of grid search or random search because it trains faster and results in nearly identical final parameters.
Model accuracy for each parameter combination is measured on a validation set using rolling-origin cross-validation. Final model accuracy is measured on a test set of the most recent 10% of sales in our training sample. For final model candidates, we also measure model accuracy on a random (rather than time-based) test set to ensure the model generalizes well.
The residential model uses a variety of individual and aggregate features to determine a property’s assessed value. We’ve tested a long list of possible features over time, including walk score, crime rate, school districts, and many others. The features in the table below are the ones that made the cut. They’re the right combination of easy to understand and impute, powerfully predictive, and well-behaved.
Feature Name | Variable Name | Description | Category | Type | Possible Values (Encoded) | Possible Values (Semantic) |
---|---|---|---|---|---|---|
Percent Population Age, Under 19 Years Old | acs5_percent_age_children | Percent of the people 17 years or younger | ACS5 | numeric | ||
Percent Population Age, Over 65 Years Old | acs5_percent_age_senior | Percent of the people 65 years or older | ACS5 | numeric | ||
Median Population Age | acs5_median_age_total | Median age for whole population | ACS5 | numeric | ||
Percent Households Family, Married | acs5_percent_household_family_married | Percent of households that are family, married | ACS5 | numeric | ||
Percent Households Nonfamily, Living Alone | acs5_percent_household_nonfamily_alone | Percent of households that are non-family, alone (single) | ACS5 | numeric | ||
Percent Population Education, High School Degree | acs5_percent_education_high_school | Percent of people older than 25 who attained a high school degree | ACS5 | numeric | ||
Percent Population Education, Bachelor Degree | acs5_percent_education_bachelor | Percent of people older than 25 who attained a bachelor’s degree | ACS5 | numeric | ||
Percent Population Education, Graduate Degree | acs5_percent_education_graduate | Percent of people older than 25 who attained a graduate degree | ACS5 | numeric | ||
Percent Population Income, Below Poverty Level | acs5_percent_income_below_poverty_level | Percent of people below the poverty level in the last 12 months | ACS5 | numeric | ||
Median Income, Household in Past Year | acs5_median_income_household_past_year | Median income per household in the past 12 months | ACS5 | numeric | ||
Median Income, Per Capita in Past Year | acs5_median_income_per_capita_past_year | Median income per capita in the past 12 months | ACS5 | numeric | ||
Percent Population Income, Received SNAP in Past Year | acs5_percent_income_household_received_snap_past_year | Percent of households that received SNAP in the past 12 months | ACS5 | numeric | ||
Percent Population Employment, Unemployed | acs5_percent_employment_unemployed | Percent of people 16 years and older unemployed | ACS5 | numeric | ||
Median Occupied Household, Total, Year Built | acs5_median_household_total_occupied_year_built | Median year built for all occupied households | ACS5 | numeric | ||
Median Occupied Household, Renter, Gross Rent | acs5_median_household_renter_occupied_gross_rent | Median gross rent for only renter-occupied units | ACS5 | numeric | ||
Percent Occupied Households, Owner | acs5_percent_household_owner_occupied | Percent of households that are owner-occupied | ACS5 | numeric | ||
Year Built | char_yrblt | Year the property was constructed | Characteristic | numeric | ||
Central Air Conditioning | char_air | Indicator for central air | Characteristic | categorical | 1, 2 | Central A/C, No Central A/C |
Apartments | char_apts | Number of apartments for class 211 and 212 properties | Characteristic | categorical | 1, 2, 3, 4, 5, 6 | Two, Three, Four, Five, Six, None |
Attic Finish | char_attic_fnsh | Attic finish | Characteristic | categorical | 1, 2, 3 | Living Area, Partial, None |
Attic Type | char_attic_type | Attic type | Characteristic | categorical | 1, 2, 3 | Full, Partial, None |
Bedrooms | char_beds | Number of bedrooms in the building | Characteristic | numeric | ||
Building Square Feet | char_bldg_sf | Square footage of the building, as measured from the exterior | Characteristic | numeric | ||
Basement Type | char_bsmt | Basement type | Characteristic | categorical | 1, 2, 3, 4 | Full, Slab, Partial, Crawl |
Basement Finish | char_bsmt_fin | Basement finish | Characteristic | categorical | 1, 2, 3 | Formal Rec Room, Apartment, Unfinished |
Property Class | char_class | Card-level property type and/or use | Characteristic | character | ||
Exterior Wall Material | char_ext_wall | Exterior wall construction | Characteristic | categorical | 1, 2, 3, 4 | Frame, Masonry, Frame + Masonry, Stucco |
Full Baths | char_fbath | Number of full bathrooms | Characteristic | numeric | ||
Fireplaces | char_frpl | Number of fireplaces | Characteristic | numeric | ||
Garage 1 Attached | char_gar1_att | Indicator for garage attached | Characteristic | categorical | 1, 2 | Yes, No |
Garage 1 Ext. Wall Material | char_gar1_cnst | Garage exterior wall construction | Characteristic | categorical | 1, 2, 3, 4 | Frame, Masonry, Frame + Masonry, Stucco |
Garage 1 Size | char_gar1_size | Garage size (number of cars) | Characteristic | categorical | 1, 2, 3, 4, 5, 6, 7, 8 | 1 cars, 1.5 cars, 2 cars, 2.5 cars, 3 cars, 3.5 cars, 0 cars, 4 cars |
Half Baths | char_hbath | Number of half baths | Characteristic | numeric | ||
Land Square Feet | char_land_sf | Square footage of the land (not just the building) of the property | Characteristic | numeric | ||
Central Heating | char_heat | Interior heating type | Characteristic | categorical | 1, 2, 3, 4 | Warm Air Furnace, Hot Water Steam, Electric Heater, None |
Number of Commercial Units | char_ncu | Number of commercial units | Characteristic | numeric | ||
Porch | char_porch | Porch type | Characteristic | categorical | 0, 1, 2 | None, Frame Enclosed, Masonry Enclosed |
Roof Material | char_roof_cnst | Roof material / construction | Characteristic | categorical | 1, 2, 3, 4, 5, 6 | Shingle + Asphalt, Tar + Gravel, Slate, Shake, Tile, Other |
Rooms | char_rooms | Number of total rooms in the building (excluding baths) | Characteristic | numeric | ||
Cathedral Ceiling | char_tp_dsgn | Deprecated | Characteristic | categorical | 1, 2 | Yes, No |
Type of Residence | char_type_resd | Type of residence | Characteristic | categorical | 1, 2, 3, 4, 5, 9.9 | 1 Story, 2 Story, 3 Story +, Split Level, 1.5 Story, Missing |
Recent Renovation | char_recent_renovation | Indicates whether or not a property was renovated within the last 3 years | Characteristic | logical | ||
Longitude | loc_longitude | X coordinate in degrees (global longitude) | Location | numeric | ||
Latitude | loc_latitude | Y coordinate in degrees (global latitude) | Location | numeric | ||
Census Tract GEOID | loc_census_tract_geoid | 11-digit ACS/Census tract GEOID | Location | character | ||
First Street Factor | loc_env_flood_fs_factor | First Street flood factor. The flood factor is a risk score, where 10 is the highest risk and 1 is the lowest risk | Location | numeric | ||
School Elementary District GEOID | loc_school_elementary_district_geoid | School district (elementary) GEOID | Location | character | ||
School Secondary District GEOID | loc_school_secondary_district_geoid | School district (secondary) GEOID | Location | character | ||
CMAP Walkability Score (No Transit) | loc_access_cmap_walk_nta_score | CMAP walkability score for a given PIN, excluding transit walkability | Location | numeric | ||
CMAP Walkability Total Score | loc_access_cmap_walk_total_score | CMAP walkability score for a given PIN, including transit walkability | Location | numeric | ||
Municipality Name | loc_tax_municipality_name | Taxing district name, as seen on Cook County tax bills | Location | character | ||
Township Code | meta_township_code | Cook County township code | Meta | character | ||
Neighborhood Code | meta_nbhd_code | Assessor neighborhood code | Meta | character | ||
Number of sales within previous N years of sale/lien date | meta_sale_count_past_n_years | Number of sales within previous N years of sale/lien date | Meta | numeric | ||
Property Tax Bill Aggregate Rate | other_tax_bill_rate | Tax bill rate for the taxing district containing a given PIN | Other | numeric | ||
School District (Elementary) GreatSchools Rating | other_school_district_elementary_avg_rating | Average GreatSchools rating of elementary schools within the district of a given PIN | Other | numeric | ||
School District (Secondary) GreatSchools Rating | other_school_district_secondary_avg_rating | Average GreatSchools rating of secondary schools within the district of a given PIN | Other | numeric | ||
Active Homeowner Exemption | ccao_is_active_exe_homeowner | Parcel has an active homeowner exemption | Other | logical | ||
Number of Years Active Homeowner Exemption | ccao_n_years_exe_homeowner | Number of years parcel has had an active homeowner exemption | Other | numeric | ||
Standard Deviation Distance From Parcel Centroid to Vertices (Feet) | shp_parcel_centroid_dist_ft_sd | Standard deviation of the distance from each major parcel vertex to the parcel centroid | Parcel Shape | numeric | ||
Standard Deviation Parcel Edge Length (Feet) | shp_parcel_edge_len_ft_sd | Standard deviation of the edge length between parcel vertices | Parcel Shape | numeric | ||
Standard Deviation Parcel Interior Angle (Degrees) | shp_parcel_interior_angle_sd | Standard deviation of the interior angles of the parcel polygon | Parcel Shape | numeric | ||
Ratio of Parcel Area to Minimum Rotated Bounding Rectangle | shp_parcel_mrr_area_ratio | Ratio of the parcel’s area to the area of its minimum rotated bounding rectangle | Parcel Shape | numeric | ||
Ratio of Parcel Minimum Rotated Bounding Rectangle Longest to Shortest Side | shp_parcel_mrr_side_ratio | Ratio of the longest to the shortest side of the parcel’s minimum rotated bounding rectangle | Parcel Shape | numeric | ||
Number of Parcel Vertices | shp_parcel_num_vertices | The number of vertices of the parcel | Parcel Shape | numeric | ||
Number of PINs in Half Mile | prox_num_pin_in_half_mile | Number of PINs within half mile | Proximity | numeric | ||
Number of Bus Stops in Half Mile | prox_num_bus_stop_in_half_mile | Number of bus stops within half mile | Proximity | numeric | ||
Number of Foreclosures Per 1000 PINs (Past 5 Years) | prox_num_foreclosure_per_1000_pin_past_5_years | Number of foreclosures per 1000 PINs, within half mile (past 5 years) | Proximity | numeric | ||
Average School Rating in Half Mile | prox_avg_school_rating_in_half_mile | Average school rating of schools within half mile | Proximity | numeric | ||
Total Airport Noise DNL | prox_airport_dnl_total | Estimated DNL for a PIN, assuming a baseline DNL of 50 (“quiet suburban”) and adding predicted noise from O’Hare and Midway airports to that baseline | Proximity | numeric | ||
Nearest Bike Trail Distance (Feet) | prox_nearest_bike_trail_dist_ft | Nearest bike trail distance (feet) | Proximity | numeric | ||
Nearest Cemetery Distance (Feet) | prox_nearest_cemetery_dist_ft | Nearest cemetery distance (feet) | Proximity | numeric | ||
Nearest CTA Route Distance (Feet) | prox_nearest_cta_route_dist_ft | Nearest CTA route distance (feet) | Proximity | numeric | ||
Nearest CTA Stop Distance (Feet) | prox_nearest_cta_stop_dist_ft | Nearest CTA stop distance (feet) | Proximity | numeric | ||
Nearest Hospital Distance (Feet) | prox_nearest_hospital_dist_ft | Nearest hospital distance (feet) | Proximity | numeric | ||
Lake Michigan Distance (Feet) | prox_lake_michigan_dist_ft | Distance to Lake Michigan shoreline (feet) | Proximity | numeric | ||
Nearest Metra Route Distance (Feet) | prox_nearest_metra_route_dist_ft | Nearest Metra route distance (feet) | Proximity | numeric | ||
Nearest Metra Stop Distance (Feet) | prox_nearest_metra_stop_dist_ft | Nearest Metra stop distance (feet) | Proximity | numeric | ||
Nearest Park Distance (Feet) | prox_nearest_park_dist_ft | Nearest park distance (feet) | Proximity | numeric | ||
Nearest Railroad Distance (Feet) | prox_nearest_railroad_dist_ft | Nearest railroad distance (feet) | Proximity | numeric | ||
Nearest University Distance (Feet) | prox_nearest_university_dist_ft | Nearest university distance (feet) | Proximity | numeric | ||
Nearest Vacant Land Parcel Distance (Feet) | prox_nearest_vacant_land_dist_ft | Nearest vacant land (class 100) parcel distance (feet) | Proximity | numeric | ||
Nearest Water Distance (Feet) | prox_nearest_water_dist_ft | Nearest water distance (feet) | Proximity | numeric | ||
Nearest Golf Course Distance (Feet) | prox_nearest_golf_course_dist_ft | Nearest golf course distance (feet) | Proximity | numeric | ||
Nearest Highway Distance (Feet) | prox_nearest_road_highway_dist_ft | Distance to nearest highway road | Proximity | numeric | ||
Nearest Arterial Road Distance (Feet) | prox_nearest_road_arterial_dist_ft | Distance to nearest arterial road | Proximity | numeric | ||
Nearest Collector Road Distance (Feet) | prox_nearest_road_collector_dist_ft | Distance to nearest collector road | Proximity | numeric | ||
Average Daily Traffic Count on Nearest Highway | prox_nearest_road_highway_daily_traffic | Daily traffic of nearest highway road | Proximity | numeric | ||
Average Daily Traffic Count on Nearest Arterial Road | prox_nearest_road_arterial_daily_traffic | Daily traffic of nearest arterial road | Proximity | numeric | ||
Average Daily Traffic Count on Nearest Collector Road | prox_nearest_road_collector_daily_traffic | Daily traffic of nearest collector road | Proximity | numeric | ||
Nearest New Construction (Feet) | prox_nearest_new_construction_dist_ft | Nearest new construction distance (feet) | Proximity | numeric | ||
Nearest Major Stadium (Feet) | prox_nearest_stadium_dist_ft | Nearest stadium distance (feet) | Proximity | numeric | ||
Sale Year | time_sale_year | Sale year calculated as the number of years since 0 B.C.E | Time | numeric | ||
Sale Day | time_sale_day | Sale day calculated as the number of days since January 1st, 1997 | Time | numeric | ||
Sale Quarter of Year | time_sale_quarter_of_year | Character encoding of quarter of year (Q1 - Q4) | Time | character | ||
Sale Month of Year | time_sale_month_of_year | Character encoding of month of year (Jan - Dec) | Time | character | ||
Sale Day of Year | time_sale_day_of_year | Numeric encoding of day of year (1 - 365) | Time | numeric | ||
Sale Day of Month | time_sale_day_of_month | Numeric encoding of day of month (1 - 31) | Time | numeric | ||
Sale Day of Week | time_sale_day_of_week | Numeric encoding of day of week (1 - 7) | Time | numeric | ||
Sale After COVID-19 | time_sale_post_covid | Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020) | Time | logical |
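The time features in the table are all derived from a single sale date. The Python sketch below shows one way to compute them. It is not the pipeline's own code, and several conventions are assumptions: January 1st, 1997 is treated as day 0, the day of week follows ISO numbering (Monday = 1), and March 15, 2020 is used as an exact cutoff even though the table describes it as approximate.

```python
from datetime import date

SALE_DAY_ORIGIN = date(1997, 1, 1)   # origin given in the table
COVID_CUTOFF = date(2020, 3, 15)     # approximate date from the table
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def time_features(sale_date):
    """Derive the time_* features from a sale date (conventions assumed)."""
    return {
        "time_sale_year": sale_date.year,
        "time_sale_day": (sale_date - SALE_DAY_ORIGIN).days,
        "time_sale_quarter_of_year": f"Q{(sale_date.month - 1) // 3 + 1}",
        "time_sale_month_of_year": MONTHS[sale_date.month - 1],
        "time_sale_day_of_year": sale_date.timetuple().tm_yday,
        "time_sale_day_of_month": sale_date.day,
        "time_sale_day_of_week": sale_date.isoweekday(),  # Monday = 1 (assumed)
        "time_sale_post_covid": sale_date > COVID_CUTOFF,
    }


features_jan_2024 = time_features(date(2024, 1, 1))
```

Encoding time in several overlapping ways lets a tree-based model pick up both the long-run price trend (through the day count) and seasonal patterns (through quarter, month, and day of year).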
We maintain a few useful resources for working with these features:
- Once you’ve pulled the input data, you can inner join the data to the CSV version of the data dictionary (`docs/data-dict.csv`) to filter for only the features that we use in the model.
- You can browse our data catalog to see more details about these features, in particular the residential model input view which is the source of our training data.
- You can use the `ccao` R package or its Python equivalent to programmatically convert variable names to their human-readable versions (`ccao::vars_rename()`) or convert numerically-encoded variables to human-readable values (`ccao::vars_recode()`). The `ccao::vars_dict` object is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
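The data dictionary join mentioned in the first bullet can be sketched in a few lines. This Python example works on in-memory stand-ins rather than the real files, and the dictionary column name `variable_name` is hypothetical, so check the header of `docs/data-dict.csv` before adapting it. The helper names are illustrative as well.

```python
import csv
import io


def model_feature_names(dict_csv_text, name_column="variable_name"):
    """Read the variable names listed in a data dictionary CSV.

    `name_column` is a hypothetical column name.
    """
    reader = csv.DictReader(io.StringIO(dict_csv_text))
    return [row[name_column] for row in reader]


def keep_model_features(rows, feature_names):
    """Keep only the columns that also appear in the data dictionary,
    which is the effect of inner joining column names to the dictionary."""
    keep = set(feature_names)
    return [{k: v for k, v in row.items() if k in keep} for row in rows]


# Tiny made-up stand-ins for the dictionary and the input data
DICT_CSV = "variable_name,description\nchar_yrblt,Year Built\nchar_beds,Bedrooms\n"
INPUT_ROWS = [{"char_yrblt": 1950, "char_beds": 3, "some_unused_column": "x"}]

features = model_feature_names(DICT_CSV)
filtered = keep_model_features(INPUT_ROWS, features)
```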
We rely on numerous third-party sources to add new features to our data. These features are used in the primary valuation model and thus need to be high-quality and error-free. A non-exhaustive list of features and their respective sources includes:
Feature | Data Source |
---|---|
Tax rate | Cook County Clerk’s Office |
Airport noise | Noise monitoring stations via the Chicago Department of Aviation |
Road proximity | Buffering OpenStreetMap motorway, trunk, and primary roads |
Flood risk and direction | First Street flood data |
All Census features | ACS 5-year estimates for each respective year |
Elementary school district or attendance boundary | Cook County school district boundaries and CPS attendance boundaries |
High school district or attendance boundary | Cook County high school district boundaries and CPS high school attendance boundaries |
Walkability | The Chicago Metropolitan Agency for Planning’s ON TO 2050 Walkability Scores |
Subdivision, unincorporated areas, SSAs, etc. | Cook County GIS |
PUMA Housing Index | DePaul Institute for Housing Studies |
School Ratings | GreatSchools.org, aggregated to the district level |
Distance to CTA, PACE, Metra | Each agency’s respective GTFS feed, which contains the location of stops and lines |
Many people have intuitive assumptions about what drives the value of their home, so we often receive the question, “Is X taken into account when valuing my property?” Here’s a list of commonly-asked-about features which are not in the model, as well as rationale for why they’re excluded:
Feature | Reason It’s Excluded |
---|---|
Property condition | We track property condition, but over 98% of the properties in our data have the same condition, meaning it’s not tracked effectively and there’s not enough variation for it to be predictive of sale price. |
Crime | Crime is highly correlated with features that are already in the model, such as income and neighborhood, so it doesn’t add much predictive power. Additionally, it is difficult to reliably aggregate crime data from all of Cook County. |
Interior features such as kitchen quality or amenities | Our office can only access the outside of buildings; we can’t reliably observe interior property characteristics beyond what is available through building permits. |
Blighted building or eyesore in my neighborhood | If a specific building or thing affects sale prices in your neighborhood, this will already be reflected in the model through neighborhood fixed effects. |
Pictures of property | We don’t have a way to reliably use image data in our model, but we may include such features in the future. |
Comparable properties | The model will automatically find and use comparable properties when producing an estimate. However, the model does not explicitly use or produce a set of comparable properties. |
Flood indicator | Between the First Street flood risk and direction data, distance to water, and precise latitude and longitude for each parcel, the contribution of FEMA flood hazard data to the model approached zero. |
The model uses two primary data sets that are constructed by the `ingest` stage, as well as a few secondary data sets for valuation. These data sets are included in the `input/` directory for the purpose of replication.
- `training_data` - Includes residential sales from the 9 years prior to the next assessment date, which gives us a sufficient amount of data for accurate prediction without including outdated price information. This is the data used to train and evaluate the model. Its approximate size is 400K rows with 100 features.
- `assessment_data` - Includes all residential properties (sold and unsold) which need assessed values. This is the data the final model is used on. Its approximate size is 1.1 million rows with 100 features.
These data sets contain only residential single- and multi-family properties. Single-family includes property classes 202, 203, 204, 205, 206, 207, 208, 209, 210, 234, 278, and 295. Multi-family includes property classes 211 and 212. Bed and breakfast properties (class 218 and 219) are considered single-family for the sake of modeling, but are typically valued later by hand. Other residential properties, such as condominiums (class 299 and 399) are valued using a different model.
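The class groupings above can be expressed as a small lookup. This is a Python sketch for illustration; the class codes come from the paragraph above, while the function name and the returned labels are hypothetical.

```python
SINGLE_FAMILY = {"202", "203", "204", "205", "206", "207",
                 "208", "209", "210", "234", "278", "295"}
MULTI_FAMILY = {"211", "212"}
BED_AND_BREAKFAST = {"218", "219"}  # modeled as single-family, hand-valued later


def modeling_group(prop_class):
    """Return the group used for modeling, or None if the class is
    outside the scope of the residential model (e.g. condominiums)."""
    code = str(prop_class)
    if code in SINGLE_FAMILY or code in BED_AND_BREAKFAST:
        return "single-family"
    if code in MULTI_FAMILY:
        return "multi-family"
    return None
```

For example, `modeling_group("212")` returns `"multi-family"`, while `modeling_group("299")` returns `None` because condominiums are valued using a different model.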
Models need data in order to be trained and measured for accuracy. Modern predictive modeling typically uses three data sets:
- A training set, used to train the parameters of the model itself.
- A validation set, used to choose a hyperparameter combination that optimizes model accuracy.
- A test set, used to measure the performance of the trained, tuned model on unseen data.
`training_data` is used to create these data sets. It is subdivided using a technique called out-of-time testing.
Out-of-time testing explicitly measures the model’s ability to predict recent sales. It holds out the most recent 10% of sales as a test set, while the remaining 90% of the data is split into training and validation sets.
The training data is further subdivided using a technique called rolling-origin resampling. For this method, a fixed window of time is used to increment the size of the training set, while the validation set is always 10% of sales immediately following the training set. This helps cross-validation determine which hyperparameters will perform best when predicting future sales.
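The two splitting steps above can be sketched in base R. This is a minimal illustration on made-up sales, not the pipeline's actual resampling code: the `make_rolling_folds()` helper, the number of folds, and the use of row counts (rather than a calendar window) to grow the training set are all assumptions made for the example.

```r
# Toy sales data, one sale every 10 days (hypothetical values)
set.seed(42)
sales <- data.frame(
  sale_date = as.Date("2015-01-01") + seq(0, by = 10, length.out = 100),
  sale_price = round(runif(100, 1e5, 5e5))
)
sales <- sales[order(sales$sale_date), ]
n <- nrow(sales)

# Out-of-time test set: hold out the most recent 10% of sales
n_test <- round(0.10 * n)
test_idx <- seq(n - n_test + 1, n)
train_val <- sales[-test_idx, ]

# Rolling-origin resampling (hypothetical helper): the training set grows by
# a fixed step, and each validation set is the block of sales that
# immediately follows it
make_rolling_folds <- function(n_obs, n_folds = 3, val_prop = 0.10) {
  n_val <- max(1, round(val_prop * n_obs))
  # Last validation row for each fold; the final fold ends at the last row
  val_end <- seq(n_obs - (n_folds - 1) * n_val, n_obs, by = n_val)
  lapply(val_end, function(end) {
    list(
      train = seq_len(end - n_val),
      validation = seq(end - n_val + 1, end)
    )
  })
}

folds <- make_rolling_folds(nrow(train_val))
```

Because every validation block sits strictly after its training rows in time, each fold mimics the real task of predicting sales that have not happened yet.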
Once we’re satisfied with the model’s performance on recent sales, we retrain the model using the full sales sample (all rows in `training_data`). This gives the final model more (and more recent) sales to learn from.
Finally, the model, trained on the full sales sample from `training_data`, can be used to predict assessed values for all residential properties. To do this, we set the “sale date” of all properties in `assessment_data` to Jan 1st of the assessment year, then use the final model to predict what the sale price would be on that date.
These sale prices are our initial prediction for what each property is worth. They eventually become the assessed value sent to taxpayers after some further adjustments (see Post-Modeling) and hand review.
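As a rough illustration of the "predict as if sold on Jan 1st" step, the sketch below fits a plain linear model on toy data and scores every property with the same hypothetical sale date. The linear model is only a stand-in for the real LightGBM model, and all column names (`sqft`, `sale_day`, `pred_initial`) and values are invented for the example.

```r
# Toy training sales (hypothetical columns and values)
set.seed(1)
training_data <- data.frame(
  sale_date = as.Date("2020-01-01") + sample(0:1460, 200, replace = TRUE),
  sqft = round(runif(200, 800, 4000))
)
training_data$sale_price <- 50000 + 150 * training_data$sqft +
  20 * as.numeric(training_data$sale_date - as.Date("2020-01-01")) +
  rnorm(200, sd = 10000)

# Stand-in for the final model: a linear model, NOT the actual LightGBM model
training_data$sale_day <- as.numeric(training_data$sale_date)
final_model <- lm(sale_price ~ sqft + sale_day, data = training_data)

# Properties that need values, sold and unsold alike
assessment_data <- data.frame(
  pin = c("A", "B", "C"),
  sqft = c(1000, 2000, 3000)
)

# Set every property's "sale date" to Jan 1st of the assessment year
assessment_year <- 2024
assessment_data$sale_day <- as.numeric(
  as.Date(paste0(assessment_year, "-01-01"))
)

# Predicted sale price as of the assessment date
assessment_data$pred_initial <- predict(final_model, newdata = assessment_data)
```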
The pipeline also uses a few secondary data sets in the valuation process. These data sets are included in `input/` but are not actually used by the model itself. They include:
- `char_data` - The complete `assessment_data` set as well as the same data for the previous year. This data is used for automated model performance reporting rather than valuation.
- `complex_id_data` - Complex identifiers for class 210 and 295 town/rowhomes. Intended to group like units together to ensure that nearly identical units in close proximity receive the same assessed value. This is accomplished with a “fuzzy grouping” strategy that allows slightly dissimilar characteristics.
- `hie_data` - Home improvement exemption data used to evaluate whether the pipeline correctly updates card-level characteristics triggered by the expiration of home improvement exemptions.
- `land_site_rate_data` - Fixed, PIN-level land values for class 210 and 295 units. Provided by the Valuations department. Not always used, so may be 0 rows for certain years.
- `land_nbhd_rate_data` - Fixed $/sqft land rates by assessor neighborhood for residential property classes except 210 and 295. Provided by the Valuations department.
There’s a common saying in the machine learning world: “garbage in, garbage out.” This is a succinct way to say that training a predictive model with bad, unrepresentative, or biased data leads to bad results.
To help mitigate the bad data problem and ensure accurate prediction, we do our best to ensure that the sales data used to train the model is representative of the actual market and universe of properties. We accomplish this in two ways.
We use a heuristics-based approach to drop non-arms-length sales, remove outliers, and manually flag certain suspect sales. This approach was developed in partnership with the Mansueto Institute. As of 2023, the sales validation code can be found in a dedicated repository at ccao-data/model-sales-val. Please visit that repository for more information.
We also perform basic balance tests to determine if the universe of properties sold is analogous to the universe of all properties. The code for these tests can be found under `reports/`. The goal of the tests is to see if any characteristics are significantly predictive of sale status, and the tests generally take the form of a logistic regression with the following specification:
sold_in_last_2_years = β₀ + βₙcharacteristics + βₙlocation_fixed_effects + ... + ε
There are a few caveats with this approach and with balance testing in general:
- There could be statistically significant omitted variables that differentiate sold from unsold. Things like `recently_painted` or `full_kitchen_renovation` are good examples. We don’t collect these data points, so it could be the case that sold properties are more “sale-ready” in these unknown terms.
- There could be significant variation by geography in the representativeness of the sales. In other words, certain areas could have non-representative sales whose predictive effect on `sold_in_last_2_years` is washed out due to mis- or under-specified geographic sampling.
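A balance test of this form can be sketched with base R's `glm()`. The data below is simulated and the column names (`char_bldg_sf`, `char_age`, `township`) are hypothetical stand-ins for real characteristics and location fixed effects; the actual tests under `reports/` may be specified differently.

```r
set.seed(123)
n <- 2000

# Simulated universe of properties (hypothetical characteristics)
props <- data.frame(
  char_bldg_sf = round(runif(n, 800, 4000)),
  char_age = round(runif(n, 1, 120)),
  township = factor(sample(c("T1", "T2", "T3"), n, replace = TRUE))
)

# Sale status simulated independently of the characteristics, i.e. the
# "balanced" case where sold and unsold properties look alike
props$sold_in_last_2_years <- rbinom(n, 1, 0.1)

# Logistic regression of sale status on characteristics plus location
# fixed effects (township enters as a set of dummy variables)
balance_fit <- glm(
  sold_in_last_2_years ~ char_bldg_sf + char_age + township,
  data = props,
  family = binomial()
)

# Coefficients with small p-values would flag characteristics that are
# predictive of sale status
coefs <- summary(balance_fit)$coefficients
coefs[, c("Estimate", "Pr(>|z|)")]
```

With real data, a characteristic that is consistently significant here suggests that sold properties differ systematically from the full universe along that dimension.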
In addition to the first-pass modeling done by LightGBM, the CCAO also performs a set of simple adjustments on the initial predicted values from the `assess` stage. These adjustments are internally called “post-modeling,” and are responsible for correcting minor deficiencies in the initial predictions. Specifically, post-modeling will:
- Aggregate values for multi-card properties to the PIN level, then disaggregate them back to the card level. A check is used to ensure that the PIN-level assessed value is not significantly greater than the prior year’s value. This is needed because often back buildings (ADUs, secondary buildings) will receive a much higher initial value than they are actually worth (since they are not differentiated as ADUs by the model).
- Ensure that nearly identical properties are identically valued. For some property classes, such as 210 and 295s, we manually adjust values such that all identical properties in the same complex receive the same predicted value. This is accomplished by replacing individual predicted values with the average predicted value for the complex.
- Round PIN-level values (typically to the nearest $1,000). This is done to indicate that model values are estimates, not precise values.
These adjustments have been collectively approved by the senior leadership of the CCAO. They are designed to limit the impact of data integrity issues, prevent regressivity in assessment, and ensure that people with nearly identical properties receive the same value.
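The complex-averaging and rounding adjustments can be illustrated with a few lines of base R. This is a simplified sketch on invented values; the column names (`complex_id`, `pred_initial`, `pred_final`) are hypothetical, and the multi-card aggregation step is left out.

```r
# Hypothetical initial predictions for five units
preds <- data.frame(
  pin = c("P1", "P2", "P3", "P4", "P5"),
  complex_id = c(1, 1, 1, 2, NA),
  pred_initial = c(251300, 248900, 253400, 310750, 402499)
)

# Replace each unit's value with the average for its complex; units
# without a complex keep their own predicted value
in_complex <- !is.na(preds$complex_id)
preds$pred_final <- preds$pred_initial
preds$pred_final[in_complex] <- ave(
  preds$pred_initial[in_complex],
  preds$complex_id[in_complex],
  FUN = mean
)

# Round to the nearest $1,000 to signal that values are estimates
preds$pred_final <- round(preds$pred_final, -3)
```

After this step, the three units in complex 1 share a single value even though their initial predictions differed slightly.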
This repository represents a significant departure from the old residential modeling codebase used to create assessed values in 2019 and 2020. As the CCAO’s Data department has grown, we’ve been able to dedicate more resources to building models, applications, and other tools. As a result, we’ve made the following major changes to the residential modeling codebase:
- Reduced the size of the codebase substantially from around 16,000 lines of R code. This was accomplished by moving complicated data handling to our internal R package and abstracting away machine learning logic to Tidymodels.
- Unified modeling for the entire county. Prior iterations of the residential model used individual models for each township. This was difficult to implement and track and performed worse than a single large model. The new model can value any residential property in the county, is significantly faster to train, and is much easier to replicate.
- Split the residential codebase into separate models for single/multi-family and condominiums. Previously, these models were combined in the same scripts, leading to a lot of complications and unnecessary overhead. Separating them makes it much easier to understand and diagnose each model.
- Switched to using LightGBM as our primary valuation model. LightGBM is essentially the most bleeding-edge machine learning framework widely available that isn’t a neural network. Prior to using LightGBM, we used linear models or R’s gbm package. Prior to 2018, the CCAO used linear models in SPSS for residential valuations.
- Improved dependency management via renv. Previously, users trying to replicate our model needed to manually install a list of needed R packages. By switching to renv, we’ve vastly reduced the effort needed to replicate our modeling environment; see the installation section below.
- Moved previously separate processes into this repository and improved their integration with the overall modeling process. For example, the etl_res_data process was moved to pipeline/00-ingest.R, while the process to finalize model values was moved to pipeline/07-export.R.
- Added DVC support/integration. This repository uses DVC in 2 ways:
  - All input data in `input/` is versioned, tracked, and stored using DVC. Previous input data sets are stored in perpetuity on S3.
  - DVC pipelines are used to sequentially run R pipeline scripts and track/cache inputs and outputs.
- All model runs are now saved in perpetuity on S3. Each model’s outputs are saved as Parquet files which can be queried using Amazon Athena.
- Offloaded model reporting entirely to Tableau. This repository no longer produces markdown-based model outcome reports.
- Improved model accuracy significantly while reducing training time. This is largely due to the use of Lightsnip and the inclusion of many new features.
- Added per feature, per property contributions via LightGBM’s built-in SHAP methods.
- Reorganized the codebase into explicit pipeline stages, each of which can be run independently or via DVC.
- Added GitHub CI integration, which ensures that any model changes don’t result in significant output changes.
- Added updated sales flagging and validation scripts in partnership with the Mansueto Institute. See Representativeness.
- Rewrote the assessment stage for speed and improved accuracy when valuing prorated and multi-card PINs.
- Added new feature importance output table, which shows the gain, frequency, and cover for each model run.
- Added model QC and balance testing reports for ad-hoc analysis of model inputs.
- Updated multi-card heuristic to only apply to PINs with 2 cards (improvements on the same parcel).
- Updated townhome complex valuation method to prevent “chaining” via fuzzy grouping.
- Updated CV implementation so that Lightsnip and Tidymodels share the same validation set: Lightsnip for early stopping, Tidymodels for Bayesian optimization.
- Dropped explicit spatial lag generation in the ingest stage.
- Lots of other bugfixes and minor improvements.
- Moved sales validation to a dedicated repository located at ccao-data/model-sales-val.
- Infrastructure improvements
  - Added `build-and-run-model` workflow to run the model using GitHub Actions and AWS Batch.
  - Added `delete-model-run` workflow to delete test run artifacts in S3 using GitHub Actions.
  - Updated pipeline/05-finalize step to render a performance report using Quarto and factored S3/SNS operations out into pipeline/06-upload.R.
- Added additional regressivity metrics (MKI) to measure model performance.
- Switched cross-validation to V-fold instead of time-based.
- Added new model features: corner lots, distance to vacant land/university/secondary roads, homeowner exemption indicator and length of exemption, number of recent sales, class.
- Added linear baseline model for comparison against LightGBM to pipeline/01-train.
- Added experimental comparable sales generation using LightGBM leaf nodes to pipeline/04-interpret.
- Refactored shared pipeline logic into separate scripts to simplify development and maintainability.
- Separated development/reporting dependencies from primary dependencies using renv profiles to increase replicability.
The CCAO faces a number of ongoing issues which make modeling difficult. Some of these issues are in the process of being solved; others are less tractable. We list them here for the sake of transparency and to provide a sense of the challenges we face.
We face a number of data-related challenges that are specific to our office. These issues are largely the result of legacy data systems, under-staffing, and the sheer number of properties in Cook County (over 1 million residential properties). We’re actively working to correct or mitigate most of these issues.
Our office tracks around 40 characteristics of individual properties. Of those 40, about 25 are usable in modeling. The remaining 15 characteristics are too sparse, too dirty, or too unbalanced to use. Additionally, our data is missing features commonly used in property valuation, such as:
- Property condition.
- Lot frontage.
- Land slope.
- Percentage of property above grade.
- Quality of finishes.
- Electrical and utility systems.
- Interior characteristics like finish quality, recent remodeling, or kitchen quality.
- Any information about pools.
- Information about location desirability or views.
This lack of characteristics contributes to larger errors when modeling, as it becomes difficult to distinguish between individual properties. For example, an extremely run-down mansion with otherwise high-value characteristics (good location, large number of bedrooms) may be significantly over-assessed, due to our model not accounting for property condition.
The property characteristics we track can sometimes be incorrect or outdated. The two major sources of characteristic errors are:
- Data entry or processing errors. Records collected by our office often need to be digitized, and mistakes happen. Fortunately, these types of errors are relatively rare.
- Characteristic update errors. There are a variety of systems that update the characteristics of properties in our system. Some of them can be slow to detect changes or otherwise unreliable.
These errors can cause under- or over-assessment. If you believe your property has been misvalued due to a characteristic error, or the property characteristics recorded on our website are incorrect, please contact our office to file a property characteristic appeal.
It is difficult for our office to determine whether or not any given property sale is arms-length. Non-arms-length sales, such as selling your home to a family member at a discount, can bias the model and result in larger assessment errors. We do our best to remove non-arms-length sales, but it’s nearly impossible to know for certain that every transaction is valid.
The Cook County property tax system is complex and can sometimes create perverse incentives.
For example, most property owners want their property taxes to be as low as possible, and are thus disincentivized from reporting characteristic errors which could raise their assessed value. Conversely, if a property owner plans to sell their home on a listing website, then they have a strong incentive (the highest possible sale price) to ensure the website accurately reflects their property’s characteristics. Listing websites know this and offer easy ways to self-update property attributes.
Falsely altering or not reporting property characteristics may change an assessed value, but it also has negative consequences for neighbors and similar properties. High sales on homes with incorrectly reported characteristics can upwardly bias the model, resulting in over-assessment for others.
In addition to the data challenges that are specific to our office, we also face the same modeling issues as most assessors and machine learning practitioners.
Cook County is an extremely large and diverse housing market. It spans millions of properties that vary widely in type, age, location, and quality. In some regions of the county, sales are common; in other regions, sales are sparse. Accurately estimating the price of such different properties and regions is a complicated, challenging task.
This challenge is especially acute in areas with high housing characteristic and price heterogeneity. For example, the Hyde Park neighborhood in Chicago is home to the University of Chicago and has large, multi-million-dollar houses near campus. However, sale prices drop precipitously just a few blocks away, as one passes south of 63rd street or west of I-90. This sort of sharp price discontinuity makes it difficult to accurately assess properties, as models tend to “smooth” such hard breaks unless geographic boundaries are explicitly defined.
Hyde Park is only one example; similarly unique situations exist throughout the county. Our model does account for some of these situations through neighborhood fixed effects and other location factors. However, effectively modeling major drivers of heterogeneity is an ongoing challenge.
Mass appraisal models need lots of sales data in order to accurately predict sale prices, but sales become more sparse toward either end of the price spectrum. The vast majority of properties (over 90%) in Cook County sell for between $50K and $2.5M. Predicting sale prices outside of that range is difficult; there just aren’t enough representative sales to train the model effectively.
This issue is particularly prevalent within certain geographies with unevenly distributed sales. For example, in New Trier township the average 2021 sale price was around $1.2 million, compared to the whole county average of around $400K. Lower-value sales closer to the county average are rare in New Trier. Due to that rarity, lower-value properties in New Trier are more likely to be overvalued. The same situation exists in reverse for lower-value areas.
This problem isn’t limited to mass appraisal models; predictive models in general are not good at predicting outliers. We may implement new machine learning techniques or policies to deal with this issue in the future.
Q: My assessed value seems too low or too high. How do I fix it?
There are over one million residential properties in Cook County spanning a huge variety of locations, types, ages, and conditions. Mass appraisal should produce fair valuations for most properties. But a mass appraisal model isn’t going to accurately value every single property. If you believe that the value produced by our model is inaccurate, please file an appeal with our office.
Q: My home has been sold recently. Why isn’t my assessed value equal to my sale price?
Setting the assessed value of a home equal to the value of a recent sale is called selective appraisal or sales chasing. Sales chasing can artificially improve assessment performance statistics and bias statistical models. Worse, it can bias assessment accuracy in favor of recently sold properties, giving an unfair advantage to areas or properties with high turnover. For more information, see Appendix E of the IAAO Standard on Ratio Studies.
Q: How are comparables used in the model?
We don’t use sale or uniformity comparables for the purpose of modeling. Our model works by automatically finding patterns in sales data and extrapolating those patterns to predict prices; the model never explicitly says, “Here is property X and here are Y similar properties and their sale prices.”
We do use comparables for other things, namely when processing appeals and when evaluating the model’s performance. Note however that the comparables generated via #106 are experimental and are not currently used.
Q: What are the most important features in the model?
The importance of individual features in the model varies from place to place. Some properties will gain $50K in value from an additional bedroom, while others will gain almost nothing. However, some factors do stand out as more influential:
- Location. Two identical single-family homes, one in Wicker Park, the other in Markham, will not receive the same valuation. Location is the largest driver of county-wide variation in property value. This is accounted for in our model through a number of location-based features such as school district, neighborhood, township, and others.
- Square footage. Larger homes tend to be worth more than smaller ones, though there are diminishing marginal returns.
- Number of bedrooms and bathrooms. Generally speaking, the more rooms the better, though again there are diminishing returns. The value added by a second bedroom is much more than the value added by a twentieth bedroom.
Q: How much will one additional bedroom add to my assessed value?
Our model is non-linear, meaning it’s difficult to say things like, “Each additional square foot will increase this property’s value by $50,” as the relationship between price and individual features varies from property to property.
We do calculate the contribution of each feature to each property’s final value. For example, we can say things like, “Your close proximity to Lake Michigan added $5,000 to your home’s value.” We’re currently working on a way to share those feature-level results with property owners.
Q: Why don’t you use a simple linear model?
We decided that performance was more important than the easy interpretability offered by linear models, and LightGBM tends to outperform linear models on data with a large number of categorical features, interactions, and non-linearities.
Q: How do you measure model performance?
Assessors tend to use housing and assessment-specific measurements to gauge the performance of their mass appraisal systems, including:
- COD (Coefficient of Dispersion)
- PRD (Price-Related Differential)
- PRB (Price-Related Bias)
- MKI (Modified Kakwani Index)
More traditionally, we use R², root-mean-squared-error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) to gauge overall model performance and fit.
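Several of these metrics are simple to compute by hand. The sketch below uses the standard ratio-study definitions of COD and PRD, plus the usual error metrics, on a handful of invented sales. It is meant only to show what the numbers measure, and it is not the code used to produce our performance reports. PRB and MKI are omitted for brevity.

```r
# Hypothetical sale prices and model estimates
sale_price <- c(100000, 200000, 300000, 400000, 500000)
estimate <- c(110000, 190000, 300000, 380000, 450000)
ratio <- estimate / sale_price

# COD: average absolute deviation from the median ratio, expressed as a
# percentage of the median ratio (lower means more uniform)
cod <- 100 * mean(abs(ratio - median(ratio))) / median(ratio)

# PRD: mean ratio divided by the sale-weighted mean ratio (values above 1
# suggest regressivity)
prd <- mean(ratio) / (sum(estimate) / sum(sale_price))

# Traditional fit metrics
rmse <- sqrt(mean((estimate - sale_price)^2))
mae <- mean(abs(estimate - sale_price))
mape <- 100 * mean(abs(estimate - sale_price) / sale_price)
r2 <- 1 - sum((sale_price - estimate)^2) /
  sum((sale_price - mean(sale_price))^2)

round(c(cod = cod, prd = prd, rmse = rmse, mae = mae, mape = mape, r2 = r2), 3)
```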
Q: How often does the model change?
We’re constantly making minor tweaks to improve the model’s accuracy, speed, and usability. However, major changes to the model typically take place during the downtime between reassessments, so about once per year.
There are two ways of running the model:
- On a local machine (available to all users)
- In the cloud via AWS Batch (only available to CCAO staff)
The code in this repository is written primarily in R. Please install the latest version of R (requires R version >= 4.2.1) and RStudio before proceeding with the steps below.
If you’re on Windows, you’ll also need to install Rtools in order to build the necessary packages. You may also want to (optionally) install DVC to pull data and run the pipeline.
We also publish a Docker image containing the model code and all of the dependencies necessary to run it. If you’re comfortable using Docker, you can skip the installation steps below and instead pull the image from `ghcr.io/ccao-data/model-res-avm:master` to run the latest version of the model.
1. Clone this repository using git, or simply download it using the button at the top of the page.
2. Set your working directory to the local folder containing this repository’s files, either using R’s `setwd()` command or (preferably) using RStudio’s projects.
3. Install `renv`, R’s package manager, by running `install.packages("renv")`.
4. Install all R package dependencies using `renv` by running `renv::restore()`. This step may take a while. Linux users will likely need to install dependencies (via apt, yum, etc.) to build from source.
5. (Optional) The `finalize` step of the model pipeline requires some additional dependencies for generating a model performance report. Install these additional dependencies by running `renv::restore(lockfile = "renv/profiles/reporting/renv.lock")`. These dependencies must be installed in addition to the core dependencies installed in step 4. If dependencies are not installed, the report will fail to generate and the pipeline stage will print the error message to the report file at `reports/performance.html`; the pipeline will continue to execute in spite of the failure.
For installation issues, particularly related to package installation and dependencies, see Managing R dependencies and Troubleshooting.
To use this repository, simply open the pipeline/ directory and run the R scripts in order. Non-CCAO users can skip the following stages:
- `pipeline/00-ingest.R` - Requires access to CCAO internal AWS services to pull data. See Getting Data if you are a member of the public.
- `pipeline/06-upload.R` - Requires access to CCAO internal AWS services to upload model results.
- `pipeline/07-export.R` - Only required for CCAO internal processes.
The entire end-to-end pipeline can also be run using DVC. DVC will track the dependencies and parameters required to run each stage, cache intermediate files, and store versioned input data on S3.
To pull all the necessary input data based on the information in `dvc.lock`, run:
dvc pull
To run the entire pipeline (excluding the export stage), run:
dvc repro
Note that each stage will run only if necessary; for example, the ingest stage will not run if no parameters related to that stage have changed. To force a stage to re-run, run:
# Change ingest to any stage name
dvc repro -f ingest
To force the entire pipeline to re-run, run:
dvc repro -f
The web of dependencies, outputs, parameters, and intermediate files is defined via the `dvc.yaml` file. See that file for more information about each stage’s outputs, inputs/dependencies, and related parameters (defined in `params.yaml`).
If you have write permissions for this repository (i.e. you are a member of the CCAO Data Department), you can run the model in the cloud on AWS Batch using GitHub Actions workflow runs.
Model runs are initiated by the `build-and-run-model` workflow via manual dispatch.
To run a model, use the Run workflow button on the right side of the `build-and-run-model` Actions page.
Runs are gated behind a `deploy` environment that requires approval from a `@ccao-data/core-team` member before the model will run. The `build` job to rebuild a Docker image for the model will always run, but the subsequent `run` job will not run unless a core-team member approves it.
Runs can be monitored on AWS via CloudWatch as they execute in a Batch job. Navigate to the run logs in the GitHub Actions console and look for the `build-and-run-model / run` job. Find the `Wait for Batch job to start and print link to AWS logs` step and expand it to reveal a link to the CloudWatch logs for the run.
Test runs of the model can be deleted using the `delete-model-runs` workflow. This workflow will delete all of the associated run artifacts from S3. To delete one or more runs, copy their unique IDs (e.g. `2024-01-01-foo-bar`) and paste them in the workflow dispatch input box, with each run ID separated by a space (e.g. `2024-01-01-foo-bar 2024-02-02-bar-baz`).
⚠️ NOTE: In order to protect production model run artifacts, the `delete-model-runs` workflow can only delete model runs for the upcoming assessment cycle (the current year from January-April, or the next year from May-December). The workflow will raise an error if you attempt to delete a model run outside the upcoming assessment cycle.
On the off chance that you do in fact need to delete a test run from a previous assessment cycle, you can work around this limitation by moving model run artifacts to bucket prefixes representing the partition for the upcoming assessment year (e.g. `year=2024/`) and then proceed to delete the model run.
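To make the "upcoming assessment cycle" rule concrete, here is a small R sketch of how the cutoff described above could be computed, along with how a space-separated list of run IDs splits into individual IDs. The `upcoming_assessment_year()` helper is hypothetical and written only for illustration; it is not the workflow's actual implementation.

```r
# Hypothetical helper: the upcoming assessment year is the current year
# from January through April, and the next year from May through December
upcoming_assessment_year <- function(today = Sys.Date()) {
  year <- as.integer(format(today, "%Y"))
  month <- as.integer(format(today, "%m"))
  if (month <= 4) year else year + 1L
}

# Split the space-separated workflow input into individual run IDs
run_ids <- strsplit("2024-01-01-foo-bar 2024-02-02-bar-baz", " +")[[1]]

upcoming_assessment_year(as.Date("2024-03-15"))
upcoming_assessment_year(as.Date("2024-05-01"))
```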
All control parameters, hyperparameters, toggles, etc. are stored in `params.yaml`. Almost all modifications to the pipeline are made via this file. It also contains a full description of each parameter and its purpose.
Each R script has a set of associated parameters (tracked via `dvc.yaml`). DVC will automatically detect changes in these parameters and will re-run stages for which parameters have changed. Stages without changed parameters or input data are cached and will be automatically skipped by DVC.
The full model pipeline produces a large number of outputs. A full list of these outputs and their purpose can be found in `misc/file_dict.csv`. For public users, all outputs are saved in the `output/` directory, where they can be further used/examined after a model run. For CCAO employees, all outputs are uploaded to S3 via the upload stage.
Uploaded Parquet files are converted into the following Athena tables:
Athena Table | Observation Unit | Primary Key | Description |
---|---|---|---|
assessment_card | card | year, run_id, township_code, meta_pin, meta_card_num | Assessment results at the card level AKA raw model output |
assessment_pin | pin | year, run_id, township_code, meta_pin | Assessment results at the PIN level AKA aggregated and cleaned |
comp | card | year, run_id, meta_pin, meta_card_num | Comparables for each card (computed using leaf node assignments) |
feature_importance | predictor | year, run_id, model_predictor_all_name | Feature importance values (gain, cover, and frequency) for the run |
metadata | model run | year, run_id | Information about each run, including parameters, run ID, git info, etc. |
parameter_final | model run | year, run_id | Chosen set of hyperparameters for each run |
parameter_range | parameter | year, run_id, parameter_name | Range of hyperparameters searched during CV tuning |
parameter_search | model cv fold | year, run_id, configuration, fold_id | Tidymodels tuning output from cross-validation |
performance | geography [by class] | year, run_id, stage, geography_type, geography_id, by_class, class | Performance metrics (optionally) broken out by class for different levels of geography |
performance_quantile | geography [by class] by quantile | year, run_id, stage, geography_type, geography_id, by_class, class, quantile | Performance metrics by quantile within class and geography |
shap | card | year, run_id, township_code, meta_pin, meta_card_num | SHAP values for each feature for each card in the assessment data |
test_card | card | year, meta_pin, meta_card_num | Test set predictions at the card level |
timing | model run | year, run_id | Finalized time elapsed for each stage of the run |
The data required to run these scripts is produced by the ingest stage, which uses SQL pulls from the CCAO’s Athena database as a primary data source. CCAO employees can run the ingest stage or pull the latest version of the input data from our internal DVC store using:
dvc pull
Public users can download data for each assessment year using the links below. Each file should be placed in the `input/` directory prior to running the model pipeline.
- assessment_data.parquet
- complex_id_data.parquet
- land_nbhd_rate_data.parquet
- land_site_rate_data.parquet
- training_data.parquet
- assessment_data.parquet
- complex_id_data.parquet
- land_nbhd_rate_data.parquet
- land_site_rate_data.parquet
- training_data.parquet
Due to a data issue with the initial 2024 model run, there are actually two final 2024 models. The run `2024-02-06-relaxed-tristan` was used for Rogers Park and West townships only, while the run `2024-03-17-stupefied-maya` was used for all subsequent City of Chicago townships.
The data issue caused some sales to be omitted from the `2024-02-06-relaxed-tristan` training set; however, the actual impact on predicted values was minimal. We chose to update the data and create a second final model out of an abundance of caution and, given low transaction volume in 2023, to include as many arms-length transactions in the training set as possible.
- assessment_data.parquet
- char_data.parquet
- complex_id_data.parquet
- hie_data.parquet
- land_nbhd_rate_data.parquet
- land_site_rate_data.parquet
- training_data.parquet
- assessment_data.parquet
- char_data.parquet
- complex_id_data.parquet
- hie_data.parquet
- land_nbhd_rate_data.parquet
- land_site_rate_data.parquet
- training_data.parquet
For other data from the CCAO, please visit the Cook County Data Portal.
Both Tidymodels and LightGBM support parallel processing to speed up model training. However, the current parallel implementation in Tidymodels is extremely memory-intensive, as it needs to carry loaded packages and objects into each worker process. As such, parallel processing in Tidymodels is turned off, while parallel processing in LightGBM is turned on. This means that models are fit sequentially, but each model fitting is sped up using the parallel processing built into LightGBM. Note that:
- The total amount of RAM needed for overall model fitting is around 6GB, though this is ultimately dependent on a number of LightGBM parameters.
- The number of threads is set via the `num_threads` parameter, which is passed to the model using the `set_args()` function from `parsnip`. By default, `num_threads` is equal to the full number of physical cores available. More (or faster) cores will decrease total training time.
- This repository uses the CPU version of LightGBM included with the LightGBM R package. If you’d like to use the GPU version you’ll need to build it yourself.
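As a rough illustration of the thread setting described above, the sketch below builds a model specification and passes `num_threads` through `set_args()`. It assumes the `lightgbm` engine is registered by the `bonsai` package, which may differ from the engine package this repository actually uses. The object name `lgbm_spec` is hypothetical.

```r
# Number of physical cores; detectCores() can return NA on some
# platforms, so fall back to a single thread in that case
n_threads <- parallel::detectCores(logical = FALSE)
if (is.na(n_threads)) n_threads <- 1L

# Assumption: the "lightgbm" parsnip engine comes from bonsai.
# Loading the bonsai namespace registers the engine with parsnip.
if (requireNamespace("parsnip", quietly = TRUE) &&
    requireNamespace("bonsai", quietly = TRUE)) {
  # num_threads is not a main boost_tree() argument, so set_args()
  # stores it as an engine argument that is forwarded to LightGBM
  lgbm_spec <- parsnip::boost_tree(mode = "regression") |>
    parsnip::set_engine("lightgbm") |>
    parsnip::set_args(num_threads = n_threads)
}
```

Lowering `n_threads` is one way to leave cores free for other work, at the cost of longer training time.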
We use renv to manage R dependencies. The main model dependencies are listed explicitly in the `DESCRIPTION` file under the `Depends:` key. These dependencies are installed automatically when you run `renv::restore()`.
We use multiple renv lockfiles to manage R dependencies:
- `renv.lock` is the canonical list of dependencies that are used by the core model pipeline. Any dependencies that are required to run the model itself should be defined in this lockfile.
- `renv/profiles/reporting/renv.lock` is the canonical list of dependencies that are used to generate model reports in the `finalize` step of the pipeline. Any dependencies that are required to generate reports should be defined in this lockfile.
- `renv/profiles/dev/renv.lock` is the canonical list of dependencies that are used for local development, running the `ingest`, `export`, and `api` steps of the pipeline, and building the README. These dependencies are required only by CCAO staff and are not required to run the model itself.
Our goal in maintaining multiple lockfiles is to keep the list of dependencies required to run the model as short as possible. This choice adds overhead to the process of updating R dependencies, but makes the model more maintainable over the long term.
When working on the model locally, you’ll typically want to install non-core dependencies on top of the core dependencies. To do this, run `renv::restore(lockfile = "<path_to_lockfile>")` to install all dependencies from the lockfile.
For example, if you’re working on the `ingest` stage and want to install all its dependencies, start with the main profile (run `renv::activate()`), then install the `dev` profile dependencies on top of it (run `renv::restore(lockfile = "renv/profiles/dev/renv.lock")`).
⚠️ WARNING: Installing dependencies from a dev lockfile will overwrite any existing version installed by the core one. For example, if `ggplot2@3.3.0` is installed by the core lockfile, and `ggplot2@3.2.1` is installed by the dev lockfile, renv will overwrite `ggplot2@3.3.0` with `ggplot2@3.2.1`.
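The lockfile paths above follow a regular layout, so the path for a given profile can be derived rather than typed by hand. The sketch below is an illustration only. The helper `profile_lockfile()` is hypothetical and not part of this repository or of renv.

```r
# Hypothetical helper: build the path to a profile's lockfile,
# following the layout described above (root renv.lock for the
# default profile, renv/profiles/<name>/renv.lock otherwise)
profile_lockfile <- function(profile = "default") {
  if (identical(profile, "default")) {
    return("renv.lock")
  }
  file.path("renv", "profiles", profile, "renv.lock")
}

profile_lockfile("dev")

# Interactive usage (installs packages, so it is commented out here):
# renv::activate()
# renv::restore(lockfile = profile_lockfile("dev"))
```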
The process for updating core model pipeline dependencies is straightforward:
- Add the dependency to the list of explicit dependencies under the `Depends:` key of the `DESCRIPTION` file
- Run `renv::install("<dependency_name>")`
- Run `renv::snapshot()` to update the core lockfile (the root `renv.lock`)
The process for updating **dependencies for other lockfiles** is more complex, since it requires the use of a separate profile when running renv commands. Determine the name of the profile you’d like to update (`<profile_name>` in the code that follows) and run the following commands:
- Run `renv::activate(profile = "<profile_name>")` to set the renv profile to `<profile_name>`
- Make sure that the dependency is defined in the `DESCRIPTION` file under the `Config/renv/profiles/<profile_name>/dependencies` key
- Run `renv::install("<dependency_name>")` to add or update the dependency as necessary
- Run `renv::snapshot()` to update the lockfile for `<profile_name>` with the dependencies defined in the `DESCRIPTION` file
- Run `renv::activate(profile = "default")` if you would like to switch back to the default renv profile
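Put together, the profile update steps above might look like the following in an R console. This is a sketch rather than a tested workflow. It uses the `reporting` profile as an example and keeps `<dependency_name>` as a placeholder that you must replace. The calls are wrapped in `interactive()` because they modify the project library and should not run unattended. In some IDEs, `renv::activate()` may restart the R session, in which case the remaining commands need to be run one at a time.

```r
# Example values: "reporting" is one of the profiles described above;
# replace the placeholder with the real package name before running
profile_name <- "reporting"
dependency_name <- "<dependency_name>"

if (interactive()) {
  # Switch to the profile whose lockfile should be updated
  renv::activate(profile = profile_name)

  # Before this step, add the dependency to DESCRIPTION under
  # Config/renv/profiles/<profile_name>/dependencies (manual edit)
  renv::install(dependency_name)

  # Write the profile's lockfile from the declared dependencies
  renv::snapshot()

  # Optionally switch back to the default profile
  renv::activate(profile = "default")
}
```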
The dependencies for this repository are numerous and not all of them may install correctly. Here are some common install issues (as seen in the R console) as well as their respective resolutions:
- Error: `WARNING: Rtools is required to build R packages, but is not currently installed`
  Solution: Install the latest version of Rtools from CRAN, following the instructions listed.
- Error: `DLL '<package-name>' not found: maybe not installed for this architecture?`
  Solution: Try installing the package manually with the `INSTALL_opts` flag set. See here for an example.
Distributed under the AGPL-3 License. See LICENSE for more information.
We welcome pull requests, comments, and other feedback via GitHub. For more involved collaboration or projects, please see the Developer Engagement Program documentation on our group wiki.