This Python project fetches, processes, and analyzes presidential polling data to assess the current electoral landscape. It consists of several main scripts:
- `analysis.py`: The core script responsible for data fetching, processing, and applying various weighting mechanisms to adjust poll results.
- `states.py`: A script that scrapes state-specific electoral data from both the 270 To Win website and FiveThirtyEight, enhancing the analysis with state-level insights.
- `app.py`: A Streamlit application that provides an interactive user interface for visualizing results and adjusting configuration parameters dynamically.
- `config.py`: A configuration file containing adjustable parameters that control the behavior of the analysis.
The project uses FiveThirtyEight's publicly available CSV files for both presidential polls and favorability polls. By applying a series of weights that adjust for factors such as poll quality, partisanship, sample population type, and state significance, the analysis aims to produce an adjusted polling metric that more accurately reflects the state of the presidential race.
Additionally, the project incorporates a mechanism to purge polls from specified pollsters known to potentially bias polling averages. This is managed through a `purge.json` file, allowing for the exclusion of unreliable or partisan pollsters from the analysis.
- File Structure
- Data Acquisition
- Poll Purging Mechanism
- Weighting Calculations
- 1. Time Decay Weight
- 2. Grade Weight
- 3. Transparency Weight
- 4. Sample Size Weight
- 5. Partisan Weight
- 6. Population Weight
- 7. State Rank Weight
- 8. Combining Weights
- 9. Calculating Polling Metrics
- 10. Calculating Favorability Differential
- 11. Combining Polling Metrics and Favorability Differential
- 12. Out-of-Bag (OOB) Variance Calculation
- Error Handling and Normalization
- Data Caching Mechanisms
- Configuration Options
- Conclusion
- Possible Next Steps
.
├── analysis.py
├── app.py
├── config.py
├── purge.json
├── readme.md
├── requirements.txt
├── states.py
└── streamlit
- `analysis.py`: Core script for fetching and analyzing polling data.
- `app.py`: Streamlit application providing an interactive user interface.
- `config.py`: Configuration file containing adjustable parameters.
- `purge.json`: JSON file listing pollsters to be excluded from the analysis.
- `states.py`: Script for scraping state-specific electoral data.
- `readme.md`: Comprehensive project documentation.
- `requirements.txt`: List of Python dependencies.
- `streamlit`: Directory containing additional resources for the Streamlit app.
The project relies on three primary data sources to ensure a robust and comprehensive analysis:
- Presidential Polling Data:
  - Source: FiveThirtyEight
  - Description: Provides detailed polling information for presidential candidates, including pollster ratings, sample sizes, and polling dates.
  - Method of Acquisition: Data is fetched using the Python `requests` library and loaded into a `pandas` DataFrame for analysis.
- Favorability Polling Data:
  - Source: FiveThirtyEight
  - Description: Captures favorability ratings for each candidate, which indicate broader public sentiment beyond head-to-head polling.
  - Method of Acquisition: Similar to the presidential polling data, it is fetched and processed into a `pandas` DataFrame.
- State Data:
  - Sources: 270 To Win and FiveThirtyEight
  - Description: Contains information about each state's electoral votes, political leanings, and forecasted election outcomes, essential for calculating state-specific weights.
  - Method of Acquisition: The `states.py` script scrapes and processes data from both websites to obtain up-to-date state rankings and forecasts.
Justification for Data Sources:
- FiveThirtyEight: Renowned for its rigorous methodology and comprehensive data, making it a reliable source for polling and forecast information.
- 270 To Win: Provides up-to-date and detailed electoral data, essential for state-level analysis.
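The fetch-and-load step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `timeout` value, and the sample column names are assumptions, and the URL is left as a parameter rather than hard-coded.

```python
import io

import pandas as pd


def fetch_csv_text(url: str, timeout: float = 30.0) -> str:
    """Download a CSV file and return its text (needs the requests package)."""
    import requests  # imported lazily so the parsing helper also works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return response.text


def csv_text_to_frame(csv_text: str) -> pd.DataFrame:
    """Parse CSV text into a pandas DataFrame."""
    return pd.read_csv(io.StringIO(csv_text))


# Small inline sample with hypothetical columns, standing in for a live download.
sample = "pollster,sample_size,pct\nExample Poll A,1000,48.5\nExample Poll B,800,51.0\n"
polls = csv_text_to_frame(sample)
print(polls)
```

Separating the download from the parsing keeps the parsing step testable without network access.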
To ensure the integrity of the polling analysis, the project incorporates a Poll Purging Mechanism. This feature allows the exclusion of polls from specified pollsters known to potentially bias polling averages. The mechanism is managed through a `purge.json` file and a corresponding configuration option in the Streamlit app.
- Purpose: Lists pollsters and sponsoring organizations whose polls should be excluded from the analysis to prevent manipulation of polling averages.
- Structure:

      { "invalid": [ "American Greatness", "American Pulse Research and Polling", "Bullfinch", "Daily Mail", "co/efficent", "Cygnal", "Echelon", "Emerson", "Fabrizio", "Fox News", "Hunt Research", "Insider Advantage", "J.L. Partners", "McLaughlin", "Mitchell Communications", "Napolitan Institute", "Noble Predictive", "On Message", "Orbital Digital", "Public Opinion Strategies", "Quantus", "Rasmussen", "Redfield & Wilton", "Remington", "RMG", "SoCal Data", "The Telegraph", "TIPP", "Trafalgar", "Victory Insights", "University of Austin", "The Wall Street Journal" ] }

- Implementation:
  - Located in the root directory of the project alongside `analysis.py` and `app.py`.
  - The `analysis.py` script reads this file to identify and exclude specified pollsters from the analysis.
- Feature: A checkbox labeled "Purge Polls" in the configuration sidebar of the Streamlit app.
- Functionality:
  - Checked: Activates the Poll Purging Mechanism, excluding all pollsters listed in `purge.json` from the analysis.
  - Unchecked: Includes all pollsters in the analysis, regardless of their presence in `purge.json`.
- User Guidance:
  - Description: "Check to remove pollsters who are trying to game the system."
  - Tooltip: Provides additional context about the purpose of purging polls.
- Implementation:
  - When poll purging is activated, the app logs the number of polls removed due to exclusion of invalid pollsters.
  - Provides real-time feedback to the user indicating the activation of poll purging and the impact on data processing.
- Example Feedback:
  Purging 31 pollsters from the analysis. Removed 70 polls from invalid pollsters.
Preventing Bias: Excluding polls from known partisan or unreliable pollsters ensures that the polling averages reflect a more accurate and unbiased picture of the electoral landscape.
-
Maintaining Integrity: By allowing users to control the inclusion or exclusion of specific pollsters, the analysis remains transparent and adaptable to evolving polling dynamics.
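As a rough sketch of the purge step, the snippet below reads an `invalid` list in the `purge.json` format and drops matching polls. The case-insensitive substring match, the function names, and the poll record layout are assumptions made for illustration; the pollster names and percentages are made up.

```python
import json


def load_invalid_pollsters(purge_json_text: str) -> list:
    """Return the list stored under the "invalid" key of a purge.json document."""
    return json.loads(purge_json_text)["invalid"]


def purge_polls(polls: list, invalid: list) -> tuple:
    """Drop polls whose pollster name contains any invalid entry (assumed matching rule)."""
    lowered = [name.lower() for name in invalid]
    kept = [
        poll for poll in polls
        if not any(name in poll["pollster"].lower() for name in lowered)
    ]
    return kept, len(polls) - len(kept)


# Hypothetical purge list and polls, for illustration only.
purge_text = '{"invalid": ["Pollster X", "Pollster Y"]}'
polls = [
    {"pollster": "Pollster X Research", "pct": 47.0},
    {"pollster": "Example University Poll", "pct": 49.0},
    {"pollster": "pollster y group", "pct": 46.0},
]
invalid = load_invalid_pollsters(purge_text)
kept, removed = purge_polls(polls, invalid)
print(f"Purging {len(invalid)} pollsters from the analysis. Removed {removed} polls.")
```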
To adjust raw polling data and produce a more accurate reflection of the electoral landscape, the project applies several weighting mechanisms. Each weight addresses a specific factor that can influence the reliability or relevance of a poll.
Objective: To prioritize recent polls over older ones, acknowledging that public opinion can change rapidly.
Mathematical Formulation:
The weight decreases exponentially with the age of the poll and is computed with NumPy's exponential function:
$$ W_{\text{time decay}} = e^{-\lambda t} $$
- $t$: The age of the poll in fractional days.
  $$ t = \frac{\text{Current Timestamp} - \text{Poll Timestamp}}{86400} $$
  - Justification: Using fractional days increases precision, especially for recent polls where hours can make a difference.
- $\lambda$: The decay constant, representing how quickly the weight decreases over time.
  $$ \lambda = \frac{\ln(\text{DECAY\_RATE})}{\text{HALF\_LIFE\_DAYS}} $$
  - Parameters:
    - `DECAY_RATE`: The rate at which the weight decays over the half-life period (default is `1.0`, meaning no decay, because $\ln(1) = 0$).
    - `HALF_LIFE_DAYS`: The half-life period in days (default is `14` days).
Justification for Exponential Decay:
- Exponential decay reflects the idea that the influence of a poll diminishes over time.
- The half-life parameter allows for control over how quickly this influence wanes.
- Exponential functions are continuous and smooth, providing a realistic decay model.
Implementation:
- Adjust the `DECAY_RATE` and `HALF_LIFE_DAYS` in `config.py` or via the Streamlit app to reflect the desired decay behavior.
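A minimal sketch of the time decay weight, assuming it takes the form $e^{-\lambda t}$ with $\lambda = \ln(\text{DECAY\_RATE}) / \text{HALF\_LIFE\_DAYS}$ as defined above. The function names are illustrative, and the standard library's `math` module stands in for NumPy to keep the example dependency-free.

```python
import math
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def age_in_days(now: datetime, poll_time: datetime) -> float:
    """Age of a poll in fractional days, from timezone-aware timestamps."""
    return (now - poll_time).total_seconds() / SECONDS_PER_DAY


def time_decay_weight(age_days: float, decay_rate: float = 1.0,
                      half_life_days: float = 14.0) -> float:
    """exp(-lambda * t) with lambda = ln(decay_rate) / half_life_days."""
    decay_constant = math.log(decay_rate) / half_life_days
    return math.exp(-decay_constant * age_days)


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
poll_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
age = age_in_days(now, poll_time)                        # 14 days
weight_default = time_decay_weight(age)                  # DECAY_RATE = 1.0 gives lambda = 0, so no decay
weight_halving = time_decay_weight(age, decay_rate=2.0)  # close to 0.5 after one half-life
print(age, weight_default, weight_halving)
```

Under this reading, a `DECAY_RATE` of 2.0 halves a poll's weight every `HALF_LIFE_DAYS`, while the default of 1.0 leaves all polls equally weighted regardless of age.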
Objective: To adjust poll weights based on the historical accuracy and methodological quality of the polling organizations.
Mathematical Formulation:
The normalized grade weight is calculated as:
$$ W_{\text{grade}} = \frac{\text{Numeric Grade}}{\text{Max Numeric Grade}} $$
- Numeric Grade: A numerical representation of the pollster's grade assigned by FiveThirtyEight.
- Max Numeric Grade: The highest numeric grade among all pollsters in the dataset.
Justification:
- Pollsters with higher grades have historically produced more accurate polls.
- Normalization ensures that grades are scaled between 0 and 1, allowing for consistent weighting across different datasets.
Error Handling:
- If `Max Numeric Grade` is zero (which could happen if grades are missing), a small value (`ZERO_CORRECTION = 0.0001`) is used to prevent division by zero.
Implementation:
- Ensure that the grades are properly converted to numeric values, handling any non-standard grades or missing values.
- The normalized grade is clipped to ensure it remains within the [0, 1] range using Pandas' `clip()` function.
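The normalization can be illustrated with a short, dependency-free sketch. It assumes the grades are already numeric and mirrors the zero-maximum fallback described above; the function name is hypothetical, and plain `min`/`max` stand in for Pandas' `clip()`.

```python
ZERO_CORRECTION = 0.0001


def normalize_by_max(values: list) -> list:
    """Scale values by their maximum and clip the result to [0, 1]."""
    max_value = max(values)
    if max_value == 0:
        # No usable maximum: fall back to the small constant for every entry.
        return [ZERO_CORRECTION for _ in values]
    return [min(max(value / max_value, 0.0), 1.0) for value in values]


grades = [1.5, 3.0, 0.75]  # made-up numeric grades
print(normalize_by_max(grades))
```

The same pattern would apply to any score that is normalized against its dataset maximum.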
Objective: To reward polls that are transparent about their methodologies, which can be an indicator of reliability.
Mathematical Formulation:
$$ W_{\text{transparency}} = \frac{\text{Transparency Score}}{\text{Max Transparency Score}} $$
- Transparency Score: A score provided by FiveThirtyEight that reflects the level of methodological disclosure by the pollster.
- Max Transparency Score: The highest transparency score among all polls in the dataset.
Justification:
- Transparency allows for better assessment of a poll's quality.
- Polls that disclose their methods fully are more trustworthy.
Error Handling:
- If `Max Transparency Score` is zero, `ZERO_CORRECTION` is used to prevent division by zero.
Implementation:
- Convert transparency scores to numeric values and handle any non-standard or missing values.
- Normalize and clip the transparency scores to maintain consistency.
Objective: To account for the reliability of polls based on the number of respondents.
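Mathematical Formulation:
$$ W_{\text{sample size}} = \frac{\text{Sample Size} - \text{Min Sample Size}}{\text{Max Sample Size} - \text{Min Sample Size}} $$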
- Sample Size: The number of respondents in the poll.
- Min Sample Size and Max Sample Size: The minimum and maximum sample sizes across all polls, determined using Pandas' `min()` and `max()` functions.
Justification:
- Larger sample sizes generally lead to more accurate and reliable results due to reduced sampling error.
- Normalizing the sample size ensures that weights are proportionate across the range of sample sizes.
Error Handling:
- If `Max Sample Size - Min Sample Size` is zero, `ZERO_CORRECTION` is used.
Implementation:
- Calculate the minimum and maximum sample sizes using Pandas' `min()` and `max()` functions.
- Normalize the sample sizes and handle cases where all sample sizes are identical.
Objective: To mitigate potential biases introduced by polls sponsored by partisan organizations.
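Mathematical Formulation:
$$ W_{\text{partisan}} = \begin{cases} \text{PARTISAN\_WEIGHT}[\text{True}], & \text{if the poll is partisan} \\ \text{PARTISAN\_WEIGHT}[\text{False}], & \text{otherwise} \end{cases} $$
Default Values in `config.py`:

    PARTISAN_WEIGHT = {True: 0.01, False: 1.0}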
Justification:
- Partisan polls may exhibit bias toward a particular candidate or party.
- Assigning a significantly lower weight to partisan polls reduces their impact on the overall analysis.
Implementation:
- Weights are assigned using a dictionary lookup for efficiency.
- The weight values can be adjusted in `config.py` or via the Streamlit app to reflect the desired level of influence from partisan polls.
Objective: To adjust poll weights based on the population type surveyed, reflecting the likelihood that respondents will vote.
Population Types and Weights:
| Population Type | Weight |
|---|---|
| Likely Voters (`lv`) | `POPULATION_WEIGHTS['lv'] = 1.0` |
| Registered Voters (`rv`) | `POPULATION_WEIGHTS['rv'] = 0.75` |
| Voters (`v`) | `POPULATION_WEIGHTS['v'] = 0.5` |
| Adults (`a`) | `POPULATION_WEIGHTS['a'] = 0.25` |
| All Respondents (`all`) | `POPULATION_WEIGHTS['all'] = 0.01` |
Justification:
- Likely Voters are most representative of the actual electorate, so they receive the highest weight.
- Registered Voters are somewhat less predictive, as not all registered voters turn out.
- Voters and Adults include individuals who may not be eligible or likely to vote, so they receive lower weights.
- All Respondents include the broadest population, many of whom may not be eligible or likely to vote, thus receiving the lowest weight.
Implementation:
- These weights are assigned using a dictionary and can be adjusted in `config.py` or via the Streamlit app to reflect changes in voter behavior or to conduct sensitivity analyses.
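Both the partisan and population weights reduce to dictionary lookups. The sketch below uses the default values listed above; the helper names and the fallback for unrecognized population codes are assumptions, not documented behavior.

```python
PARTISAN_WEIGHT = {True: 0.01, False: 1.0}
POPULATION_WEIGHTS = {"lv": 1.0, "rv": 0.75, "v": 0.5, "a": 0.25, "all": 0.01}


def partisan_weight(is_partisan: bool) -> float:
    """Look up the weight for a partisan or non-partisan poll."""
    return PARTISAN_WEIGHT[bool(is_partisan)]


def population_weight(population: str) -> float:
    """Look up the weight for a population code such as 'lv' or 'rv'."""
    # Assumption: unknown codes fall back to the lowest ('all') weight.
    return POPULATION_WEIGHTS.get(population.strip().lower(), POPULATION_WEIGHTS["all"])


print(partisan_weight(True), population_weight("LV"), population_weight("rv"))
```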
The analysis incorporates both state-specific rankings and special handling for national polls to ensure a comprehensive and nuanced approach to poll weighting.
Objective: To calculate a weight for each state poll based on the state's electoral importance, partisan classification, and current election forecasts, thereby prioritizing polls from significant and competitive states.
Mathematical Formulation:
The state rank for each state is calculated as a weighted sum of three components:
$$ \text{State Rank} = 0.4 \times \text{Pro Status Value} + 0.3 \times \text{Normalized Electoral Votes} + 0.3 \times \text{Forecast Weight} $$
Where:
- Pro Status Value: A numerical representation of the state's partisan lean, derived from the `pro_status` codes provided by 270 To Win.
- Normalized Electoral Votes: The state's electoral votes divided by the total electoral votes (538), representing the state's relative electoral significance.
- Forecast Weight: Based on FiveThirtyEight's forecast data, representing the closeness of the race in each state.
Components Explanation:
- Pro Status Value (40% of State Rank):
  - Derived from the state's political classification:
    - `T`: Toss-up state (0.8)
    - `D1`, `R1`: Tilt Democrat/Republican (0.6)
    - `D2`, `R2`: Lean Democrat/Republican (0.4)
    - `D3`, `R3`: Likely Democrat/Republican (0.2)
    - `D4`, `R4`: Safe Democrat/Republican (0.1)
  - Justification: Reflects the competitiveness of the state based on historical and current political leanings, with higher values for more competitive states.
- Normalized Electoral Votes (30% of State Rank):
  - Calculated as:
    $$ \text{Normalized Electoral Votes} = \frac{\text{State's Electoral Votes}}{538} $$
  - Justification: Gives more weight to states with more electoral votes, reflecting their greater potential impact on the election outcome.
- Forecast Weight (30% of State Rank):
  - Calculated as:
    $$ \text{Forecast Weight} = 1 - \left( \frac{|\text{Forecast Median}|}{100} \right) $$
    - Forecast Median: The median forecasted margin between the candidates from FiveThirtyEight's data.
  - Justification: Prioritizes states with closer races, as they are more likely to influence the election outcome.
Implementation Details:
- The `states.py` script keeps state data up to date so that it reflects current political dynamics.
- The `STATE_RANK_MULTIPLIER` in `config.py` allows for adjusting the influence of state rankings in the combined weight calculation.
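The state rank computation can be sketched as below, assuming the three components are blended with the 40/30/30 split given in the component headings. The function name and the example inputs (a toss-up state with 19 electoral votes and a 1-point forecast margin) are hypothetical.

```python
PRO_STATUS_VALUES = {
    "T": 0.8,
    "D1": 0.6, "R1": 0.6,
    "D2": 0.4, "R2": 0.4,
    "D3": 0.2, "R3": 0.2,
    "D4": 0.1, "R4": 0.1,
}
TOTAL_ELECTORAL_VOTES = 538


def state_rank(pro_status: str, electoral_votes: int, forecast_median: float) -> float:
    """Weighted sum of partisan classification, electoral votes, and forecast closeness."""
    pro_status_value = PRO_STATUS_VALUES[pro_status]
    normalized_votes = electoral_votes / TOTAL_ELECTORAL_VOTES
    forecast_weight = 1 - abs(forecast_median) / 100
    return 0.4 * pro_status_value + 0.3 * normalized_votes + 0.3 * forecast_weight


toss_up = state_rank("T", 19, 1.0)    # competitive state, close forecast
safe = state_rank("D4", 19, 25.0)     # safe state, wide forecast margin
print(round(toss_up, 4), round(safe, 4))
```

With equal electoral votes, the toss-up state ranks higher than the safe state, which is the intended ordering.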
Objective: To appropriately weight national polls in relation to state polls, recognizing their distinct nature and potential impact on the overall analysis.
Implementation:
- Identification of National Polls:
  - National polls are identified by the absence of state-specific information:

        df['is_national'] = df['state'].isnull() | (df['state'] == '')

- Special Weighting for National Polls:
  - National polls receive an additional weight adjustment:

        national_weight = config.NATIONAL_POLL_WEIGHT
        df.loc[df['is_national'], 'combined_weight'] *= national_weight

  - This allows for fine-tuning the influence of national polls relative to state polls in the overall analysis.
- Configurable National Poll Weight:
  - The `NATIONAL_POLL_WEIGHT` can be adjusted in the configuration to increase or decrease the impact of national polls:

        NATIONAL_POLL_WEIGHT = 1.0  # Adjust to increase or decrease influence of national polls

Justification:
- National polls provide a broad overview of the electoral landscape but may not capture state-specific nuances.
- This approach allows for balancing the insights from national polls with the more granular information provided by state polls.
- The configurable weight enables analysts to adjust the relative importance of national polls based on their assessment of poll reliability and relevance.
Implementation Details:
- The State Rank Weight is incorporated into the overall poll weighting as one of the factors in the combined weight calculation.
- Its influence can be adjusted using the `STATE_RANK_MULTIPLIER` in `config.py`.
- National polls are handled separately, and their weight can be adjusted using `NATIONAL_POLL_WEIGHT`.
An essential step in the analysis is to aggregate the individual weights calculated from various factors into a single Combined Weight for each poll. This combined weight determines the overall influence each poll will have on the final polling metrics.
Objective: To combine individual weights into a single weight that reflects all factors influencing poll reliability and relevance.
Methods of Combining Weights:
- Multiplicative Combination (when `HEAVY_WEIGHT = True`):
  $$ W_{\text{combined}} = \prod_{k} \left( W_k \times \text{Multiplier}_k \right) $$
  - Pros:
    - Strongly penalizes polls weak in any single criterion.
    - Emphasizes high-quality polls.
  - Cons:
    - Can overly penalize polls with minor weaknesses.
  - Implementation Note: Utilizes NumPy's `prod()` function for efficient computation.
- Additive Combination (when `HEAVY_WEIGHT = False`):
  $$ W_{\text{combined}} = \frac{\sum_{k} \left( W_k \times \text{Multiplier}_k \right)}{n} $$
  where $n$ is the number of weight components.
  - Pros:
    - Balances the influence of each weight.
    - More forgiving of polls with mixed strengths and weaknesses.
  - Cons:
    - May allow lower-quality polls to have more influence than desired.
  - Implementation Note: Utilizes NumPy's `mean()` function to calculate the average of weighted components.
Multipliers:
Multipliers adjust the influence of each individual weight:
TIME_DECAY_WEIGHT_MULTIPLIER = 1.0
SAMPLE_SIZE_WEIGHT_MULTIPLIER = 1.0
NORMALIZED_NUMERIC_GRADE_MULTIPLIER = 1.0
NORMALIZED_POLLSCORE_MULTIPLIER = 1.0
NORMALIZED_TRANSPARENCY_SCORE_MULTIPLIER = 1.0
POPULATION_WEIGHT_MULTIPLIER = 1.0
PARTISAN_WEIGHT_MULTIPLIER = 1.0
STATE_RANK_MULTIPLIER = 1.0
Implementation Steps:
- Calculate Individual Weights: Compute each weight as described in the previous sections.
- Apply Multipliers: Multiply each weight by its corresponding multiplier.
- Combine Weights: Use the chosen method (multiplicative or additive) to compute the combined weight.
- Normalization (Optional): Ensure combined weights are on a consistent scale.
Example:
If `HEAVY_WEIGHT` is set to `True`, the combined weight for a poll would be the product of all individual weights multiplied by their respective multipliers. If set to `False`, it would be the average of these weighted components.
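The two combination modes can be compared with a short sketch. The function name and the example weights are illustrative, and `math.prod` stands in for NumPy's `prod()`.

```python
import math


def combine_weights(weights: list, multipliers: list = None,
                    heavy_weight: bool = True) -> float:
    """Combine individual weights multiplicatively or as a simple average."""
    if multipliers is None:
        multipliers = [1.0] * len(weights)  # default multipliers leave weights unchanged
    scaled = [w * m for w, m in zip(weights, multipliers)]
    if heavy_weight:
        return math.prod(scaled)
    return sum(scaled) / len(scaled)


weights = [0.5, 0.8, 1.0]  # made-up individual weights for one poll
multiplicative = combine_weights(weights, heavy_weight=True)
additive = combine_weights(weights, heavy_weight=False)
print(multiplicative, additive)
```

For the same inputs the multiplicative result (about 0.4) is noticeably lower than the additive one (about 0.77), which illustrates how the multiplicative mode penalizes any single weak criterion more strongly.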
Objective: To compute an adjusted polling metric for each candidate by combining poll results with their respective combined weights.
Methodology:
- Data Filtering: Select relevant polls for the candidates within the specified time frame.
- Percentage Handling: Standardize percentage values to ensure consistency.
  - If a percentage value (`pct`) is less than or equal to 1, it is assumed to be a proportion and multiplied by 100.
- Combined Weight Calculation: Calculate the combined weight for each poll using the methods described in Combining Weights.
- Weighted Sum and Total Weights:
  - Weighted Sum: $$ \text{Weighted Sum}_c = \sum_{i \in c} W_{\text{combined}, i} \times \text{pct}_i $$
  - Total Weight: $$ \text{Total Weight}_c = \sum_{i \in c} W_{\text{combined}, i} $$
- Weighted Average: $$ \text{Weighted Average}_c = \frac{\text{Weighted Sum}_c}{\text{Total Weight}_c} $$
Margin of Error Calculation:
- Effective Sample Size: $$ n_{\text{effective}} = \sum_{i} W_{\text{combined}, i} \times n_i $$
- Margin of Error: $$ \text{Margin of Error}_c = z \times \sqrt{\frac{p(1 - p)}{n_{\text{effective}}}} \times 100\% $$
  - $p$: Proportion (Weighted Average divided by 100).
  - $z$: Z-score (default is 1.96 for 95% confidence).
Implementation:
- Utilize Pandas for efficient data filtering, aggregation, and computation.
- Handle edge cases where `Total Weight` is zero to prevent division by zero errors.
Example Calculation:
For a candidate with the following polls:
| Poll | Combined Weight ($W_{\text{combined}}$) | Percentage ($\text{pct}$) |
|---|---|---|
| 1 | 0.8 | 50 |
| 2 | 1.0 | 55 |
| 3 | 0.6 | 45 |
- Weighted Sum: $$ \text{Weighted Sum} = (0.8 \times 50) + (1.0 \times 55) + (0.6 \times 45) = 40 + 55 + 27 = 122 $$
- Total Weight: $$ \text{Total Weight} = 0.8 + 1.0 + 0.6 = 2.4 $$
- Weighted Average: $$ \text{Weighted Average} = \frac{122}{2.4} \approx 50.83\% $$
- Margin of Error: Assuming sample sizes of 1000, 800, and 1200, $n_{\text{effective}} = 0.8 \times 1000 + 1.0 \times 800 + 0.6 \times 1200 = 800 + 800 + 720 = 2320$, so $$ \text{Margin of Error} = 1.96 \times \sqrt{\frac{0.5083 \times (1 - 0.5083)}{2320}} \times 100\% \approx 1.96 \times 0.01038 \times 100\% \approx 2.03\% $$
Output:
- Weighted Average: 50.83%
- Margin of Error: ±2.03%
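The worked example can be reproduced with a few lines of plain Python. The function names and the zero-weight fallbacks are assumptions; the inputs are the three example polls with sample sizes of 1000, 800, and 1200. Evaluating the square root without intermediate rounding gives a margin of error of about 2.03%.

```python
import math


def weighted_average(weights: list, pcts: list) -> float:
    """Weighted average of poll percentages; returns 0.0 if the total weight is zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # assumed fallback to avoid division by zero
    return sum(w * p for w, p in zip(weights, pcts)) / total_weight


def margin_of_error(weights: list, pcts: list, sample_sizes: list,
                    z: float = 1.96) -> float:
    """Margin of error in percentage points, using the weighted effective sample size."""
    p = weighted_average(weights, pcts) / 100
    n_effective = sum(w * n for w, n in zip(weights, sample_sizes))
    if n_effective == 0:
        return 0.0  # assumed fallback to avoid division by zero
    return z * math.sqrt(p * (1 - p) / n_effective) * 100


weights = [0.8, 1.0, 0.6]
pcts = [50, 55, 45]
sample_sizes = [1000, 800, 1200]
average = weighted_average(weights, pcts)
moe = margin_of_error(weights, pcts, sample_sizes)
print(f"{average:.2f}% ± {moe:.2f}%")
```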
Objective: To calculate a weighted favorability differential for each candidate, reflecting net public sentiment.
Methodology:
- Data Filtering: Extract favorability polls relevant to the candidates.
- Normalization: Standardize 'favorable' and 'unfavorable' percentages.
  - If a favorability percentage is less than or equal to 1, it is assumed to be a proportion and multiplied by 100.
- Combined Weight Calculation: Calculate weights relevant to favorability data using similar methods as polling metrics.
- Weighted Favorability Differential: $$ \text{Favorability Differential}_c = \frac{\sum_{i \in c} W_{\text{combined}, i} \times \text{favorable}_i - \sum_{i \in c} W_{\text{combined}, i} \times \text{unfavorable}_i}{\sum_{i \in c} W_{\text{combined}, i}} $$
Implementation:
- Utilize Pandas for efficient data manipulation and aggregation.
- Handle cases where `Total Weight` is zero to prevent division by zero errors.
Example Calculation:
For a candidate with the following favorability polls:
| Poll | Combined Weight ($W_{\text{combined}}$) | Favorable ($\text{favorable}$) | Unfavorable ($\text{unfavorable}$) |
|---|---|---|---|
| 1 | 0.9 | 60 | 30 |
| 2 | 1.1 | 55 | 35 |
| 3 | 0.7 | 50 | 40 |
- Weighted Favorable Sum: $$ (0.9 \times 60) + (1.1 \times 55) + (0.7 \times 50) = 54 + 60.5 + 35 = 149.5 $$
- Weighted Unfavorable Sum: $$ (0.9 \times 30) + (1.1 \times 35) + (0.7 \times 40) = 27 + 38.5 + 28 = 93.5 $$
- Total Weight: $$ \text{Total Weight} = 0.9 + 1.1 + 0.7 = 2.7 $$
- Favorability Differential: $$ \text{Favorability Differential} = \frac{149.5 - 93.5}{2.7} = \frac{56}{2.7} \approx 20.74\% $$
Output:
- Favorability Differential: +20.74%
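A minimal sketch that reproduces the favorability example above. The function name and the zero-weight fallback are assumptions.

```python
def favorability_differential(weights: list, favorable: list,
                              unfavorable: list) -> float:
    """Weighted favorable minus weighted unfavorable, divided by the total weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # assumed fallback to avoid division by zero
    weighted_favorable = sum(w * f for w, f in zip(weights, favorable))
    weighted_unfavorable = sum(w * u for w, u in zip(weights, unfavorable))
    return (weighted_favorable - weighted_unfavorable) / total_weight


differential = favorability_differential(
    weights=[0.9, 1.1, 0.7],
    favorable=[60, 55, 50],
    unfavorable=[30, 35, 40],
)
print(f"{differential:+.2f}%")
```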
Enhancements:
- Net Favorability Score: Incorporate both favorable and unfavorable responses to provide a net favorability score:
  $$ \text{Net Favorability} = \frac{\text{Weighted Favorable} - \text{Weighted Unfavorable}}{\text{Total Weight}} $$
Implementation Note:
- Ensure that both `favorable` and `unfavorable` are on comparable scales before combining. If one metric inherently has a larger range or different distribution, consider normalizing them to prevent one from dominating the other.
Objective: To produce a final adjusted result by blending the weighted polling metrics with the favorability differential.
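Mathematical Formulation:
$$ \text{Combined Result} = (1 - \alpha) \times \text{Polling Metric} + \alpha \times \text{Favorability Differential} $$
- $\alpha$: Favorability Weight (default is `0.15`).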
Implementation:
- Adjust the `FAVORABILITY_WEIGHT` in `config.py` or via the Streamlit app.
- Compute the final result for each candidate using the formula above.
- Utilize Pandas operations to merge and compute these metrics across the dataset efficiently.
Example Calculation:
Given:
- Polling Metric: 50.83%
- Favorability Differential: +20.74%
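- Favorability Weight ($\alpha$): 0.15
The Combined Result would be:
$$ \text{Combined Result} = (1 - 0.15) \times 50.83 + 0.15 \times 20.74 = 43.2055 + 3.111 \approx 46.32\% $$
Output:
- Combined Result: 46.32%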
Implementation Note:
- Ensure that both `polling_score` and `favorability` are on comparable scales before combining. If one metric inherently has a larger range or different distribution, consider normalizing them to prevent one from dominating the other.
Objective: To estimate the variance associated with the polling metrics using a Random Forest model.
Methodology:
- Random Forest Model: Utilize the `RandomForestRegressor` with `oob_score=True`.
- Pipeline Components:
  - Imputation:
    - Strategy: `SimpleImputer` with a median strategy to handle missing data.
    - Function: Ensures that all feature columns are free of missing values before model training.
  - Model Training:
    - Model: `RandomForestRegressor` with specified parameters from `config.py`.
    - Parameters: `N_TREES = 1000`, `RANDOM_STATE = 42`
- OOB Variance:
  $$ \sigma_{\text{OOB}}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i^{\text{OOB}} \right)^2 $$
  - $y_i$: Actual value.
  - $\hat{y}_i^{\text{OOB}}$: OOB prediction.
Justification:
- The OOB error estimates the model's prediction error on data each tree did not see during training, without requiring a separate validation set.
- Enhances the reliability of the analysis by quantifying uncertainty.
Implementation:
- Data Preparation:
  - Combine polling and favorability data into a single DataFrame.
  - Select relevant feature columns based on availability.
- Target Variable (`y`):
  - Use `'pct'` from polling data and `'favorable'` from favorability data as the target.
  - Ensure that these metrics are comparable or consider handling them separately.
- Pipeline Execution:
  - Implement a `Pipeline` that first imputes missing data and then fits the Random Forest model.
- Variance Calculation:
  - After fitting, extract the OOB predictions and calculate the variance between actual values and OOB predictions.
- Error Handling:
  - If feature columns are missing or data is insufficient, log appropriate warnings and return a default variance value.
Example Calculation:
Suppose after model training:
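- Actual Values ($y$): [52, 48, 50, 51, 49]
- OOB Predictions ($\hat{y}^{\text{OOB}}$): [51, 49, 50, 50, 50]
Then:
$$ \sigma_{\text{OOB}}^2 = \frac{(52 - 51)^2 + (48 - 49)^2 + (50 - 50)^2 + (51 - 50)^2 + (49 - 50)^2}{5} = \frac{1 + 1 + 0 + 1 + 1}{5} = 0.8 $$
Output:
- OOB Variance: 0.8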
Implementation Note:
- The variance is reported as a single numerical value representing the average squared difference between actual and predicted values, providing insight into the model's prediction accuracy.
- Ensure that the target variables are compatible when combining `'pct'` and `'favorable'`. If they represent different constructs, consider training separate models or standardizing them appropriately.
To ensure mathematical integrity and robustness, the project includes comprehensive error handling and data normalization procedures.
Key Strategies:
- Division by Zero Prevention:
  - Method: Use of a small constant (`ZERO_CORRECTION = 0.0001`) to prevent division by zero in weight calculations.
  - Implementation: Applied in scenarios where the denominator could potentially be zero, such as when normalizing weights.
- Missing Data Handling:
  - Method: Assign default values or exclude data points with missing critical information.
  - Implementation: Utilizes Pandas functions like `dropna()` and `fillna()` to manage missing data effectively.
- Percentage Interpretation:
  - Method: Adjust percentages that are likely misformatted (e.g., values less than or equal to 1) by multiplying them by 100 to convert proportions to percentages.
  - Implementation: Applies lambda functions within Pandas `apply()` methods to standardize percentage values.
- Time Calculations:
  - Method: Utilize timezone-aware timestamps and fractional days to accurately compute time-related weights.
  - Implementation: Employs Pandas' `to_datetime()` with UTC time zones and calculates time differences in fractional days for precision.
Justification:
- These measures prevent computational errors and ensure that the analysis remains accurate and reliable.
- Proper handling of data anomalies enhances the robustness of the results.
Implementation Example:

    # Prevent division by zero by using ZERO_CORRECTION
    if max_numeric_grade != 0:
        df['normalized_numeric_grade'] = df['numeric_grade'] / max_numeric_grade
    else:
        df['normalized_numeric_grade'] = config.ZERO_CORRECTION
    # Normalize and clip to [0, 1]
    df['normalized_numeric_grade'] = df['normalized_numeric_grade'].clip(0, 1)

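The percentage interpretation rule can be illustrated in the same spirit. This is a sketch of the "less than or equal to 1 means proportion" heuristic only; the function name is hypothetical.

```python
def standardize_pct(value: float) -> float:
    """Treat values <= 1 as proportions and convert them to percentages."""
    return value * 100 if value <= 1 else value


print(standardize_pct(0.48), standardize_pct(52.5), standardize_pct(1))
```

The heuristic has an edge case worth noting: a genuine result of 1% or less would also be scaled up, so it is only safe when such values are not expected in the data.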
To improve performance and user experience, the project implements data caching.
Features:
- Caching Data Files:
  - Purpose: Processed data is saved locally, reducing the need to re-fetch and re-process data on each run.
  - Implementation: Utilizes local CSV files to store processed polling and favorability data.
- Configuration Cache:
  - Purpose: User settings are cached to maintain consistency across sessions.
  - Implementation: Stores configuration parameters in a JSON file to persist user-defined settings.
- Force Refresh Option:
  - Purpose: Users can clear caches and refresh data to incorporate the latest information or configuration changes.
  - Implementation: Provides a checkbox in the Streamlit app to force data refresh, bypassing cached data.
Justification:
- Enhances performance, especially when dealing with large datasets or complex computations.
- Provides flexibility for users to control when data and settings are refreshed.
Implementation Details:
- Cache Files:
  - `sufficient_data.csv`: Stores the processed polling data that meets the minimum sample requirements.
  - `config.json`: Stores the current configuration parameters.
  - `results_df.csv`: Stores the results of the analysis across different periods.
- Caching Strategy:
  - On initial run, data is fetched, processed, and cached.
  - On subsequent runs, cached data is loaded if available and configuration parameters haven't changed.
  - If `force_refresh` is enabled, caches are cleared, and data is re-fetched and processed.
Implementation Example:

    import streamlit as st

    import config

    def load_and_process_data(config_vars, force_refresh=False):
        # The cache helpers, get_analysis_results, preprocess_data, and
        # invalid_pollsters are expected to be defined elsewhere in the app.
        cached_data = load_cached_data()
        cached_results = load_cached_results_df()
        cached_config = load_cached_config()
        # Reuse the cache only if no refresh was requested and the config is unchanged
        if not force_refresh and cached_data is not None and cached_config == config_vars:
            st.info("Using cached data.")
            sufficient_data_df = cached_data
            results_df = cached_results
            return sufficient_data_df, results_df
        try:
            # Update config with user-defined values
            for key, value in config_vars.items():
                setattr(config, key, value)
            results_df = get_analysis_results(invalid_pollsters)
            sufficient_data_df = preprocess_data(results_df)
            # Persist the fresh results and the config that produced them
            save_cached_data(sufficient_data_df)
            save_cached_results_df(results_df)
            save_cached_config(config_vars)
            return sufficient_data_df, results_df
        except Exception as e:
            st.error(f"An error occurred while processing data: {e}")
            st.stop()

All configuration parameters are centralized in `config.py` and can be adjusted via the Streamlit app or directly in the file.
Key Parameters:
- Candidates to Analyze:

      CANDIDATE_NAMES = ['Kamala Harris', 'Donald Trump']

- Weight Multipliers:

      TIME_DECAY_WEIGHT_MULTIPLIER = 1.0
      SAMPLE_SIZE_WEIGHT_MULTIPLIER = 1.0
      NORMALIZED_NUMERIC_GRADE_MULTIPLIER = 1.0
      NORMALIZED_POLLSCORE_MULTIPLIER = 1.0
      NORMALIZED_TRANSPARENCY_SCORE_MULTIPLIER = 1.0
      POPULATION_WEIGHT_MULTIPLIER = 1.0
      PARTISAN_WEIGHT_MULTIPLIER = 1.0
      STATE_RANK_MULTIPLIER = 1.0
      NATIONAL_POLL_WEIGHT = 1.0  # Adjust to increase or decrease influence of national polls

- Favorability Weight:

      FAVORABILITY_WEIGHT = 0.15

- Weighting Strategy:

      HEAVY_WEIGHT = True  # True for multiplicative, False for additive

- Time Decay Parameters:

      DECAY_RATE = 1.0
      HALF_LIFE_DAYS = 14

- Minimum Samples Required:

      MIN_SAMPLES_REQUIRED = 4

- Partisan and Population Weights:

      PARTISAN_WEIGHT = {True: 0.01, False: 1.0}
      POPULATION_WEIGHTS = {
          'lv': 1.0,
          'rv': 0.75,
          'v': 0.5,
          'a': 0.25,
          'all': 0.01
      }

- Random Forest Parameters:

      N_TREES = 1000
      RANDOM_STATE = 42

Adjusting Configuration:
- Via `config.py`:
  - Directly edit the `config.py` file to set desired values.
- Via Streamlit App:
  - The Streamlit interface provides sliders and input fields to adjust configuration parameters dynamically.
  - Changes made through the app are saved and cached, ensuring consistency across sessions.
Implementation Example:

    import streamlit as st

    import config

    def configuration_form():
        with st.sidebar:
            # ... (UI components)
            with st.form("config_form"):
                favorability_weight = st.slider("Favorability Weight", 0.01, 1.0, float(config.FAVORABILITY_WEIGHT), 0.01)
                heavy_weight = st.checkbox("Heavy Weight", config.HEAVY_WEIGHT)
                purge_polls = st.checkbox("Purge Polls", config.PURGE_POLLS)
                # ... (additional configuration inputs)
                submitted = st.form_submit_button("Apply Changes and Run Analysis")
            # Return the new settings only when the form was submitted
            if submitted:
                return {
                    "FAVORABILITY_WEIGHT": favorability_weight,
                    "HEAVY_WEIGHT": heavy_weight,
                    "PURGE_POLLS": purge_polls,
                    # ... (additional configuration parameters)
                }
        return None

Justification:
- Centralizing configuration parameters allows for easy adjustments and experimentation.
- Providing both file-based and UI-based configuration options caters to different user preferences and workflows.
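The dictionary returned by the form can then be copied onto the configuration object with setattr, as the data-processing code does. The following is a minimal sketch of that step; the `apply_config` helper and the SimpleNamespace stand-in for the config module (including its starting values) are assumptions for illustration only.

```python
from types import SimpleNamespace

# Stand-in for the real config module; values are illustrative.
config = SimpleNamespace(FAVORABILITY_WEIGHT=0.15, HEAVY_WEIGHT=True, PURGE_POLLS=True)


def apply_config(config_obj, config_vars):
    """Copy submitted values onto the config object (hypothetical helper).

    Returns False when nothing was submitted, True otherwise.
    """
    if config_vars is None:  # the form returns None until it is submitted
        return False
    for key, value in config_vars.items():
        setattr(config_obj, key, value)
    return True


changed = apply_config(config, {"FAVORABILITY_WEIGHT": 0.25, "HEAVY_WEIGHT": False})
unchanged = apply_config(config, None)
```

Keys that are not part of the submitted dictionary, such as PURGE_POLLS here, keep their previous values.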
By integrating multiple data sources and applying a comprehensive set of weighting factors, including the enhanced State Rank Weight that incorporates current forecasts and the Poll Purging Mechanism that excludes unreliable pollsters, this project offers a detailed analysis of presidential polling data. Accounting for pollster quality, sample size, partisanship, population type, and state significance is intended to make the adjusted poll results a more realistic reflection of the electoral landscape.
Key Strengths:
- Robust Methodology: The use of mathematical models and justifiable weighting mechanisms enhances the credibility of the analysis.
- Incorporation of Current Forecasts: By integrating FiveThirtyEight's forecast data into the State Rank Weight, the model stays updated with the latest electoral dynamics.
- Poll Purging Mechanism: Excluding polls from specified pollsters reduces the risk that unreliable or partisan polls skew the polling averages.
- Customizability: Users can adjust parameters to explore different analytical perspectives or to align with specific research questions.
- Interactivity: The Streamlit app provides a user-friendly interface, making the analysis accessible to a broader audience.
Impact of the Weighting Choices:
- Each weighting factor addresses a specific aspect that can influence poll accuracy or relevance.
- The mathematical formulations are designed to be fair and justifiable, based on statistical principles and practical considerations.
- By providing transparency in the weighting mechanisms, users can understand and trust the adjustments made to the raw polling data.
To further enhance the project, several avenues can be explored:
- Sensitivity Analysis:
  - Objective: Assess how changes in weight assignments and parameter values affect the final results.
  - Method: Systematically vary one parameter at a time while keeping others constant.
  - Justification: Helps identify which factors have the most significant impact and ensures robustness.
- Incorporation of Additional Data Sources:
  - Objective: Enhance the comprehensiveness and robustness of the analysis by integrating more polling data.
  - Method: Fetch and process data from other reputable sources, ensuring proper alignment with existing datasets.
  - Justification: Diversifies the data pool and reduces potential biases from a single source.
- Advanced Modeling Techniques:
  - Objective: Capture more complex patterns and relationships in the data.
  - Method: Implement machine learning models such as Gradient Boosting Machines, Neural Networks, or Bayesian models.
  - Justification: May improve predictive accuracy and provide deeper insights.
- Uncertainty Quantification:
  - Objective: Provide more nuanced estimates of the uncertainty associated with predictions.
  - Method: Use techniques like bootstrap resampling, Bayesian credible intervals, or probabilistic models.
  - Justification: Enhances the interpretation of results, especially for decision-making purposes.
- User Interface and Visualization Enhancements:
  - Objective: Improve the accessibility and interpretability of the analysis.
  - Method: Add interactive charts, maps, and explanatory texts to the Streamlit app.
  - Justification: Makes the analysis more engaging and easier to understand for non-technical users.
- Sophisticated Stratification Frame Construction:
  - Objective: Enhance the representativeness of the sample by merging disparate data sources.
  - Method: Integrate demographic and socioeconomic data to create a more complete stratification frame.
  - Justification: Improves the accuracy of weight adjustments and the generalizability of results.
- Integration with Multiple Forecasting Models:
  - Objective: Improve predictive performance by combining forecasts.
  - Method: Develop an ensemble method that averages or weights forecasts from multiple models.
  - Justification: Leverages the strengths of different models and mitigates individual weaknesses.
- Benchmarking Turnout Modeling Strategies:
  - Objective: Evaluate and compare different approaches to modeling voter turnout.
  - Method: Implement alternative turnout models and assess their impact on results.
  - Justification: Ensures that the chosen approach is the most appropriate for the data and context.
- Documentation and Reporting:
  - Objective: Maintain clear and up-to-date documentation to facilitate collaboration and transparency.
  - Method: Regularly update the readme and other documentation files to reflect new methodologies and findings.
  - Justification: Enhances reproducibility and fosters community engagement.
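The one-parameter-at-a-time approach described under Sensitivity Analysis could be sketched as follows. The `run_analysis` function is a toy stand-in that blends two made-up margins, and `one_at_a_time` is a hypothetical helper; neither is part of the project, and the numbers are illustrative rather than real results.

```python
def run_analysis(params):
    """Toy stand-in for the real analysis: blends two made-up margins."""
    w = params["FAVORABILITY_WEIGHT"]
    return params["poll_margin"] * (1 - w) + params["fav_margin"] * w


def one_at_a_time(baseline, name, values, analysis=run_analysis):
    """Vary a single parameter while holding the others at baseline."""
    results = {}
    for value in values:
        params = dict(baseline, **{name: value})  # copy, then override one key
        results[value] = analysis(params)
    return results


# Illustrative inputs only.
baseline = {"poll_margin": 2.0, "fav_margin": -4.0, "FAVORABILITY_WEIGHT": 0.15}
sweep = one_at_a_time(baseline, "FAVORABILITY_WEIGHT", [0.0, 0.5, 1.0])
```

Because each run starts from a copy of the baseline, the baseline itself is never modified, which keeps successive sweeps over different parameters comparable.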
Practical Steps:
- Data Preparation: Acquire and preprocess new data sources, ensuring compatibility with existing structures.
- Model Development: Experiment with and implement advanced algorithms, testing their performance.
- Evaluation Framework: Establish clear metrics and validation procedures to assess improvements.
- Iterative Testing: Use cross-validation and other techniques to refine models and prevent overfitting.
- Community Engagement: Encourage feedback and contributions from other analysts and stakeholders.
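As one possible starting point for the bootstrap resampling mentioned under Uncertainty Quantification, the sketch below computes a percentile bootstrap interval for a simple mean. The function name, the unweighted mean, and the input values are assumptions for illustration; a real implementation would likely resample weighted polls.

```python
import random
import statistics


def bootstrap_interval(values, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


margins = [1.5, 2.0, -0.5, 3.0, 1.0, 2.5]  # made-up poll margins
low, high = bootstrap_interval(margins)
```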
Ensure you have the following installed:
- Python 3.8 or higher
- pip package manager
- Clone the Repository:
  gh repo clone spencerthayer/2024-Election-Polling-Analysis
  cd 2024-Election-Polling-Analysis
- Install Dependencies:
  pip install -r requirements.txt
- Create purge.json:
  - Place a purge.json file in the root directory of the project alongside analysis.py and app.py.
  - The file should contain a list of pollsters to exclude from the analysis.
- Example Structure:
  {
    "invalid": [
      "American Greatness",
      "American Pulse Research and Polling",
      "Bullfinch",
      "Daily Mail",
      "co/efficent",
      "Cygnal",
      "Echelon",
      "Emerson",
      "Fabrizio",
      "Fox News",
      ...
      "Victory Insights",
      "University of Austin",
      "The Wall Street Journal"
    ]
  }
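A file with this structure could be read and applied roughly as follows. This is a minimal sketch: the function names, the `pollster` key on each poll record, and the exact-name matching are assumptions, and the project's own purge logic may differ.

```python
import json


def load_invalid_pollsters(path="purge.json"):
    """Return the set of pollster names listed under "invalid"."""
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh).get("invalid", []))


def purge_polls(polls, invalid_pollsters):
    """Drop polls whose pollster is on the purge list (exact name match)."""
    return [p for p in polls if p["pollster"] not in invalid_pollsters]


# "Example Pollster" is a placeholder name, not a real entry.
polls = [{"pollster": "Cygnal"}, {"pollster": "Example Pollster"}]
kept = purge_polls(polls, {"Cygnal", "Emerson"})
```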
- Execute the Core Analysis:
  python analysis.py
  - This will fetch the latest polling and state data, perform the analysis, and output results to the console.
- Launch the Streamlit App:
  streamlit run app.py
  - This will open the interactive user interface in your default web browser.
- Adjusting Parameters:
  - Modify config.py to change default parameters.
  - Alternatively, use the Streamlit app to dynamically adjust settings via the sidebar.
- Cached Data Files:
  - Located in the data directory.
  - Includes sufficient_data.csv, config.json, and results_df.csv.
- Clearing Cache:
  - Use the "Force Refresh Data" option in the Streamlit app or manually delete cache files from the data directory.
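For the manual route, a small script along these lines could remove the cache files. The file names come from the list of cached data files; the `clear_cache` helper and the default directory argument are assumptions for this sketch.

```python
from pathlib import Path

# Cache file names as listed for the data directory.
CACHE_FILES = ("sufficient_data.csv", "config.json", "results_df.csv")


def clear_cache(data_dir="data"):
    """Delete known cache files if present and return the names removed."""
    removed = []
    for name in CACHE_FILES:
        path = Path(data_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling `clear_cache()` from the project root would remove whichever of the three files exist and leave any other files in the directory untouched.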
- Project Documentation:
  - Comprehensive details available in this readme.md file.
- Support:
  - For issues or feature requests, please open an issue on the GitHub repository.
This project is licensed under the MIT License.
- FiveThirtyEight: For providing reliable and comprehensive polling and favorability data.
- 270 To Win: For offering detailed state-specific electoral data.
For any inquiries or feedback, please contact polling@spencerthayer.com.