Major Power Outage Risk Analysis
Name: Spencer Goodwin
Website Link: https://spencerg00dwin.github.io/power_outages_analysis/
Introduction
In this project, I analyzed a dataset of major power outages in the U.S. to explore patterns, causes, and impacts of these events. The dataset spans January 2000 to July 2016 and includes key attributes such as outage durations, causes, geographical regions, climate anomalies, and population data.
At first, I focused on cleaning the dataset and conducting exploratory data analysis to identify trends and anomalies. I looked specifically at how climate trends relate to outage durations and their averages. Visualizations such as the average outage duration by cause category provided insight into which causes have the greatest impact.
My model predicts outage durations based on both categorical and numerical features, such as climate region, cause category, the El Niño/La Niña (ONI) anomaly index, and population. I started with a baseline Random Forest Regressor and later refined the model using hyperparameter tuning and feature selection techniques. I used evaluation metrics such as Mean Squared Error (MSE) and r-squared to measure the performance of my model.
The original raw DataFrame contains 1534 rows, corresponding to 1534 outages, and 57 columns. I focused on these columns:
Column Name | Definition |
---|---|
OBS | Observation number or row identifier in the dataset |
YEAR | Indicates the year when the outage event occurred |
MONTH | Indicates the month when the outage event occurred |
U.S._STATE | Represents all the states in the continental U.S. |
POSTAL.CODE | Represents the postal code of the U.S. states |
NERC.REGION | The North American Electric Reliability Corporation (NERC) regions involved in the outage event |
CLIMATE.REGION | U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.) |
ANOMALY.LEVEL | Represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W) |
CLIMATE.CATEGORY | Represents the climate episodes corresponding to the years. The categories—"Warm", "Cold" or "Normal" episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI) |
CAUSE.CATEGORY | Categories of all the events causing the major power outages |
CAUSE.CATEGORY.DETAIL | Detailed description of the event categories causing the major power outages |
OUTAGE.DURATION | Duration of outage events (in minutes) |
DEMAND.LOSS.MW | Amount of peak demand lost during an outage event (in Megawatt) [but in many cases, total demand is reported] |
POPULATION | Population in the U.S. state in a year |
OUTAGE.START | Combines the date and time when the outage event started (as reported by the corresponding Utility in the region) |
OUTAGE.RESTORATION | Combines the date and time when power was restored to all the customers (as reported by the corresponding Utility in the region) |
Cleaning and Exploratory Data Analysis
Cleaning
- First, I needed to clean the data. I started by dropping several columns that were unnecessary for my model. Specifically: 'PCT_WATER_INLAND', 'PCT_WATER_TOT', 'PCT_LAND', 'AREAPCT_UC', 'AREAPCT_URBAN', 'POPDEN_RURAL', 'POPDEN_UC', 'POPDEN_URBAN', 'POPPCT_URBAN', 'PI.UTIL.OFUSA', 'UTIL.CONTRI', 'TOTAL.REALGSP', 'UTIL.REALGSP', 'PC.REALGSP.CHANGE', 'PC.REALGSP.REL', 'PC.REALGSP.USA', 'PC.REALGSP.STATE', 'IND.CUST.PCT', 'COM.CUST.PCT', 'RES.CUST.PCT', 'TOTAL.CUSTOMERS', 'IND.CUSTOMERS', 'COM.CUSTOMERS', 'RES.CUSTOMERS', 'IND.PERCEN', 'COM.PERCEN', 'RES.PERCEN', 'TOTAL.SALES', 'IND.SALES', 'COM.SALES', 'RES.SALES', 'TOTAL.PRICE', 'IND.PRICE', 'COM.PRICE', 'RES.PRICE', 'HURRICANE.NAMES', 'variables', 'POPPCT_UC', 'CUSTOMERS.AFFECTED'.
- I then built the 'OUTAGE.START' and 'OUTAGE.RESTORATION' columns as datetime objects. This was done by combining OUTAGE.START.DATE and OUTAGE.START.TIME into a single column, and OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into another.
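The date and time merge can be sketched roughly as follows. This is a minimal illustration, not the exact cleaning code: it assumes the raw date and time columns hold strings, the helper name `combine_date_time` is hypothetical, and the two-row sample is made up apart from the first row, which mirrors the first row of the cleaned table.

```python
import pandas as pd

# Hypothetical sample: row 1 mirrors the first cleaned row shown in the report,
# row 2 stands in for an outage with no recorded timestamps.
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", None],
    "OUTAGE.START.TIME": ["17:00:00", None],
    "OUTAGE.RESTORATION.DATE": ["2011-07-03", None],
    "OUTAGE.RESTORATION.TIME": ["20:00:00", None],
})


def combine_date_time(df, date_col, time_col):
    """Join a date column and a time column into one datetime Series."""
    combined = df[date_col].astype(str) + " " + df[time_col].astype(str)
    # errors="coerce" turns rows that cannot be parsed (missing date/time) into NaT.
    return pd.to_datetime(combined, errors="coerce")


cleaned = pd.DataFrame({
    "OUTAGE.START": combine_date_time(raw, "OUTAGE.START.DATE", "OUTAGE.START.TIME"),
    "OUTAGE.RESTORATION": combine_date_time(
        raw, "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME"
    ),
})

# Elapsed time in minutes, for comparison with OUTAGE.DURATION.
elapsed_minutes = (
    cleaned["OUTAGE.RESTORATION"] - cleaned["OUTAGE.START"]
).dt.total_seconds() / 60

print(cleaned)
print(elapsed_minutes)
```

For the first sample row, the elapsed time is 3060 minutes, which matches the OUTAGE.DURATION value listed for that outage.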
Here are the first five rows of my cleaned DataFrame:
OBS | YEAR | MONTH | U.S. STATE | POSTAL CODE | NERC REGION | CLIMATE REGION | ANOMALY LEVEL | CLIMATE CATEGORY | CAUSE CATEGORY | CAUSE CATEGORY DETAIL | OUTAGE DURATION | DEMAND LOSS MW | POPULATION | OUTAGE START | OUTAGE RESTORATION |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2011 | 7 | Minnesota | MN | MRO | East North Central | -0.3 | normal | severe weather | nan | 3060 | nan | 5348119 | 2011-07-01 17:00:00 | 2011-07-03 20:00:00 |
2 | 2014 | 5 | Minnesota | MN | MRO | East North Central | -0.1 | normal | intentional attack | vandalism | 1 | nan | 5457125 | 2014-05-11 18:38:00 | 2014-05-11 18:39:00 |
3 | 2010 | 10 | Minnesota | MN | MRO | East North Central | -1.5 | cold | severe weather | heavy wind | 3000 | nan | 5310903 | 2010-10-26 20:00:00 | 2010-10-28 22:00:00 |
4 | 2012 | 6 | Minnesota | MN | MRO | East North Central | -0.1 | normal | severe weather | thunderstorm | 2550 | nan | 5380443 | 2012-06-19 04:30:00 | 2012-06-20 23:00:00 |
5 | 2015 | 7 | Minnesota | MN | MRO | East North Central | 1.2 | warm | severe weather | nan | 1740 | 250 | 5489594 | 2015-07-18 02:00:00 | 2015-07-19 07:00:00 |
Exploratory Data Analysis
Univariate
This bar chart visualizes the average power outage duration for each cause category.
This bar chart displays the number of power outages across different climate regions.
This pie chart illustrates the proportion of power outages attributed to each climate category.
Bivariate
This pivot table shows the average outage duration (in minutes) across climate regions and cause categories. Certain combinations, such as East North Central with specific causes, have far higher average durations than the rest, which points to regional and causal dependencies in outage length. This variability underscores the importance of tailoring mitigation strategies to both the climate region and the underlying cause of an outage.
Climate Region | East North Central | Northeast | Central | Northwest | South | Southeast | Southwest | West | West North Central | Overall Average |
---|---|---|---|---|---|---|---|---|---|---|
Average Outage Duration | 26,435 | 216 | 322 | 702 | 296 | 554 | 114 | 525 | 61 | 1,817 |
Another Measure | 33,971 | 14,630 | 10,035 | 1 | 17,482 | NaN | 76 | 6,155 | NaN | 13,484 |
Third Measure | 2,376 | 196 | 346 | 374 | 326 | 505 | 266 | 858 | 24 | 430 |
Fourth Measure | 1 | 881 | 125 | 73 | 494 | NaN | 2 | 215 | 68 | 201 |
Fifth Measure | 733 | 2,655 | 1,410 | 898 | 1,164 | 2,865 | 2,275 | 2,028 | 440 | 1,468 |
Sixth Measure | 4,435 | 4,430 | 3,250 | 4,838 | 4,391 | 2,663 | 11,573 | 2,928 | 2,442 | 3,900 |
Seventh Measure | 2,610 | 774 | 2,695 | 141 | 866 | 169 | 329 | 364 | NaN | 733 |
Eighth Measure | 5,352 | 2,992 | 2,701 | 1,284 | 2,846 | 2,218 | 1,566 | 1,628 | 697 | 2,631 |
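A table of this kind can be produced with pandas `pivot_table`. The sketch below is only an illustration under assumptions: the four-row DataFrame is made-up toy data, and the choice of `margins=True` to get an overall-average row and column is a guess at how the averages were added.

```python
import pandas as pd

# Hypothetical toy data; these durations are illustrative, not from the real dataset.
outages = pd.DataFrame({
    "CLIMATE.REGION": ["Northeast", "Northeast", "South", "South"],
    "CAUSE.CATEGORY": ["severe weather", "severe weather", "severe weather", "islanding"],
    "OUTAGE.DURATION": [1000.0, 3000.0, 500.0, 100.0],
})

# Mean duration for each (cause, region) pair; combinations that never occur stay NaN.
pivot = outages.pivot_table(
    index="CAUSE.CATEGORY",
    columns="CLIMATE.REGION",
    values="OUTAGE.DURATION",
    aggfunc="mean",
    margins=True,                    # adds an overall-average row and column
    margins_name="Overall Average",
)

print(pivot.round(0))
```

NaN cells mean that no outage with that cause was recorded in that region, not that the average duration was zero.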
Imputation
I used probabilistic imputation to fill in missing values for OUTAGE.DURATION. This technique randomly samples from the observed values in the column to replace missing values, preserving the variability and distribution of the original data.
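A minimal sketch of this kind of imputation is shown below. The function name `impute_by_sampling`, the fixed seed, and the sample values are assumptions for illustration and are not taken from the project code.

```python
import numpy as np
import pandas as pd


def impute_by_sampling(series, seed=0):
    """Fill missing values by sampling, with replacement, from the observed values."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    filled = series.copy()
    missing = filled.isna()
    # Draw one observed value for every missing entry.
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled


# Hypothetical durations in minutes, with two missing entries.
durations = pd.Series([3060.0, np.nan, 3000.0, np.nan, 1740.0], name="OUTAGE.DURATION")
imputed = impute_by_sampling(durations)
print(imputed)
```

Because every filled value is drawn from the observed data, the imputed column keeps roughly the same distribution as the original, unlike mean imputation, which would pile all the missing entries onto a single value.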
Framing
My prediction problem is to estimate outage duration (in minutes) based on pre-outage conditions, making this a regression task. The response variable OUTAGE.DURATION was chosen because it directly measures the severity and impact of a power outage, which is critical for emergency planning. I evaluated my model using Mean Squared Error (MSE), as it penalizes large errors more heavily than alternatives like Mean Absolute Error (MAE). This matters because large prediction errors for outage durations can lead to inadequate preparation or to overestimating the resources needed.
I trained the model only on features available at the time of prediction, such as climate region, cause category, population data, and the El Niño/La Niña (ONI) anomaly index. To avoid data leakage, I excluded any information that only becomes available after the outage begins, such as restoration times. By selecting these features, I aimed to build a model that can provide actionable predictions based on information known before an outage occurs. My approach started with a baseline Random Forest Regressor and improved through hyperparameter tuning and feature engineering, with a focus on maximizing accuracy while maintaining interpretability.
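The difference between the two metrics can be seen in a small worked example. The numbers below are hypothetical durations in minutes, chosen so that both sets of predictions have the same MAE but very different MSE.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


def mae(y_true, y_pred):
    """Mean Absolute Error: the average of the absolute residuals."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


actual = [100, 100, 100, 100]   # hypothetical true durations (minutes)
spread = [150, 150, 150, 150]   # every prediction is off by 50 minutes
one_big = [100, 100, 100, 300]  # three exact predictions, one off by 200 minutes

print(mae(actual, spread), mse(actual, spread))    # 50.0 2500.0
print(mae(actual, one_big), mse(actual, one_big))  # 50.0 10000.0
```

Both sets of predictions have an MAE of 50 minutes, but the single 200-minute miss gives the second set four times the MSE of the first.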
Baseline Model
My model is a Random Forest Regressor that predicts power outage durations (in minutes) using two nominal features, CAUSE.CATEGORY and CLIMATE.CATEGORY, both encoded with OneHotEncoder. These features, representing the primary causes and climatic conditions of outages, were selected for their relevance to outage severity. The model achieved a baseline MSE of 53,884,406 and an r^2 of 0.121, indicating substantial prediction errors. While the model provides a foundation, its performance is limited by the small number of features, and further improvement is needed. Adding quantitative features like ANOMALY.LEVEL or POPULATION and tuning hyperparameters could enhance predictive accuracy and reduce error.
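A baseline of this shape can be sketched with a scikit-learn pipeline. The example below is an assumption-laden illustration rather than the project code: it uses randomly generated stand-in data, a hypothetical 75/25 train/test split, and 50 trees to keep it quick, so its scores will not match the figures reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical); the real model was fit on the outage dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "CAUSE.CATEGORY": rng.choice(
        ["severe weather", "intentional attack", "equipment failure"], size=n
    ),
    "CLIMATE.CATEGORY": rng.choice(["normal", "cold", "warm"], size=n),
})
df["OUTAGE.DURATION"] = rng.exponential(scale=2000.0, size=n)

features = ["CAUSE.CATEGORY", "CLIMATE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["OUTAGE.DURATION"], test_size=0.25, random_state=0
)

baseline = Pipeline([
    # One-hot encode both nominal features; ignore categories unseen during training.
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), features),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
baseline.fit(X_train, y_train)

preds = baseline.predict(X_test)
baseline_mse = mean_squared_error(y_test, preds)
baseline_r2 = r2_score(y_test, preds)
print(f"MSE: {baseline_mse:,.0f}  r^2: {baseline_r2:.3f}")
```

Keeping the encoder inside the pipeline means it is fit only on the training rows, so categories seen only in the test set cannot leak into training.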
Final Model
Feature Selection and Justification
For my Final Model, I added two features: ANOMALY.LEVEL and POPULATION. These features were chosen because they are relevant to the data-generating process. ANOMALY.LEVEL, which represents deviations from typical climate patterns, captures the environmental conditions that often exacerbate power outages. For example, extreme weather events like storms or heatwaves can significantly increase outage durations. POPULATION accounts for the scale of the affected area, as more populous states may experience longer outages due to increased recovery complexity or infrastructure strain. These features complement the existing categorical predictors (CAUSE.CATEGORY and CLIMATE.REGION) by adding quantitative context.
Model and Hyperparameter Selection
I used a Random Forest Regressor for its robustness. The final hyperparameters were selected using GridSearchCV with 3-fold cross-validation. The best-performing hyperparameters were:
- n_estimators: 300 (number of trees in the forest)
- max_depth: 10 (maximum depth of each tree)
- min_samples_split: 5 (minimum samples required to split an internal node)
- max_features: 'sqrt' (number of features considered for splitting at each node)
GridSearchCV systematically evaluated combinations of these hyperparameters, allowing the model to balance complexity and performance.
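A search of this kind could be set up roughly as follows. This is a sketch under assumptions: the data is synthetic, the candidate values in `param_grid` are hypothetical (only the reported best values are known), and the scoring choice of negative MSE is assumed from the evaluation metric described earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical values), using the Final Model's column names.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "CAUSE.CATEGORY": rng.choice(["severe weather", "intentional attack"], size=n),
    "CLIMATE.REGION": rng.choice(["Northeast", "South", "West"], size=n),
    "ANOMALY.LEVEL": rng.normal(0.0, 0.8, size=n).round(1),
    "POPULATION": rng.integers(500_000, 30_000_000, size=n),
})
y = rng.exponential(scale=2000.0, size=n)

categorical = ["CAUSE.CATEGORY", "CLIMATE.REGION"]
numeric = ["ANOMALY.LEVEL", "POPULATION"]

pipeline = Pipeline([
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", numeric),  # trees do not need scaled inputs
    ])),
    ("model", RandomForestRegressor(random_state=0)),
])

# Hypothetical candidate values; the "model__" prefix targets the pipeline's forest step.
param_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [5, 10],
    "model__min_samples_split": [2, 5],
    "model__max_features": ["sqrt"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated MSE: {-search.best_score_:,.0f}")
```

scikit-learn maximizes the scoring function, which is why MSE is passed in negated form and flipped back when printed.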
Performance Improvement
The Final Model achieved an MSE of 53,115,669 and an r^2 of 0.132, an improvement over the Baseline Model's MSE of 53,884,406 and r^2 of 0.121. This is a reduction in MSE of about 1.4%, so the gain is modest. The reduction in error reflects the effect of incorporating the additional quantitative features and optimizing hyperparameters, and it is consistent with ANOMALY.LEVEL and POPULATION capturing some of the underlying variability in the data. Overall, the Final Model has slightly higher predictive accuracy and generalizability than the baseline.