Major Power Outage Risk Analysis

Name: Spencer Goodwin

Website Link: https://spencerg00dwin.github.io/power_outages_analysis/

Introduction

In this project, I analyzed a dataset of major power outages in the U.S. to explore patterns, causes, and impacts of these events. The dataset spans January 2000 to July 2016 and includes key attributes such as outage durations, causes, geographical regions, climate anomalies, and population data.

I first cleaned the dataset and conducted exploratory data analysis to identify trends and anomalies. In particular, I looked at how climatic conditions relate to the number and average duration of power outages. Visualizations such as the average outage duration by cause category provided insight into the most impactful outage causes.

My model predicts outage durations based on both categorical and numerical features, such as climate region, cause category, the El Niño/La Niña (ONI) anomaly index, and population. I started with a baseline Random Forest Regressor and later refined the model using hyperparameter tuning and feature selection. I used Mean Squared Error (MSE) and r-squared to measure the performance of my model.

The original raw DataFrame contains 1534 rows, corresponding to 1534 outages, and 57 columns. I focused on these columns:

Column Name	Definition
OBS	Observation number or row identifier in the dataset
YEAR	Indicates the year when the outage event occurred
MONTH	Indicates the month when the outage event occurred
U.S._STATE	Represents all the states in the continental U.S.
POSTAL.CODE	Represents the postal code of the U.S. states
NERC.REGION	The North American Electric Reliability Corporation (NERC) regions involved in the outage event
CLIMATE.REGION	U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.)
ANOMALY.LEVEL	Represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W)
CLIMATE.CATEGORY	Represents the climate episodes corresponding to the years. The categories—"Warm", "Cold" or "Normal" episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)
CAUSE.CATEGORY	Categories of all the events causing the major power outages
CAUSE.CATEGORY.DETAIL	Detailed description of the event categories causing the major power outages
OUTAGE.DURATION	Duration of outage events (in minutes)
DEMAND.LOSS.MW	Amount of peak demand lost during an outage event (in Megawatt) [but in many cases, total demand is reported]
POPULATION	Population in the U.S. state in a year
OUTAGE.START	Combines the date and time when the outage event started (as reported by the corresponding Utility in the region)
OUTAGE.RESTORATION	Combines the date and time when power was restored to all the customers (as reported by the corresponding Utility in the region)

Cleaning and Exploratory Data Analysis

Cleaning

First, I needed to clean the data. I started by dropping several columns that were unnecessary for my model, specifically: 'PCT_WATER_INLAND', 'PCT_WATER_TOT', 'PCT_LAND', 'AREAPCT_UC', 'AREAPCT_URBAN', 'POPDEN_RURAL', 'POPDEN_UC', 'POPDEN_URBAN', 'POPPCT_URBAN', 'PI.UTIL.OFUSA', 'UTIL.CONTRI', 'TOTAL.REALGSP', 'UTIL.REALGSP', 'PC.REALGSP.CHANGE', 'PC.REALGSP.REL', 'PC.REALGSP.USA', 'PC.REALGSP.STATE', 'IND.CUST.PCT', 'COM.CUST.PCT', 'RES.CUST.PCT', 'TOTAL.CUSTOMERS', 'IND.CUSTOMERS', 'COM.CUSTOMERS', 'RES.CUSTOMERS', 'IND.PERCEN', 'COM.PERCEN', 'RES.PERCEN', 'TOTAL.SALES', 'IND.SALES', 'COM.SALES', 'RES.SALES', 'TOTAL.PRICE', 'IND.PRICE', 'COM.PRICE', 'RES.PRICE', 'HURRICANE.NAMES', 'variables', 'POPPCT_UC', 'CUSTOMERS.AFFECTED'.
I then created the 'OUTAGE.START' and 'OUTAGE.RESTORATION' columns as datetime objects. This was done by combining OUTAGE.START.DATE and OUTAGE.START.TIME into a single column, and OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into another.
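As a rough illustration of the date and time merge described above, here is a minimal sketch in pandas. The helper name `combine_date_time` and the toy rows are assumptions made for this example, as is the choice to turn unparseable or missing values into NaT; the project's actual cleaning code may differ.

```python
import pandas as pd

def combine_date_time(df, date_col, time_col, out_col):
    """Hypothetical helper: merge a date column and a time column into one datetime column."""
    # Parse the date part and the time-of-day part separately; bad or missing values become NaT.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    times = pd.to_timedelta(df[time_col], errors="coerce")
    result = df.drop(columns=[date_col, time_col])
    # Adding a time-of-day offset to a midnight timestamp gives the full datetime.
    result[out_col] = dates + times
    return result

# Toy rows (assumed string inputs); the last row mimics a missing start time.
toy = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", None],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", None],
})
cleaned = combine_date_time(toy, "OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.START")
print(cleaned)
```

The same helper would be called a second time for the restoration date and time columns.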

Here are the first five rows of my cleaned DataFrame:

OBS	YEAR	MONTH	U.S. STATE	POSTAL CODE	NERC REGION	CLIMATE REGION	ANOMALY LEVEL	CLIMATE CATEGORY	CAUSE CATEGORY	CAUSE CATEGORY DETAIL	OUTAGE DURATION	DEMAND LOSS MW	POPULATION	OUTAGE START	OUTAGE RESTORATION
1	2011	7	Minnesota	MN	MRO	East North Central	-0.3	normal	severe weather	nan	3060	nan	5348119	2011-07-01 17:00:00	2011-07-03 20:00:00
2	2014	5	Minnesota	MN	MRO	East North Central	-0.1	normal	intentional attack	vandalism	1	nan	5457125	2014-05-11 18:38:00	2014-05-11 18:39:00
3	2010	10	Minnesota	MN	MRO	East North Central	-1.5	cold	severe weather	heavy wind	3000	nan	5310903	2010-10-26 20:00:00	2010-10-28 22:00:00
4	2012	6	Minnesota	MN	MRO	East North Central	-0.1	normal	severe weather	thunderstorm	2550	nan	5380443	2012-06-19 04:30:00	2012-06-20 23:00:00
5	2015	7	Minnesota	MN	MRO	East North Central	1.2	warm	severe weather	nan	1740	250	5489594	2015-07-18 02:00:00	2015-07-19 07:00:00

Exploratory Data Analysis

Univariate

This bar chart visualizes the average power outage duration for each cause category.

This bar chart displays the number of power outages across different climate regions.

This pie chart illustrates the proportion of power outages attributed to each climate category.

Bivariate

This pivot table highlights the average outage durations across various climate regions and cause categories. It reveals that certain combinations, such as East North Central with specific causes, exhibit significantly higher average durations. This further indicates regional and causal dependencies of outages. This variability underscores the importance of tailoring mitigation strategies to both the climate region and the underlying causes of outages.

Climate Region	East North Central	Northeast	Central	Northwest	South	Southeast	Southwest	West	West North Central	Overall Average
Average Outage Duration	26,435	216	322	702	296	554	114	525	61	1,817
Another Measure	33,971	14,630	10,035	1	17,482	NaN	76	6,155	NaN	13,484
Third Measure	2,376	196	346	374	326	505	266	858	24	430
Fourth Measure	1	881	125	73	494	NaN	2	215	68	201
Fifth Measure	733	2,655	1,410	898	1,164	2,865	2,275	2,028	440	1,468
Sixth Measure	4,435	4,430	3,250	4,838	4,391	2,663	11,573	2,928	2,442	3,900
Seventh Measure	2,610	774	2,695	141	866	169	329	364	NaN	733
Eighth Measure	5,352	2,992	2,701	1,284	2,846	2,218	1,566	1,628	697	2,631
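A table of this shape can be produced with pandas' `pivot_table`. The sketch below uses a small made-up DataFrame, and it assumes that rows are cause categories, columns are climate regions, and the margins are labelled "Overall Average"; none of the numbers below come from the real dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "CLIMATE.REGION": ["Central", "Central", "West", "West"],
    "CAUSE.CATEGORY": ["severe weather", "severe weather", "severe weather", "islanding"],
    "OUTAGE.DURATION": [100.0, 300.0, 50.0, 10.0],
})

# Mean outage duration for each (cause, region) pair, plus overall averages.
pivot = toy.pivot_table(
    index="CAUSE.CATEGORY",
    columns="CLIMATE.REGION",
    values="OUTAGE.DURATION",
    aggfunc="mean",
    margins=True,
    margins_name="Overall Average",
)
print(pivot)
```

Combinations that never occur in the data (here, islanding in Central) show up as NaN, which is how NaN cells in a pivot table like the one above would typically arise.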

Imputation

I used probabilistic imputation to fill in missing values for OUTAGE.DURATION. This technique randomly samples from the observed values in the column to replace missing values, preserving the variability and distribution of the original data.
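One way to implement this kind of imputation is sketched below. The function name `probabilistic_impute`, the fixed seed, and the toy values are assumptions for illustration; the sketch samples with replacement from the observed values, which is a common way to do it but not necessarily the exact procedure used in the project.

```python
import numpy as np
import pandas as pd

def probabilistic_impute(series, seed=0):
    """Hypothetical helper: fill NaNs by sampling from the column's observed values."""
    missing = series.isna()
    n_missing = int(missing.sum())
    filled = series.copy()
    if n_missing == 0:
        return filled
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    # Draw one observed value (with replacement) for every missing entry.
    filled[missing] = rng.choice(observed, size=n_missing, replace=True)
    return filled

# Toy durations in minutes, with two missing entries.
durations = pd.Series([3060.0, 1.0, np.nan, 2550.0, np.nan, 1740.0], name="OUTAGE.DURATION")
imputed = probabilistic_impute(durations)
print(imputed)
```

Because every filled value is drawn from the observed data, the imputed column keeps roughly the same distribution as the original, unlike mean imputation, which would pile all missing rows onto a single value.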

Framing

My prediction problem is to estimate outage duration (in minutes) based on pre-outage conditions, making this a regression task. I chose OUTAGE.DURATION as the response variable because it directly measures the severity and impact of a power outage, which is critical for emergency planning. I evaluated my model using Mean Squared Error (MSE), as it penalizes large errors more heavily than alternatives like Mean Absolute Error (MAE). This matters because large prediction errors for outage durations can lead to inadequate preparation or to overcommitting resources.
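To make the difference between MSE and MAE concrete, here is a small worked example with made-up numbers. Both sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```python
import numpy as np

# Toy true durations in minutes (illustrative values only).
y_true = np.array([100.0, 100.0, 100.0, 100.0])

# Predictions A: off by 50 minutes everywhere.
pred_a = np.array([150.0, 150.0, 150.0, 150.0])
# Predictions B: perfect three times, off by 200 minutes once.
pred_b = np.array([100.0, 100.0, 100.0, 300.0])

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def mse(y, p):
    return float(np.mean((y - p) ** 2))

print(mae(y_true, pred_a), mae(y_true, pred_b))  # 50.0 50.0
print(mse(y_true, pred_a), mse(y_true, pred_b))  # 2500.0 10000.0
```

MAE treats the two sets of predictions as equally good, while MSE rates the set with the single 200-minute miss four times worse.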

I trained the model only on features available at the time of prediction, such as climate region, cause category, population data, and the El Niño/La Niña (ONI) anomaly index. To avoid data leakage, I excluded any information that would only become available after the outage began, such as restoration times. By selecting these features, I aimed to build a model that can provide actionable predictions based on information known before an outage occurs. My approach started with a baseline Random Forest Regressor and improved it through hyperparameter tuning and feature engineering, focusing on maximizing accuracy while maintaining interpretability.

Baseline Model

My baseline model is a Random Forest Regressor that predicts power outage durations (in minutes) using two nominal features, CAUSE.CATEGORY and CLIMATE.CATEGORY, both encoded with OneHotEncoder. These features, representing the primary causes and climatic conditions of outages, were selected for their relevance to outage severity. The model achieved a baseline MSE of 53,884,406 and an r^2 of 0.121, indicating substantial prediction errors. While the model provides a foundation, its performance is limited by the small number of features, and further improvement is needed. Adding quantitative features like ANOMALY.LEVEL or POPULATION and tuning hyperparameters could improve predictive accuracy and reduce error.
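A baseline of this kind can be sketched as a scikit-learn pipeline. Everything below runs on randomly generated toy data, so the metrics it prints are meaningless; the train/test split ratio, the number of trees, and `handle_unknown="ignore"` are assumptions for this sketch, not settings taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned outage data (not real values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "CAUSE.CATEGORY": rng.choice(["severe weather", "intentional attack", "islanding"], size=n),
    "CLIMATE.CATEGORY": rng.choice(["warm", "cold", "normal"], size=n),
})
toy["OUTAGE.DURATION"] = rng.exponential(scale=1000.0, size=n)

features = ["CAUSE.CATEGORY", "CLIMATE.CATEGORY"]
X_train, X_test, y_train, y_test = train_test_split(
    toy[features], toy["OUTAGE.DURATION"], test_size=0.25, random_state=0
)

# One-hot encode both nominal features, then fit a random forest on the result.
baseline = Pipeline([
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), features),
    ])),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.0f}, r^2: {r2:.3f}")
```

Keeping the encoder inside the pipeline means it is fit on the training split only, which avoids leaking information from the test set.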

Final Model

Feature Selection and Justification

For my Final Model, I added two features: ANOMALY.LEVEL and POPULATION. These features were chosen because they are relevant to the data-generating process. ANOMALY.LEVEL, the ONI index, measures how far conditions deviate from typical climate patterns and so captures environmental conditions that can exacerbate power outages. For example, extreme weather events like storms or heatwaves can significantly increase outage durations. POPULATION accounts for the scale of the affected area, as more populous states may experience longer outages due to increased recovery complexity or infrastructure strain. These features complement the existing categorical predictors (CAUSE.CATEGORY and CLIMATE.REGION) by adding quantitative context.

Model and Hyperparameter Selection

I used a Random Forest Regressor for its robustness. The final hyperparameters were selected using GridSearchCV with 3-fold cross-validation. The best-performing hyperparameters were:

n_estimators: 300 (number of trees in the forest)
max_depth: 10 (maximum depth of each tree)
min_samples_split: 5 (minimum samples required to split an internal node)
max_features: 'sqrt' (number of features considered for splitting at each node)

GridSearchCV systematically evaluated combinations of these hyperparameters, allowing the model to balance complexity and performance.
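The search described above could look roughly like the sketch below. It runs on synthetic data, and the grid is an assumption: it contains the reported best values (300, 10, 5, 'sqrt') alongside a few hypothetical alternatives, since the full grid that was searched is not listed. The categorical columns follow the baseline description, and scoring by negative MSE is also assumed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned outage data (not real values).
rng = np.random.default_rng(0)
n = 150
toy = pd.DataFrame({
    "CAUSE.CATEGORY": rng.choice(["severe weather", "intentional attack", "islanding"], size=n),
    "CLIMATE.CATEGORY": rng.choice(["warm", "cold", "normal"], size=n),
    "ANOMALY.LEVEL": rng.normal(0.0, 0.8, size=n).round(1),
    "POPULATION": rng.integers(500_000, 40_000_000, size=n),
})
toy["OUTAGE.DURATION"] = rng.exponential(scale=1000.0, size=n)

categorical = ["CAUSE.CATEGORY", "CLIMATE.CATEGORY"]
numeric = ["ANOMALY.LEVEL", "POPULATION"]

# One-hot encode the nominal columns; pass the numeric columns through unchanged.
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )),
    ("forest", RandomForestRegressor(random_state=0)),
])

# Hypothetical grid that includes the reported best values.
param_grid = {
    "forest__n_estimators": [100, 300],
    "forest__max_depth": [5, 10],
    "forest__min_samples_split": [5],
    "forest__max_features": ["sqrt"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(toy[categorical + numeric], toy["OUTAGE.DURATION"])

best_params = search.best_params_
best_cv_mse = -search.best_score_  # scores are negated MSE, so flip the sign
print(best_params, best_cv_mse)
```

With `cv=3`, each parameter combination is fit three times on different train/validation splits, and the combination with the lowest average validation MSE is refit on all of the data.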

Performance Improvement

The Final Model achieved an MSE of 53,115,669 and an r^2 of 0.132, compared with the Baseline Model's MSE of 53,884,406 and r^2 of 0.121. This is a modest improvement: MSE fell by about 1.4%. The reduction in error reflects the additional quantitative features and the tuned hyperparameters. It is consistent with ANOMALY.LEVEL and POPULATION capturing some of the variability in outage durations, although most of that variability remains unexplained. Overall, the Final Model predicts outage durations slightly more accurately than the baseline.
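Since MSE is in squared minutes, it can help to convert the two reported values to root mean squared error (RMSE), which is in minutes, and to compute the relative change. The short calculation below uses only the two MSE figures reported above; the variable names are my own.

```python
import math

baseline_mse = 53_884_406  # reported Baseline Model MSE
final_mse = 53_115_669     # reported Final Model MSE

# RMSE puts the error back on the scale of the target (minutes).
baseline_rmse = math.sqrt(baseline_mse)
final_rmse = math.sqrt(final_mse)

# Fraction of the baseline MSE removed by the Final Model.
relative_reduction = (baseline_mse - final_mse) / baseline_mse

print(f"Baseline RMSE: {baseline_rmse:.0f} minutes")  # about 7341
print(f"Final RMSE: {final_rmse:.0f} minutes")        # about 7288
print(f"MSE reduction: {relative_reduction:.1%}")     # about 1.4%
```

Both models are typically off by roughly five days when measured this way, which puts the size of the improvement in context.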