Major Power Outage Risk Analysis

Name: Spencer Goodwin

Website Link: https://spencerg00dwin.github.io/power_outages_analysis/

Introduction

In this project, I analyzed a dataset of major power outages in the U.S. to explore patterns, causes, and impacts of these events. The dataset spans January 2000 to July 2016 and includes key attributes such as outage durations, causes, geographical regions, climate anomalies, and population data.

I began by cleaning the dataset and conducting exploratory data analysis to identify trends and anomalies. In particular, I looked at how climatic trends relate to power outage durations and their averages. Visualizations such as the average outage duration by cause category provided insight into which causes have the greatest impact.

My model predicts outage durations based on both categorical and numerical features, such as climate region, cause category, El Niño/La Niña (ONI) anomaly index, and population. I started with a baseline Random Forest Regressor and later refined the model using hyperparameter tuning and feature selection techniques. I used evaluation metrics such as Mean Squared Error (MSE) and r-squared to measure the performance of my model.

The original raw DataFrame contains 1534 rows, corresponding to 1534 outages, and 57 columns. I focused on these columns:

Column Name Definition
OBS Observation number or row identifier in the dataset
YEAR Indicates the year when the outage event occurred
MONTH Indicates the month when the outage event occurred
U.S._STATE Represents all the states in the continental U.S.
POSTAL.CODE Represents the postal code of the U.S. states
NERC.REGION The North American Electric Reliability Corporation (NERC) regions involved in the outage event
CLIMATE.REGION U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.)
ANOMALY.LEVEL Represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season. It is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region (5°N to 5°S, 120–170°W)
CLIMATE.CATEGORY Represents the climate episodes corresponding to the years. The categories—"Warm", "Cold" or "Normal" episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)
CAUSE.CATEGORY Categories of all the events causing the major power outages
CAUSE.CATEGORY.DETAIL Detailed description of the event categories causing the major power outages
OUTAGE.DURATION Duration of outage events (in minutes)
DEMAND.LOSS.MW Amount of peak demand lost during an outage event (in Megawatt) [but in many cases, total demand is reported]
POPULATION Population in the U.S. state in a year
OUTAGE.START Combines the date and time when the outage event started (as reported by the corresponding Utility in the region)
OUTAGE.RESTORATION Combines the date and time when power was restored to all the customers (as reported by the corresponding Utility in the region)

Cleaning and Exploratory Data Analysis

Cleaning

  1. First, I needed to clean the data. I started by dropping several columns that were unnecessary for my model: ‘PCT_WATER_INLAND’, ‘PCT_WATER_TOT’, ‘PCT_LAND’, ‘AREAPCT_UC’, ‘AREAPCT_URBAN’, ‘POPDEN_RURAL’, ‘POPDEN_UC’, ‘POPDEN_URBAN’, ‘POPPCT_URBAN’, ‘PI.UTIL.OFUSA’, ‘UTIL.CONTRI’, ‘TOTAL.REALGSP’, ‘UTIL.REALGSP’, ‘PC.REALGSP.CHANGE’, ‘PC.REALGSP.REL’, ‘PC.REALGSP.USA’, ‘PC.REALGSP.STATE’, ‘IND.CUST.PCT’, ‘COM.CUST.PCT’, ‘RES.CUST.PCT’, ‘TOTAL.CUSTOMERS’, ‘IND.CUSTOMERS’, ‘COM.CUSTOMERS’, ‘RES.CUSTOMERS’, ‘IND.PERCEN’, ‘COM.PERCEN’, ‘RES.PERCEN’, ‘TOTAL.SALES’, ‘IND.SALES’, ‘COM.SALES’, ‘RES.SALES’, ‘TOTAL.PRICE’, ‘IND.PRICE’, ‘COM.PRICE’, ‘RES.PRICE’, ‘HURRICANE.NAMES’, ‘variables’, ‘POPPCT_UC’, ‘CUSTOMERS.AFFECTED’.

  2. I then created the ‘OUTAGE.START’ and ‘OUTAGE.RESTORATION’ columns as datetime objects. This was done by combining OUTAGE.START.DATE and OUTAGE.START.TIME into a single column, and OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into another.
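The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration rather than the exact code used for the project: the toy rows, the helper name `combine_date_time`, and the assumption that the date and time parts arrive as strings are all hypothetical, and only a few of the dropped columns are shown.

```python
import pandas as pd

# Toy rows standing in for the raw data (assumption: date and time parts are strings).
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", None],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", None],
    "OUTAGE.RESTORATION.DATE": ["2011-07-03", "2014-05-11", None],
    "OUTAGE.RESTORATION.TIME": ["20:00:00", "18:39:00", None],
    "HURRICANE.NAMES": [None, None, None],
})


def combine_date_time(df, date_col, time_col):
    """Join a date column and a time column into a single datetime Series.

    Rows where either part is missing cannot be parsed and become NaT.
    """
    combined = df[date_col].astype(str) + " " + df[time_col].astype(str)
    return pd.to_datetime(combined, format="%Y-%m-%d %H:%M:%S", errors="coerce")


cleaned = raw.assign(**{
    "OUTAGE.START": combine_date_time(raw, "OUTAGE.START.DATE", "OUTAGE.START.TIME"),
    "OUTAGE.RESTORATION": combine_date_time(
        raw, "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME"
    ),
})

# Drop the source date/time parts and columns not needed for modeling.
# errors="ignore" skips any listed name that is absent from this toy frame.
cleaned = cleaned.drop(
    columns=[
        "OUTAGE.START.DATE", "OUTAGE.START.TIME",
        "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME",
        "HURRICANE.NAMES", "PCT_LAND", "TOTAL.SALES",
    ],
    errors="ignore",
)
```

Coercing unparseable rows to NaT keeps outages with missing timestamps in the frame so they can be handled later instead of raising an error during cleaning.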

Here are the first five rows of my cleaned DataFrame:

OBS YEAR MONTH U.S. STATE POSTAL CODE NERC REGION CLIMATE REGION ANOMALY LEVEL CLIMATE CATEGORY CAUSE CATEGORY CAUSE CATEGORY DETAIL OUTAGE DURATION DEMAND LOSS MW POPULATION OUTAGE START OUTAGE RESTORATION
1 2011 7 Minnesota MN MRO East North Central -0.3 normal severe weather nan 3060 nan 5348119 2011-07-01 17:00:00 2011-07-03 20:00:00
2 2014 5 Minnesota MN MRO East North Central -0.1 normal intentional attack vandalism 1 nan 5457125 2014-05-11 18:38:00 2014-05-11 18:39:00
3 2010 10 Minnesota MN MRO East North Central -1.5 cold severe weather heavy wind 3000 nan 5310903 2010-10-26 20:00:00 2010-10-28 22:00:00
4 2012 6 Minnesota MN MRO East North Central -0.1 normal severe weather thunderstorm 2550 nan 5380443 2012-06-19 04:30:00 2012-06-20 23:00:00
5 2015 7 Minnesota MN MRO East North Central 1.2 warm severe weather nan 1740 250 5489594 2015-07-18 02:00:00 2015-07-19 07:00:00

Exploratory Data Analysis

Univariate

This bar chart visualizes the average power outage duration for each cause category.

This bar chart displays the number of power outages across different climate regions.

This pie chart illustrates the proportion of power outages attributed to each climate category.
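The aggregations behind these three charts might look like the sketch below. It uses the five rows shown earlier as toy input; the variable names are illustrative, and the plotting step itself is omitted.

```python
import pandas as pd

# The five example rows from the cleaned DataFrame, reduced to the relevant columns.
outages = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather", "intentional attack", "severe weather",
                       "severe weather", "severe weather"],
    "CLIMATE.REGION": ["East North Central"] * 5,
    "CLIMATE.CATEGORY": ["normal", "normal", "cold", "normal", "warm"],
    "OUTAGE.DURATION": [3060, 1, 3000, 2550, 1740],
})

# Bar chart 1: average outage duration (minutes) per cause category.
avg_duration_by_cause = (
    outages.groupby("CAUSE.CATEGORY")["OUTAGE.DURATION"].mean().sort_values(ascending=False)
)

# Bar chart 2: number of outages per climate region.
outages_per_region = outages["CLIMATE.REGION"].value_counts()

# Pie chart: share of outages in each climate category.
climate_category_share = outages["CLIMATE.CATEGORY"].value_counts(normalize=True)
```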

Bivariate

This pivot table highlights the average outage durations across various climate regions and cause categories. It reveals that certain combinations, such as East North Central with specific causes, exhibit significantly higher average durations. This further indicates regional and causal dependencies of outages. This variability underscores the importance of tailoring mitigation strategies to both the climate region and the underlying causes of outages.

Climate Region East North Central Northeast Central Northwest South Southeast Southwest West West North Central Overall Average
Average Outage Duration 26,435 216 322 702 296 554 114 525 61 1,817
Another Measure 33,971 14,630 10,035 1 17,482 NaN 76 6,155 NaN 13,484
Third Measure 2,376 196 346 374 326 505 266 858 24 430
Fourth Measure 1 881 125 73 494 NaN 2 215 68 201
Fifth Measure 733 2,655 1,410 898 1,164 2,865 2,275 2,028 440 1,468
Sixth Measure 4,435 4,430 3,250 4,838 4,391 2,663 11,573 2,928 2,442 3,900
Seventh Measure 2,610 774 2,695 141 866 169 329 364 NaN 733
Eighth Measure 5,352 2,992 2,701 1,284 2,846 2,218 1,566 1,628 697 2,631
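A pivot table of this shape can be produced with `DataFrame.pivot_table`. The sketch below assumes cause categories form the rows and climate regions the columns, and uses made-up toy durations; the `margins` option adds the overall averages.

```python
import pandas as pd

# Toy data (hypothetical durations) with two causes and two climate regions.
outages = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather", "severe weather", "intentional attack",
                       "severe weather", "intentional attack"],
    "CLIMATE.REGION": ["East North Central", "East North Central", "East North Central",
                       "Northeast", "Northeast"],
    "OUTAGE.DURATION": [3060, 3000, 1, 1000, 3],
})

# Mean duration for each (cause, region) pair; combinations with no outages become NaN.
pivot = outages.pivot_table(
    index="CAUSE.CATEGORY",
    columns="CLIMATE.REGION",
    values="OUTAGE.DURATION",
    aggfunc="mean",
    margins=True,                    # add a summary row and column
    margins_name="Overall Average",
)
```

Note that the margin values are computed from the underlying rows, so they weight each outage equally rather than averaging the cell means.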

Imputation

I used probabilistic imputation to fill in missing values for OUTAGE.DURATION. This technique randomly samples from the observed values in the column to replace missing values, preserving the variability and distribution of the original data.
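As a minimal sketch of this technique, the function below replaces each missing value with a random draw (with replacement) from the observed values. The function name, the seed parameter, and the toy Series are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def probabilistic_impute(series, seed=None):
    """Fill missing values by sampling with replacement from the observed values."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    # Draw one observed value for every missing entry.
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled


durations = pd.Series([3060, 1, np.nan, 2550, np.nan, 1740], name="OUTAGE.DURATION")
imputed = probabilistic_impute(durations, seed=42)
```

Because the fills are drawn from the empirical distribution, the column's spread is preserved, unlike mean imputation, which would pile every missing value onto a single number.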

Framing

My prediction problem is to estimate outage duration (in minutes) based on pre-outage conditions, making this a regression task. I chose OUTAGE.DURATION as the response variable because it directly measures the severity and impact of a power outage, which is critical for emergency planning. I evaluated my model using Mean Squared Error (MSE) because it penalizes large errors more heavily than alternatives like Mean Absolute Error (MAE). This matters because large prediction errors for outage durations can lead to inadequate preparation or overallocation of resources.

I trained the model only on features available at the time of prediction, such as climate region, cause category, population data, and the El Niño/La Niña (ONI) anomaly index. To avoid data leakage, I excluded any information that would only become available after the outage began, such as restoration times. By selecting these features, I aimed to build a model that provides actionable predictions based on information known before an outage occurs. My approach started with a baseline Random Forest Regressor and improved on it through hyperparameter tuning and feature engineering, focusing on maximizing accuracy while maintaining interpretability.
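To make the difference between the two metrics concrete, the small example below compares two hypothetical sets of predictions that have the same MAE but different MSE. All numbers are invented for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [60, 120, 180, 3000]           # outage durations in minutes (toy values)
evenly_off = [70, 110, 190, 2990]       # every prediction is off by 10 minutes
one_big_miss = [60, 120, 180, 2960]     # three exact, one off by 40 minutes

mae_even = mean_absolute_error(actual, evenly_off)      # 10.0
mae_big = mean_absolute_error(actual, one_big_miss)     # 10.0
mse_even = mean_squared_error(actual, evenly_off)       # 100.0
mse_big = mean_squared_error(actual, one_big_miss)      # 400.0
```

Both prediction sets look identical under MAE, but MSE is four times larger for the set with the single large miss, which is the behavior described above.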

Baseline Model

My model is a Random Forest Regressor that predicts power outage durations (in minutes) using two nominal features: CAUSE.CATEGORY and CLIMATE.CATEGORY, both encoded with OneHotEncoder. These features, representing the primary causes and climatic conditions of outages, were selected for their relevance to outage severity. The model achieved a baseline MSE of 53,884,406 and an r^2 of 0.121, indicating substantial prediction errors. While the model provides a foundation, its performance is limited by the small number of features, and further improvement is needed. Adding quantitative features like ANOMALY.LEVEL or POPULATION and tuning hyperparameters could enhance predictive accuracy and reduce error.
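A baseline of this kind could be assembled with scikit-learn roughly as follows. This is a sketch under stated assumptions: the toy durations, the 75/25 split, and the `random_state` values are hypothetical, and the scores it produces have no relation to the figures reported above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with invented durations; the real data has one row per outage.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather", "intentional attack"] * 6,
    "CLIMATE.CATEGORY": ["normal", "cold", "warm"] * 4,
    "OUTAGE.DURATION": [3060, 1, 3000, 30, 2550, 60, 1740, 5, 2800, 15, 2400, 45],
})

features = ["CAUSE.CATEGORY", "CLIMATE.CATEGORY"]
X, y = toy[features], toy["OUTAGE.DURATION"]

# One-hot encode both nominal features; categories unseen in training are ignored.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), features)]
)
baseline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
```

Keeping the encoder inside the pipeline means it is fit only on the training split, so no information from the test rows leaks into preprocessing.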

Final Model

Feature Selection and Justification

For my Final Model, I added two features: ANOMALY.LEVEL and POPULATION. I chose them because they are relevant to the data-generating process. ANOMALY.LEVEL, the El Niño/La Niña (ONI) index, represents deviations from typical climate patterns and captures environmental conditions that often exacerbate power outages. For example, extreme weather events like storms or heatwaves can significantly increase outage durations. POPULATION accounts for the scale of the affected area, as more populous states may experience longer outages due to increased recovery complexity or infrastructure strain. These features complement the existing categorical predictors (CAUSE.CATEGORY and CLIMATE.REGION) by adding quantitative context.

Model and Hyperparameter Selection

I used a Random Forest Regressor for its robustness. The final hyperparameters were selected using GridSearchCV with 3-fold cross-validation. The best-performing hyperparameters were:

GridSearchCV systematically evaluated combinations of these hyperparameters, allowing the model to balance complexity and performance.
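A search of this kind can be sketched as below. The parameter grid is purely hypothetical and is not the grid or the best values from this project; the toy data is invented, and the numeric features are passed through unscaled, which is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["CAUSE.CATEGORY", "CLIMATE.REGION"]
numeric = ["ANOMALY.LEVEL", "POPULATION"]

# Toy data with invented values; the real data has one row per outage.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather", "intentional attack"] * 6,
    "CLIMATE.REGION": ["East North Central", "Northeast", "West"] * 4,
    "ANOMALY.LEVEL": [-0.3, -0.1, -1.5, -0.1, 1.2, 0.4, -0.7, 0.9, 0.0, -1.1, 0.6, 0.2],
    "POPULATION": [5348119, 5457125, 5310903, 5380443, 5489594, 5400000,
                   5420000, 5350000, 5330000, 5470000, 5360000, 5440000],
    "OUTAGE.DURATION": [3060, 1, 3000, 30, 2550, 60, 1740, 5, 2800, 15, 2400, 45],
})
X, y = toy[categorical + numeric], toy["OUTAGE.DURATION"]

# One-hot encode the categorical columns and pass the numeric ones through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(random_state=0)),
])

# Hypothetical grid; keys use the "<step name>__<parameter>" convention.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # scikit-learn reports MSE negated so higher is better
```

Scoring with `neg_mean_squared_error` keeps the search aligned with the MSE metric used to evaluate the models.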

Performance Improvement

The Final Model achieved an MSE of 53,115,669 and an r^2 of 0.132, an improvement over the Baseline Model’s MSE of 53,884,406 and r^2 of 0.121 (roughly a 1.4% reduction in MSE). This reduction in error reflects the model’s enhanced ability to predict outage durations by incorporating the additional quantitative features and optimizing hyperparameters. The performance gain is consistent with ANOMALY.LEVEL and POPULATION capturing some of the underlying variability in the data. Overall, the Final Model has higher predictive accuracy and generalizability than the baseline, although the gain is modest.