Predicting flight delays with machine learning

Check out the app here: Predicting Flight Delays

Having been a consultant in a previous life, I have a deep fear of flight delays and of spending Thursday nights at the airport.

Using Carrier On-Time Performance data from the Bureau of Transportation Statistics and daily weather data from the National Oceanic and Atmospheric Administration, I wanted to predict whether or not a flight would be delayed.

For this model, I narrowed my data to the top 40 airports (for both departure and arrival) to improve the model's accuracy.
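
As a rough sketch of what that filtering step might look like in pandas: the column names `ORIGIN` and `DEST`, the helper `keep_top_airports`, and the choice to rank airports by number of departures are all assumptions for illustration, not taken from the original notebook.

```python
import pandas as pd

def keep_top_airports(flights: pd.DataFrame, n: int = 40) -> pd.DataFrame:
    """Keep flights whose origin AND destination are among the n busiest origins."""
    # Rank airports by number of departures (assumed ranking criterion).
    top = flights["ORIGIN"].value_counts().nlargest(n).index
    mask = flights["ORIGIN"].isin(top) & flights["DEST"].isin(top)
    return flights[mask].reset_index(drop=True)

# Tiny made-up example, using n=2 instead of 40.
demo = pd.DataFrame({
    "ORIGIN": ["ATL", "ATL", "ORD", "ORD", "BOI"],
    "DEST":   ["ORD", "BOI", "ATL", "ATL", "ATL"],
})
filtered = keep_top_airports(demo, n=2)
print(filtered)
```

Requiring both endpoints to be in the top set drops any route that touches a smaller airport, which is one reading of "departure and arrival" above.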

Checking out the data

Choosing your airline.

percentage_of_delays_eda_flights_datascience.png

Certain airlines saw fewer delays. Generally, the airlines with fewer delays also had a lower average delay in minutes.

The best airline to fly to avoid delays is Delta (with fewer than 16% of flights delayed and an average delay of around 10 minutes), and the worst is JetBlue (with over 26% of flights delayed and an average delay of over 20 minutes).
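
A per-airline summary like this could be computed with a pandas group-by. The sketch below uses a handful of made-up rows and hypothetical column names (`CARRIER`, `DELAYED`, `DELAY_MINUTES`); the real dataset's columns and its definition of "delayed" may differ.

```python
import pandas as pd

# Made-up rows for illustration only; not the actual BTS data.
flights = pd.DataFrame({
    "CARRIER": ["DL", "DL", "DL", "DL", "B6", "B6", "B6", "B6"],
    "DELAYED": [0, 0, 0, 1, 0, 1, 1, 0],           # 1 = flight counted as delayed
    "DELAY_MINUTES": [0, 0, 0, 40, 0, 30, 50, 0],
})

by_carrier = (
    flights.groupby("CARRIER")
    .agg(pct_delayed=("DELAYED", "mean"), avg_delay_min=("DELAY_MINUTES", "mean"))
    .assign(pct_delayed=lambda d: d["pct_delayed"] * 100)  # fraction -> percent
    .sort_values("pct_delayed")                            # fewest delays first
)
print(by_carrier)
```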

Best time to fly.

flightdelays_datascience_tableau.png

Delays are least frequent in the morning and appear to accumulate throughout the day. As 5 PM approaches, delays become more prevalent. Overall, Saturday is the best day to fly, with the fewest delays, and Thursday and Friday are the worst (especially in the evening).
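
The hour-by-weekday view behind a chart like this can be sketched as a pivot table of delay rates. The column names (`DAY_OF_WEEK`, `DEP_HOUR`, `DELAYED`), the weekday encoding, and the rows are assumptions made up for the example.

```python
import pandas as pd

# Made-up rows; DAY_OF_WEEK is assumed to run 1 = Monday ... 7 = Sunday.
flights = pd.DataFrame({
    "DAY_OF_WEEK": [4, 4, 4, 4, 6, 6, 6, 6],
    "DEP_HOUR":    [7, 7, 17, 17, 7, 7, 17, 17],
    "DELAYED":     [0, 1, 1, 1, 0, 0, 0, 1],
})

# Rows are departure hours, columns are weekdays, cells are the share of delayed flights.
delay_rate = flights.pivot_table(
    index="DEP_HOUR", columns="DAY_OF_WEEK", values="DELAYED", aggfunc="mean"
)
print(delay_rate)
```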

Incorporating Weather Data

While my initial plan was to use hourly weather data, I was unable to find a free weather API with historical information. Instead, I settled on data from NOAA, which provides daily precipitation levels by location as well as binary flags for whether certain weather occurred that day (ice fog / freezing fog, heavy fog, thunder, ice pellets / snow, hail, glaze or rime, dust / sand / volcanic ash, smoke or haze, drifting snow, tornado / funnel clouds).
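
Because the weather data is daily, attaching it to flights amounts to a join on date and airport. The following is a minimal sketch under assumed column names (`FL_DATE`, `ORIGIN`, `DATE`, `AIRPORT`, `PRCP`, `GLAZE_RIME`) with made-up values; mapping NOAA stations to airports is left out entirely.

```python
import pandas as pd

# Made-up flights and weather rows for illustration only.
flights = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02"]),
    "ORIGIN": ["ATL", "ORD", "ORD"],
})
weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "AIRPORT": ["ORD", "ORD"],
    "PRCP": [0.4, 0.0],      # daily precipitation
    "GLAZE_RIME": [1, 0],    # 1 if glaze or rime was reported that day
})

# Left join keeps every flight, even when no weather row matches.
merged = flights.merge(
    weather, how="left",
    left_on=["FL_DATE", "ORIGIN"], right_on=["DATE", "AIRPORT"],
).drop(columns=["DATE", "AIRPORT"])

# Simplification: treat missing weather as "nothing reported".
merged[["PRCP", "GLAZE_RIME"]] = merged[["PRCP", "GLAZE_RIME"]].fillna(0)
print(merged)
```

The same join could be repeated on the destination airport if arrival weather is also wanted.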

However, after feature engineering and understanding the coefficients of my model better, it turned out only precipitation and glaze/rime had a significant effect on how my model performed.
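
One way to read feature influence off a logistic regression is to rank its coefficients by absolute size. The sketch below uses synthetic data in which, by construction, only the hypothetical `PRCP` and `GLAZE_RIME` features affect the outcome; it illustrates the inspection step and is not the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "PRCP": rng.exponential(0.2, n),
    "GLAZE_RIME": rng.integers(0, 2, n),
    "SMOKE_HAZE": rng.integers(0, 2, n),
})
# Synthetic target: only PRCP and GLAZE_RIME change the delay odds.
log_odds = -1.5 + 2.0 * X["PRCP"] + 1.0 * X["GLAZE_RIME"]
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

logit = LogisticRegression(solver="liblinear").fit(X, y)

# Pair each coefficient with its feature name and order by magnitude.
coefs = pd.Series(logit.coef_[0], index=X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Coefficient magnitudes are only directly comparable when the features are on similar scales, so continuous inputs such as precipitation are usually standardized before reading them this way.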

Feature Engineering

The original features in my model were: Origin Airport, Destination Airport, Month of Departure, Day of Week of Departure, Hour of Departure, and Carrier.

Continued feature engineering produced additional features. I added the total passenger boardings at the origin airport in the most recent year, which I scraped from Wikipedia. I wanted a way to weight airports differently, since busier airports would probably experience more delays.

Similarly, I added weights for certain origin airports that saw more frequent delays, as well as some weather-related features.

feature_engineering_datascience.png
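
A minimal sketch of how these features might be assembled, assuming the categorical columns are one-hot encoded (the post does not say which encoding was used). Column names are hypothetical and the boardings numbers are placeholders, not real figures.

```python
import pandas as pd

# Made-up flights for illustration only.
flights = pd.DataFrame({
    "ORIGIN": ["ATL", "ORD", "ATL"],
    "DEST": ["ORD", "ATL", "JFK"],
    "MONTH": [1, 1, 7],
    "DAY_OF_WEEK": [4, 6, 5],
    "DEP_HOUR": [7, 17, 18],
    "CARRIER": ["DL", "UA", "B6"],
})

# Placeholder annual boardings per airport, NOT real numbers.
boardings = {"ATL": 3_000, "ORD": 2_000}
flights["ORIGIN_BOARDINGS"] = flights["ORIGIN"].map(boardings)

# One-hot encode every categorical feature; the numeric boardings column passes through.
categorical = ["ORIGIN", "DEST", "MONTH", "DAY_OF_WEEK", "DEP_HOUR", "CARRIER"]
X = pd.get_dummies(flights, columns=categorical)
print(X.columns.tolist())
```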

Choosing the right model

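KNN:
from sklearn.neighbors import KNeighborsClassifier
# Classify each flight by majority vote among its 13 nearest training flights.
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)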

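Logistic Regression:
from sklearn.linear_model import LogisticRegression
# Weight class 1 three times as heavily as class 0 to counter class imbalance.
logit = LogisticRegression(class_weight={1: 3, 0: 1},
                           solver='liblinear')
logit.fit(X_train, y_train)
y_pred = logit.predict(X_test)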

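Decision Tree Classifier:
from sklearn.tree import DecisionTreeClassifier
# 'balanced' weights each class inversely to its frequency in y_train.
dTree = DecisionTreeClassifier(class_weight='balanced')
dTree.fit(X_train, y_train)
y_pred = dTree.predict(X_test)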

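Random Forest Classifier:
from sklearn.ensemble import RandomForestClassifier
# Same 'balanced' class weighting, applied across an ensemble of trees.
rForest = RandomForestClassifier(class_weight='balanced')
rForest.fit(X_train, y_train)
y_pred = rForest.predict(X_test)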

After trying different models (KNN, logistic regression, decision tree, and random forest), I chose logistic regression because its coefficients are interpretable, which helped guide further feature engineering.

While the random forest and decision tree models had better training and test scores, their F1 scores remained lower despite changes to the class weights. In contrast, I was able to improve the logistic regression F1 score to 0.40.
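
A side-by-side comparison of this kind can be sketched as a loop that fits each model and records both accuracy and F1. The data here is a synthetic, imbalanced stand-in generated with `make_classification`, so the resulting scores say nothing about the flight data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 20% positives (assumed imbalance, for illustration).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "logit": LogisticRegression(class_weight={1: 3, 0: 1}, solver="liblinear"),
    "forest": RandomForestClassifier(class_weight="balanced", random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),  # F1 of the positive (minority) class
    }
print(scores)
```

Tracking F1 alongside accuracy matters on imbalanced data, where a model can score well on accuracy while rarely identifying the minority class.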

The final logistic regression performed as follows (with class weights of 1:3, 0:1):

Cross Validation R2: 0.80

Accuracy score: 0.73

F1 Score: 0.40
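
To see how an accuracy of 0.73 and an F1 score of 0.40 can coexist, it helps to work through the arithmetic. The confusion counts below are made up and chosen only so that the two numbers come out the same; they are not the model's actual confusion matrix.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts for 1,000 flights, 200 of which are truly delayed.
tp, fp, fn, tn = 90, 160, 110, 640

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f1_from_counts(tp, fp, fn)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Accuracy is carried by the large number of correctly predicted on-time flights, while F1 ignores those true negatives and depends only on how well the delayed class is found.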

Flasking it all together

The model was wrapped in a front end built with Flask and deployed on Heroku. To view the app, visit: predictingflightdelays.herokuapp.com
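
For readers curious about the general shape of such an app, here is a minimal Flask sketch. The `/predict` route, the JSON payload, and the `predict_delay` helper are hypothetical; the deployed app's actual routes and model loading are not shown in this post.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_delay(features: dict) -> float:
    """Hypothetical stand-in for the trained model; returns a delay probability."""
    # A real app would load the fitted logistic regression (for example with
    # pickle) and call predict_proba on the encoded features instead.
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    # Read the flight details sent by the front end as JSON.
    features = request.get_json(force=True)
    return jsonify({"delay_probability": predict_delay(features)})

if __name__ == "__main__":
    app.run(debug=True)
```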

flask_prediction_data_science.gif