NCAA Football Coach Salary Recommendation Project

NCAA FOOTBALL SALARY RECOMMENDATIONS

The purpose of this case study is to take real world data and provide insight, understanding and wisdom to a decision maker. In this case specifically we are dealing with the salaries of NCAA Division I Football Coaches. We are exploring how we can recommend the best salary for Syracuse University's head football coach? First, lets Import the packages we will be using:

# import packages for analysis and modeling
import pandas as pd #data frame operations
import numpy as np #arrays and math functions
from scipy.stats import uniform #for training and test splits
import statsmodels.api as sm  # statistical models (including regression)
import statsmodels.formula.api as smf  # R-like model specification
import matplotlib.pyplot as plt #2D plotting
import seaborn as sns #seaborn for plotting
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Data Preparation

While we were supplied a coaches dataset to work with initially, this did not provide all the data needed to provide a thorough analysis. The following for data sets were combined to give us the information needed:

Coaches data – 2014 coaches data set provided containing school, conference, coach and various types of pay (school pay, total pay, bonus, bonus paid, assistant pay, buyout)
Stadium capacity data – data scraped from www.collegegridirons.com containing stadium name, school, conference, capacity, and year opened
Graduation rates – 2012-2013 school year graduation reates obrainted from the NCAA containing borth GSR and FGR. 2006 cohort graduation rates were used. Contains Year, School. Conference, Sport, State, GSR, FGR
Win Loss Record – Win Loss record for last year (2019) scrapped from www.teamrankings.com

Coaches Data

The initial dataset was supplied to us and is stored in my git repository which was inported into a dataframe. A quick computation of some of the descriptive statistics show that there were 129 total entries (all unique) for school names beloning to 11 unique conferences. In addition we see the count and unique entries for all the columns in the dataset. I also checked for null values and found nine in the dataframe

        School Conference              Coach  ... BonusPaid AssistantPay Buyout
count      129        129                129  ...       129          129    129
unique     129         11                129  ...        51            1    102
top     Hawaii    Big Ten  Bronco Mendenhall  ...        --          $0      --
freq         1         14                  1  ...        41          129     22

[4 rows x 9 columns]
********************************************************************************
Are there any null values? False
********************************************************************************
Columns with null values:
 School          0
Conference      0
Coach           0
SchoolPay       0
TotalPay        0
Bonus           0
BonusPaid       0
AssistantPay    0
Buyout          0
dtype: int64
Number of observations:
 129

A quick look at the head of the dataframe shows us the first 5 rows. We see School, Conference, Coach, SchoolPay, TotalPay, Bonus, BonusPaid, AssistantPay and Buyout. The fist thing we notice is that the dollar amounts are objects which will need to be covereted into mueric format.

coaches.head()

	School	Conference	Coach	SchoolPay	TotalPay	Bonus	BonusPaid	AssistantPay	Buyout
0	Air Force	Mt. West	Troy Calhoun	885000	885000	247000	--	$0	--
1	Akron	MAC	Terry Bowden	$411,000	$412,500	$225,000	$50,000	$0	$688,500
2	Alabama	SEC	Nick Saban	$8,307,000	$8,307,000	$1,100,000	$500,000	$0	$33,600,000
3	Alabama at Birmingham	C-USA	Bill Clark	$900,000	$900,000	$950,000	$165,471	$0	$3,847,500
4	Appalachian State	Sun Belt	Scott Satterfield	$712,500	$712,500	$295,000	$145,000	$0	$2,160,417

Our dollar amounts are actually stored as astrings so I ran a function to convert them to numeric values and reinspected.

# Clean up: convert dollar amounts from object to numeric
coaches.head()

	School	Conference	Coach	SchoolPay	TotalPay	Bonus	BonusPaid	Buyout
0	Air Force	Mt. West	Troy Calhoun	885000.0	885000.0	247000.0	0.0	0.0
1	Akron	MAC	Terry Bowden	411000.0	412500.0	225000.0	50000.0	688500.0
2	Alabama	SEC	Nick Saban	8307000.0	8307000.0	1100000.0	500000.0	33600000.0
3	Alabama at Birmingham	C-USA	Bill Clark	900000.0	900000.0	950000.0	165471.0	3847500.0
4	Appalachian State	Sun Belt	Scott Satterfield	712500.0	712500.0	295000.0	145000.0	2160417.0

Webscraping Stadium Capacity Data

Syed Nazrul has a great Webscraping tutorial he created on you tube using pokemon data which you can see visit below

PokeScrape 1: Scraping an online PokeDex (Pokemon data) with Python

I went to the College Gridirons Website and scrapped data on college football stadiums by capacity. This data included the Stadium Name, The College, Conference, Capacity and the year it opened. Once scrapped, the data required extensive cleaned up and once completed it was stored in a new dataframe.

# create Stadiums df
stadiums.head()

	Stadium	School	Conference	Capacity	Opened
58	Falcon Stadium	Air Force	Mountain West	52237.0	1962
107	InfoCision Stadium	Akron	MAC	30000.0	2009
5	Bryant Denny Stadium	Alabama	SEC	101821.0	1929
118	Kidd Brewer Stadium	Appalachian State	Sun Belt	24150.0	1962
50	Arizona Stadium	Arizona	Pac 12	56037.0	1928

Import Graduation Rates Data

From Bleacher report September 4, 2013
https://bleacherreport.com/articles/1731218-college-football-schools-with-the-best-and-worst-graduation-rates#slide0

FGR - Federal Graduation Rate - This method keeps track of first-time freshmen who are full-time students. It follows those students over the course of a six-year period to see if they graduate at the same institution at which they began their secondary education. This method does not account for students who transfer.
GSR - Graduation Success Rate - Uses the same type of data as the FGR, but the GSR data also takes into account students who transfer into a school and graduate from that same school. It also doesn’t count against a school’s rate when a student-athlete in good standing transfers out of that institution to attend another.

The 2013 Graduation Rate data was obtained using the 2006 cohort graduation rates for both GSR and FGR. There was some clean up required to ensure that the school names matched the entries in the original coaches dataset. For example, US Naval Academy was listed as Navy in the coaches set.

# import graduation rate data
# obtained from https://web3.ncaa.org/aprsearch/gsrsearch
# Sport = Football, Year = 2012-2013
gradrates.head()

	Cohort Year	School	Conference	Sport	State	GSR	FGR
210	2006	Air Force	Mountain West Conference	Football	CO	93	NaN
0	2006	Akron	Mid-American Conference	Football	OH	58	52.0
3	2006	Alabama	Southeastern Conference	Football	AL	73	57.0
1	2006	Alabama A&M	Southwestern Athletic Conf.	Football	AL	40	45.0
2	2006	Alabama State	Southwestern Athletic Conf.	Football	AL	81	59.0

Win Loss Record

2019 Win-Loss Record was scrapped from the website www.teamrankings.com and a new dataframe was created. This data also required extensive cleanup to ensure the school names match and was then stored in its own dataframe.

#Scrape 2019 Win-Loss Records
#create url
url='https://www.teamrankings.com/ncf/trends/win_trends/?range=yearly_2019'
WinLoss.head()

	School	Win-Loss Record2019	WinPercentage2019	MOV	ATS	WinPer2019
9	Air Force	11-2-0	84.6%	14.2	+5.7	0.846
129	Akron	0-12-0	0.0%	-25.8	-11.6	0.000
14	Alabama	11-2-0	84.6%	28.6	+0.6	0.846
2	Appalachian State	13-1-0	92.9%	18.8	+4.9	0.929
108	Arizona	4-8-0	33.3%	-8.8	-5.5	0.333

All the data has now been obtained and was merged into one dataframe which I named "finalcoaches" consisting of the coaches, gradrates, stadium capacity and win-loss record sets. The sets were merged on School Name.

# Merge all dfs to create finalcoaches
merge1 = coaches.merge(stadiums, on='School', how='left') #Add stadium data to coaches, match on School to create merge1
merge2 = merge1.merge(gradrates, on='School', how='left') #Add gradrates data to merge1, match on School to create merge2
finalcoaches = merge2.merge(WinLoss, on='School', how='left') #Add WinLoss data to merge2, match on School to create finalcoaches

#drop uneeded columns
finalcoaches = finalcoaches.drop(['Stadium', 'Conference_y', 'Conference',
                                  'Sport',
                                  'Opened','MOV','ATS'], axis=1)

# rename
finalcoaches.rename(columns={'Conference_x':'Conference'}, inplace=True)

# using apply function to create a new column for graphing purposes (thousands)
finalcoaches['TotalPayThousands'] = finalcoaches.apply(lambda row: (row.TotalPay * 0.001), axis = 1)

# compute descriptive statistics for original variables
print(finalcoaches.describe())
print("*"*80)
print(finalcoaches.head())

          SchoolPay      TotalPay  ...  WinPer2019  TotalPayThousands
count  1.290000e+02  1.290000e+02  ...  129.000000         129.000000
mean   2.335563e+06  2.342113e+06  ...    0.521217        2342.113140
std    1.898654e+06  1.903114e+06  ...    0.217709        1903.113765
min    0.000000e+00  0.000000e+00  ...    0.000000           0.000000
25%    7.625700e+05  7.625700e+05  ...    0.333000         762.570000
50%    1.800000e+06  1.830000e+06  ...    0.538000        1830.000000
75%    3.550000e+06  3.550000e+06  ...    0.643000        3550.000000
max    8.307000e+06  8.307000e+06  ...    1.000000        8307.000000

[8 rows x 12 columns]
********************************************************************************
                  School Conference  ... WinPer2019  TotalPayThousands
0              Air Force   Mt. West  ...      0.846              885.0
1                  Akron        MAC  ...      0.000              412.5
2                Alabama        SEC  ...      0.846             8307.0
3  Alabama at Birmingham      C-USA  ...      0.643              900.0
4      Appalachian State   Sun Belt  ...      0.929              712.5

[5 rows x 18 columns]

Upon review, it is observed that there are some null values from the stadium capacity and win record data. We will now take a look at these rows for further investigation.

#check for null values in the dataframe
are_null_values = finalcoaches.isnull().values.any()
num_nulls = finalcoaches.isnull().sum()
print("*"*80)
print("Are there any null values?",are_null_values)
print("*"*80)
print("Columns with null values: ",num_nulls)
print("Number of observations: ",len(finalcoaches))

********************************************************************************
Are there any null values? True
********************************************************************************
Columns with null values:  School                 0
Conference             0
Coach                  0
SchoolPay              0
TotalPay               0
Bonus                  0
BonusPaid              0
AssistantPay           0
Buyout                 0
Capacity               1
Cohort Year            4
State                  4
GSR                    4
FGR                    8
Win-Loss Record2019    0
WinPercentage2019      0
WinPer2019             0
TotalPayThousands      0
dtype: int64
Number of observations:  129

There were 8 total schools missing FGR, 4 of which were also missing their GSR. Since we have more data for GSR we will use that information rather than the FGR. In addition, I believe this is the rate we should use as this is the method the NCAA is currently using to evaluate graduation rates. The schools missing the GSR values would be dropped from the set.

# Schools missing FGR (Federal Graduation Rate) and GSR (Graduate Success Rate)
finalcoaches[finalcoaches['FGR'].isna()]

	School	Conference	Coach	SchoolPay	TotalPay	Bonus	BonusPaid	Buyout	Capacity	Cohort Year	State	GSR	FGR	Win-Loss Record2019	WinPercentage2019	WinPer2019	TotalPayThousands
0	Air Force	Mt. West	Troy Calhoun	885000.0	885000.0	247000.0	0.0	0.0	52237.0	2006.0	CO	93.0	NaN	11-2-0	84.6%	0.846	885.000
9	Army	Ind.	Jeff Monken	932521.0	932521.0	0.0	0.0	0.0	38000.0	2006.0	NY	89.0	NaN	5-8-0	38.5%	0.385	932.521
21	Charlotte	C-USA	Brad Lambert	625000.0	625000.0	120000.0	0.0	556389.0	15314.0	NaN	NaN	NaN	NaN	7-6-0	53.8%	0.538	625.000
38	Georgia State	Sun Belt	Shawn Elliott	569000.0	569000.0	220000.0	60000.0	1500000.0	23000.0	NaN	NaN	NaN	NaN	7-6-0	53.8%	0.538	569.000
69	Navy	AAC	Ken Niumatalolo	2163000.0	2163000.0	0.0	0.0	0.0	34000.0	2006.0	MD	84.0	NaN	11-2-0	84.6%	0.846	2163.000
85	Old Dominion	C-USA	Bobby Wilder	654667.0	654667.0	504895.0	0.0	1200000.0	20118.0	2006.0	VA	50.0	NaN	1-11-0	8.3%	0.083	654.667
95	South Alabama	Sun Belt	Steve Campbell	600000.0	600000.0	295000.0	0.0	918333.0	40646.0	NaN	NaN	NaN	NaN	2-10-0	16.7%	0.167	600.000
110	Texas-San Antonio	C-USA	Frank Wilson	1100000.0	1100000.0	185000.0	17500.0	3562500.0	65000.0	NaN	NaN	NaN	NaN	4-8-0	33.3%	0.333	1100.000

Since total pay was converted into numerical values, I looked for all schools where the total pay was listed as "0". The 4 schools with no listed total pay would be dropped from the set.

# Schools missing TotalPay
finalcoaches.loc[finalcoaches['TotalPay'] == 0]

	School	Conference	Coach	Capacity	Cohort Year	State	GSR	FGR	Win-Loss Record2019	WinPercentage2019	WinPer2019
12	Baylor	Big 12	Matt Rhule	45140.0	2006.0	TX	67.0	48.0	11-3-0	78.6%	0.786
16	Brigham Young	Ind.	Kalani Sitake	63725.0	2006.0	UT	57.0	42.0	7-6-0	53.8%	0.538
91	Rice	C-USA	Mike Bloomgren	47000.0	2006.0	TX	96.0	89.0	3-9-0	25.0%	0.250
99	Southern Methodist	AAC	Sonny Dykes	32000.0	2006.0	TX	75.0	63.0	10-3-0	76.9%	0.769

Lastly I looked at schools where the stadium capacity was not listed. Only one school was missing the capacity value, which was dropped from the dataset.

# Schools missing Stadium Capacity
finalcoaches[finalcoaches['Capacity'].isna()]

	School	Conference	Coach	SchoolPay	TotalPay	Bonus	BonusPaid	AssistantPay	Buyout	Capacity	Cohort Year	State	GSR	FGR	Win-Loss Record2019	WinPercentage2019	WinPer2019	TotalPayThousands
51	Liberty	Ind.	Turner Gill	947281.0	947281.0	0.0	0.0	0.0	0.0	NaN	2006.0	VA	65.0	54.0	8-5-0	61.5%	0.615	947.281

#Cleanup

# Drop all schools with 0 TotalPay
finalcoaches = finalcoaches.drop(finalcoaches.index[finalcoaches.TotalPay == 0])
# DROP GSR null
finalcoaches = finalcoaches.dropna(axis=0, subset=['GSR'])
# DROP missing Stadium Capacity
finalcoaches = finalcoaches.dropna(axis=0, subset=['Capacity'])

Power 5 Conferences

I also decided to add a column to our final dataframe indicating whether or not the school was a member of the "Power5" conferences. I suspected that these conferences had a significantly higher pay than the other conferences as they would want to recruit the top football coaches for their schools.

In college football, the term Power Five conferences refers to five athletic conferences whose members are part of the Football Bowl Subdivision (FBS) of NCAA Division I, the highest level of collegiate football in the United States. The conferences are: the Atlantic Coast Conference (ACC), Big Ten Conference, Big 12 Conference, Pac-12 Conference, and Southeastern Conference (SEC). The term "Power Five" is not defined by the National Collegiate Athletic Association (NCAA), and the origin of the term is unknown. It has been used in its current meaning since at least 2006.

Power Five Conferences

Descriptive Visualizations

First I decided to create a boxplot of total pay by conference as I suspected there were some major differences. In addition, I added "Power5" as a hue to see if there was a differene between the conferences belonging to the Power5 and the remaining conferences.

Our first observation is that the median total pay for the Power5 are higher than the remaining conferences. We see that their median amount is higher than the maximum for all other conferences.

# Create Boxplots By Conference
coachesbox = sns.boxplot(x="Conference",
                         y="TotalPayThousands",
                         data=finalcoaches,
                         hue="Power5")

#add title, xlabel and y label
plt.title('Coaches Total Pay By Conference', fontsize = 18)
plt.xlabel('Conferences')
plt.ylabel('Total Pay (Thousands)')
#rotate day of week (dow) names to reduce overlap
coachesbox.set_xticklabels(coachesbox.get_xticklabels(),rotation=45)
plt.show()

png

When arranged by median TotalPay, we see the top 5 schools are all members of the Power5, in ascending order Pac-12, ACC, Big12, SEC, Big10. I was expecting the SEC to have higher numbers than all the other conferences based on the strength of this conference. It had the highest maximum, minimum and upper quartile of all the conferences. It's lower quartile was lower than the Big 12's and Big Ten. It's medium is lower than the Big Ten's.

One thing to note is that the 6th school on the list (be median) is the AAC (The American), which was known as the Big East prior to the 2010-2014 NCAA Conference Realignment. The Big East was the part of the original sex college power confereces of the Bowl Championshop Series (BCS).

#https://stackoverflow.com/questions/21912634/how-can-i-sort-a-boxplot-in-pandas-by-the-median-values

def boxplot_sorted(df, by, column):
  df2 = pd.DataFrame({col:vals[column] for col, vals in df.groupby(by)})
  meds = df2.median().sort_values(ascending=True)
  df2[meds.index].boxplot(rot=45)

boxplot_sorted(finalcoaches, by=["Conference"], column="TotalPayThousands")
plt.title('Coaches Total Pay By Conference (Arranged by Median)', fontsize = 18)
plt.xlabel('Conferences')
plt.ylabel('Total Pay (Thousands)')

png

When comparing Power5 vs Non-Power5, we see that there is a significant higher TotalPay for the Power5 conferences. The minimum TotalPay is lower than the maximum, however we see that all other measures of statistical distribution for the Power 5 schools are higher than the maximum Total Pay of the non-Power5 schools.

# Create Boxplots By Conference
coachesbox2 = sns.boxplot(x="Power5",
                         y="TotalPayThousands",
                         data=finalcoaches
                         )

#add title, xlabel and y label
plt.title('Coaches Total Power5 vs Non-Power5', fontsize = 18)
plt.xlabel('Conferences')
plt.ylabel('Total Pay (Thousands)')
#rotate day of week (dow) names to reduce overlap
coachesbox.set_xticklabels(coachesbox.get_xticklabels(),rotation=45)
plt.show()

png

I also created a correlation matrix to see what relationships we could find between the variables. I settled on the variables Capacity, GSR, WinPercentage2019(WinPer2019), and all the Conference variables. We see that Capacity has the highest correlation and has the biggest effect on Total Pay (positive). We also see that GSR, WinPer2019, as well as the 5 elite schools have a positive effect on Pay, with the SEC conference having the biggest effect. Notice that the non-Power5 all have a negative effect on pay.

# Check Correlation Matrix
coachescorr = finalcoaches.drop(['AssistantPay', 'Cohort Year', 'SchoolPay', 'Bonus', 'BonusPaid',
                                 'Buyout', 'TotalPayThousands', 'FGR'], axis=1)
# create corr df
corr = coachescorr.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Print heatmap
plt.figure(figsize=(12,8))
sns.heatmap(coachescorr.corr(), mask=mask,annot = True, vmin=-1, vmax=1, center= 0, cmap= 'coolwarm',
           linewidths=.5 #, linecolor='black'
           )                                             

png

I wanted to ignore the conferences for a second and look at just Capacity, GSR and Win Percentage to see what relationship existed between them and TotalPay. We see that there is definitely a positive relationship between TotalPay and Capacity but it is harder to tell with GSR and Win Percentage.

# Plot Total Pay vs Capacity, GSR, Win Percentage(2019)
# visualize the relationship between the features and the response using scatterplots
g=sns.pairplot(finalcoaches, x_vars=['Capacity','GSR','WinPer2019'],
             y_vars='TotalPay', height=4, aspect=1, kind='reg'
              )
#plt.title('Total Pay vs. Capacity, GSR, 2019 Win Percentage', y=4, fontsize = 16)
plt.subplots_adjust(top=.85)
g.fig.suptitle('Total Pay vs. Capacity, GSR, 2019 Win Percentage',fontsize = 18)

png

Although I was not looking at specific conferences, I still wanted to see if being a member of the Power5 played a role and plotted the points again with distinction between the two. When we use that distinction to color the points on the graph we see some patterns that weren’t as apparent in the previous plot.

In the Capacity graph, I notice that although there is an overall increase, the rate at which they occur is not the same across schools as evident by the slopes of the two lines. We see a similar scenario play out when it comes to WinPercentage. What this tells me, is that win percentage as well as stadium capacity have a larger effect on the Power5 football conference school’s total pay, when compared to the non-Power5 conference schools. For GSR, we see that the slopes are similar. We see that yes, there is an increase in TotalPay, albeit minor. In laymen’s terms what we are seeing is this, Winning percentage and Stadium size play a larger role in a Coach's pay than graduation rates do.

# Plot Total Pay vs Capacity, GSR, Win Percentage(2019)
# visualize the relationship between the features and the response using scatterplots
g=sns.pairplot(finalcoaches, x_vars=['Capacity','GSR','WinPer2019'], hue="Power5",
             y_vars='TotalPay', height=4, aspect=1, kind='reg'
              )
#plt.title('Total Pay vs. Capacity, GSR, 2019 Win Percentage', y=4, fontsize = 16)
plt.subplots_adjust(top=.85)
g.fig.suptitle('Total Pay vs. Capacity, GSR, 2019 Win Percentage',fontsize = 18)

png

I also replicated this graph using facet grids by conference. For the most part the faceted grids behaved similarly to the grouped versions above.

For Total Pay vs Capacity, most conferences had a positive relationship between the two, but we see steeper slopes for the Power5 schools. It was difficult to see the relatinship for Mt West, MAC and Pac-12

#print("Total Pay vs. Capacity by Conference")
g = sns.FacetGrid(finalcoaches, col="Conference", col_wrap=4, hue="Power5", )
g.map(plt.scatter, "Capacity","TotalPay", alpha=1);
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Total Pay vs. Capacity (by Conference)',fontsize = 18)

png

For Total Pay vs GSR, we also see positive relationships for most of the schools with there being a significant slope difference in the Power5. It was difficult to see a pattern for the Big Ten.

There was one particular pattern here that I thought was interesting. It seems that for the ACC and Big 12, as graduation rates increased the TotalPay actually went down which didnt match my inital expectations. I wondered if the reason for this had to do with the relationship between graduation rates and wins, which was slightly positive in the correlation matrix (0.18)? Maybe there were some schools that lost some of their good players due to academics and it impacted their winning percentage? Potentially, a losing record could lead to a coach replacement and maybe offering a higher salary to recruit a better coach?

#print("Total Pay vs. GSR by Conference")
g = sns.FacetGrid(finalcoaches, col="Conference", col_wrap=4, hue="Power5")
g.map(plt.scatter, "GSR","TotalPay", alpha=1);
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Total Pay vs. GSR (by Conference)',fontsize = 18)

png

For Total Pay vs WinPercentage, most conferences had a positive relationship between the two, but we see steeper slopes for the Power5 schools. It was difficult to see the relatinship for Pac-12

#print("Total Pay vs. Win Percentage by Conference")
g = sns.FacetGrid(finalcoaches, col="Conference", col_wrap=4, hue="Power5")
g.map(plt.scatter, "WinPer2019","TotalPay", alpha=1);
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Total Pay vs. 2019 Win Percentage (by Conference)',fontsize = 18)

png

Model Creation

I will create a model using the ordinarly least sums (OLS) method for simple lenear regression. First, let me create my test and training sets, 2/3 for training and 1/3 for testing.

# employ training-and-test regimen for model validation
np.random.seed(1234)
finalcoaches['runiform'] = uniform.rvs(loc = 0, scale = 1, size = len(finalcoaches))
finalcoaches_train = finalcoaches[finalcoaches['runiform'] >= 0.33]
finalcoaches_test = finalcoaches[finalcoaches['runiform'] < 0.33]

********************************************************************************

finalcoaches_train data frame (rows, columns):  (86, 31)
********************************************************************************
                  School Conference              Coach  ...  Big10  Big12  runiform
1                  Akron        MAC       Terry Bowden  ...      0      0  0.622109
2                Alabama        SEC         Nick Saban  ...      0      0  0.437728
3  Alabama at Birmingham      C-USA         Bill Clark  ...      0      0  0.785359
4      Appalachian State   Sun Belt  Scott Satterfield  ...      0      0  0.779976
7               Arkansas        SEC        Chad Morris  ...      0      0  0.801872

[5 rows x 31 columns]
********************************************************************************

finalcoaches_test data frame (rows, columns):  (34, 31)
********************************************************************************
             School Conference           Coach  ...  Big10  Big12  runiform
0         Air Force   Mt. West    Troy Calhoun  ...      0      0  0.191519
5           Arizona     Pac-12    Kevin Sumlin  ...      0      0  0.272593
6     Arizona State     Pac-12    Herm Edwards  ...      0      0  0.276464
19  Central Florida        AAC     Josh Heupel  ...      0      0  0.013768
25         Colorado     Pac-12  Mike MacIntyre  ...      0      0  0.075381

[5 rows x 31 columns]

Model 1: With Conference + Capacity + WinPer2019 + GSR

The first model I created would test the effect Conference, Stadium Capacity, 2019 WIn Percentage and Graduation Rate (GSR) had on the total pay.

# specify a simple model with Team Conference
conference_model = str('TotalPay ~ SEC + CUSA + SunBelt + Pac12 + Ind + ACC + AAC + Big10 + Big12 + MtWest + MAC + Capacity + WinPer2019 + GSR')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               TotalPay   R-squared:                       0.810
Model:                            OLS   Adj. R-squared:                  0.776
Method:                 Least Squares   F-statistic:                     23.60
Date:                Sat, 11 Apr 2020   Prob (F-statistic):           7.22e-21
Time:                        19:37:51   Log-Likelihood:                -1295.2
No. Observations:                  86   AIC:                             2618.
Df Residuals:                      72   BIC:                             2653.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -1.591e+06   7.73e+05     -2.059      0.043   -3.13e+06   -5.08e+04
SEC         1.175e+06   3.59e+05      3.271      0.002    4.59e+05    1.89e+06
CUSA       -8.397e+05   2.87e+05     -2.927      0.005   -1.41e+06   -2.68e+05
SunBelt    -6.237e+05   3.74e+05     -1.669      0.100   -1.37e+06    1.21e+05
Pac12       3.165e+05   3.23e+05      0.979      0.331   -3.28e+05    9.61e+05
Ind        -1.785e+06   6.74e+05     -2.649      0.010   -3.13e+06   -4.42e+05
ACC         4.487e+05   3.16e+05      1.422      0.159    -1.8e+05    1.08e+06
AAC        -4.176e+05   3.43e+05     -1.217      0.228    -1.1e+06    2.66e+05
Big10       6.661e+05   3.86e+05      1.726      0.089   -1.03e+05    1.44e+06
Big12       1.087e+06   5.19e+05      2.095      0.040    5.28e+04    2.12e+06
MtWest     -9.013e+05   3.24e+05     -2.778      0.007   -1.55e+06   -2.55e+05
MAC        -7.173e+05   3.18e+05     -2.255      0.027   -1.35e+06   -8.31e+04
Capacity      41.6164      7.175      5.800      0.000      27.314      55.919
WinPer2019  9.855e+05   5.38e+05      1.831      0.071   -8.74e+04    2.06e+06
GSR         2.083e+04   1.07e+04      1.944      0.056    -527.530    4.22e+04
==============================================================================
Omnibus:                        1.269   Durbin-Watson:                   1.745
Prob(Omnibus):                  0.530   Jarque-Bera (JB):                0.843
Skew:                           0.230   Prob(JB):                        0.656
Kurtosis:                       3.156   Cond. No.                     1.12e+18
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.12e-25. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Model 2: Model 1 w/ no Conference

Model 2 would remove the conferences as variables to see if removing this variable had any significant difference.

# specify a simple model with Team Conference
elite_model = str('TotalPay ~ Capacity + WinPer2019 + GSR')

********************************************************************************
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               TotalPay   R-squared:                       0.722
Model:                            OLS   Adj. R-squared:                  0.711
Method:                 Least Squares   F-statistic:                     70.81
Date:                Sat, 11 Apr 2020   Prob (F-statistic):           1.08e-22
Time:                        19:37:51   Log-Likelihood:                -1311.6
No. Observations:                  86   AIC:                             2631.
Df Residuals:                      82   BIC:                             2641.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -2.49e+06    7.1e+05     -3.506      0.001    -3.9e+06   -1.08e+06
Capacity      65.4024      4.922     13.287      0.000      55.610      75.195
WinPer2019  2.604e+05   5.71e+05      0.456      0.649   -8.75e+05     1.4e+06
GSR         2.092e+04   1.01e+04      2.063      0.042     749.416    4.11e+04
==============================================================================
Omnibus:                        4.660   Durbin-Watson:                   2.057
Prob(Omnibus):                  0.097   Jarque-Bera (JB):                4.579
Skew:                          -0.304   Prob(JB):                        0.101
Kurtosis:                       3.953   Cond. No.                     3.58e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.58e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 3: Model 1 + State

Model 3 would add a state variable to model 1

# specify a simple model with Team Conference
state_model = str('TotalPay ~ Conference + Capacity + WinPer2019 + GSR + State')

********************************************************************************
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               TotalPay   R-squared:                       0.910
Model:                            OLS   Adj. R-squared:                  0.813
Method:                 Least Squares   F-statistic:                     9.407
Date:                Sat, 11 Apr 2020   Prob (F-statistic):           2.18e-11
Time:                        19:37:51   Log-Likelihood:                -1263.1
No. Observations:                  86   AIC:                             2616.
Df Residuals:                      41   BIC:                             2727.
Df Model:                          44                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept              -1.837e+06   1.17e+06     -1.569      0.124    -4.2e+06    5.28e+05
Conference[T.ACC]       1.037e+06   5.94e+05      1.746      0.088   -1.63e+05    2.24e+06
Conference[T.Big 12]    9.164e+05   8.17e+05      1.121      0.269   -7.34e+05    2.57e+06
Conference[T.Big Ten]   1.494e+06   7.82e+05      1.910      0.063   -8.59e+04    3.07e+06
Conference[T.C-USA]    -9.452e+05   5.33e+05     -1.775      0.083   -2.02e+06     1.3e+05
Conference[T.Ind.]     -1.109e+06   1.02e+06     -1.085      0.284   -3.17e+06    9.56e+05
Conference[T.MAC]      -6.375e+05    6.6e+05     -0.966      0.340   -1.97e+06    6.95e+05
Conference[T.Mt. West] -1.117e+06   6.69e+05     -1.669      0.103   -2.47e+06    2.34e+05
Conference[T.Pac-12]   -4.127e+04    6.7e+05     -0.062      0.951   -1.39e+06    1.31e+06
Conference[T.SEC]       1.778e+06   6.12e+05      2.907      0.006    5.43e+05    3.01e+06
Conference[T.Sun Belt] -5.985e+05   5.86e+05     -1.021      0.313   -1.78e+06    5.85e+05
State[T.AR]            -9.926e+05   7.54e+05     -1.316      0.195   -2.52e+06     5.3e+05
State[T.AZ]             3.228e-10   2.02e-09      0.159      0.874   -3.77e-09    4.41e-09
State[T.CA]            -5.391e+05   4.11e+05     -1.312      0.197   -1.37e+06    2.91e+05
State[T.CO]             1.591e+05   8.79e+05      0.181      0.857   -1.62e+06    1.93e+06
State[T.CT]             -9.25e+05    1.1e+06     -0.839      0.406   -3.15e+06     1.3e+06
State[T.DC]             7.999e-10   3.61e-09      0.222      0.826   -6.49e-09    8.09e-09
State[T.DE]             7.933e-10   3.35e-09      0.237      0.814   -5.97e-09    7.55e-09
State[T.FL]            -3.868e+04   6.88e+05     -0.056      0.955   -1.43e+06    1.35e+06
State[T.GA]            -3.504e+05   8.28e+05     -0.423      0.674   -2.02e+06    1.32e+06
State[T.HI]              8.26e-10   1.79e-09      0.462      0.647   -2.79e-09    4.44e-09
State[T.IA]             3.631e-10   6.61e-10      0.549      0.586   -9.72e-10     1.7e-09
State[T.ID]            -1.039e+06   8.86e+05     -1.173      0.247   -2.83e+06    7.49e+05
State[T.IL]             1.132e+05   1.24e+06      0.091      0.928   -2.39e+06    2.61e+06
State[T.IN]            -1.766e+06   9.92e+05     -1.780      0.083   -3.77e+06    2.38e+05
State[T.KS]            -1.698e+05   1.23e+06     -0.138      0.891   -2.65e+06    2.31e+06
State[T.KY]            -8.232e+05   7.95e+05     -1.036      0.306   -2.43e+06    7.82e+05
State[T.LA]            -1.147e+06   7.14e+05     -1.606      0.116   -2.59e+06    2.95e+05
State[T.MA]            -2.138e+06   1.11e+06     -1.933      0.060   -4.37e+06    9.58e+04
State[T.MD]            -1.572e+06   9.61e+05     -1.636      0.109   -3.51e+06    3.68e+05
State[T.ME]             3.503e-10   1.72e-10      2.040      0.048    3.51e-12    6.97e-10
State[T.MI]             -3.03e+05   9.19e+05     -0.330      0.743   -2.16e+06    1.55e+06
State[T.MN]             1.342e-10   1.75e-10      0.766      0.448   -2.19e-10    4.88e-10
State[T.MO]             2.339e-10   2.58e-10      0.908      0.369   -2.86e-10    7.54e-10
State[T.MS]            -1.545e+06   6.88e+05     -2.247      0.030   -2.93e+06   -1.56e+05
State[T.MT]             3.608e-10   3.63e-10      0.993      0.327   -3.73e-10    1.09e-09
State[T.NC]            -1.587e+06   6.95e+05     -2.283      0.028   -2.99e+06   -1.83e+05
State[T.ND]             3.297e-11   2.26e-10      0.146      0.885   -4.23e-10    4.89e-10
State[T.NE]             8.483e-11   3.77e-10      0.225      0.823   -6.78e-10    8.47e-10
State[T.NH]             1.016e-10   3.47e-10      0.293      0.771      -6e-10    8.03e-10
State[T.NJ]            -2.442e+06   1.26e+06     -1.943      0.059   -4.98e+06    9.63e+04
State[T.NM]            -5.991e-11   1.43e-10     -0.420      0.676   -3.48e-10    2.28e-10
State[T.NV]            -6.012e+05   6.64e+05     -0.905      0.371   -1.94e+06    7.41e+05
State[T.NY]            -1.263e+06   9.17e+05     -1.377      0.176   -3.11e+06    5.89e+05
State[T.OH]             -5.86e+05   8.44e+05     -0.695      0.491   -2.29e+06    1.12e+06
State[T.OK]            -4.537e-10   2.79e-10     -1.626      0.112   -1.02e-09     1.1e-10
State[T.OR]            -8.948e+05   6.99e+05     -1.280      0.208   -2.31e+06    5.17e+05
State[T.PA]            -2.073e+06   8.94e+05     -2.319      0.025   -3.88e+06   -2.68e+05
State[T.RI]                     0          0        nan        nan           0           0
State[T.SC]            -3.648e+05   6.76e+05     -0.539      0.593   -1.73e+06       1e+06
State[T.SD]                     0          0        nan        nan           0           0
State[T.TN]            -1.559e+06   6.34e+05     -2.459      0.018   -2.84e+06   -2.78e+05
State[T.TX]             -4.94e+04   6.15e+05     -0.080      0.936   -1.29e+06    1.19e+06
State[T.UT]             5.027e+05   8.95e+05      0.562      0.577   -1.31e+06    2.31e+06
State[T.VA]             8.768e+05   1.05e+06      0.835      0.409   -1.24e+06       3e+06
State[T.WA]             1.013e+06   6.97e+05      1.454      0.153   -3.94e+05    2.42e+06
State[T.WI]             -2.09e+06   1.21e+06     -1.722      0.093   -4.54e+06    3.61e+05
State[T.WV]            -9.044e+05    9.9e+05     -0.913      0.366    -2.9e+06     1.1e+06
State[T.WY]             2.412e+05   8.52e+05      0.283      0.779   -1.48e+06    1.96e+06
Capacity                  32.6875      9.016      3.625      0.001      14.479      50.896
WinPer2019              1.375e+06   6.72e+05      2.047      0.047    1.84e+04    2.73e+06
GSR                     3.577e+04   1.24e+04      2.880      0.006    1.07e+04    6.09e+04
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.343
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                1.065
Skew:                           0.035   Prob(JB):                        0.587
Kurtosis:                       3.541   Cond. No.                     9.27e+20
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.12e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Model 4: Model 2 + State

Model 4 would add a state variable to Model 2

# specify a simple model with Team Conference
statenoconf_model = str('TotalPay ~ Capacity + WinPer2019 + GSR + State')

********************************************************************************
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               TotalPay   R-squared:                       0.833
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     7.147
Date:                Sat, 11 Apr 2020   Prob (F-statistic):           3.20e-10
Time:                        19:37:51   Log-Likelihood:                -1289.5
No. Observations:                  86   AIC:                             2651.
Df Residuals:                      50   BIC:                             2739.
Df Model:                          35                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -3.032e+06   9.95e+05     -3.049      0.004   -5.03e+06   -1.03e+06
State[T.AR] -5.295e+05   9.11e+05     -0.581      0.564   -2.36e+06     1.3e+06
State[T.AZ]  4.821e-07   1.17e-06      0.413      0.681   -1.86e-06    2.83e-06
State[T.CA] -1.311e+06   6.74e+05     -1.944      0.058   -2.66e+06    4.35e+04
State[T.CO] -8.516e+05    1.2e+06     -0.708      0.482   -3.27e+06    1.56e+06
State[T.CT] -9.643e+05    1.2e+06     -0.803      0.426   -3.38e+06    1.45e+06
State[T.DC] -1.906e-06   4.25e-06     -0.448      0.656   -1.04e-05    6.64e-06
State[T.DE]  1.455e-05   3.25e-05      0.448      0.656   -5.07e-05    7.98e-05
State[T.FL]  1.131e+05    7.4e+05      0.153      0.879   -1.37e+06     1.6e+06
State[T.GA]  1.471e+05   9.19e+05      0.160      0.873    -1.7e+06    1.99e+06
State[T.HI]  7.475e-06   1.67e-05      0.448      0.656    -2.6e-05     4.1e-05
State[T.IA] -1.051e-06   2.35e-06     -0.448      0.656   -5.76e-06    3.66e-06
State[T.ID] -1.273e+06   1.25e+06     -1.021      0.312   -3.78e+06    1.23e+06
State[T.IL]  1.244e+06   1.17e+06      1.063      0.293   -1.11e+06     3.6e+06
State[T.IN] -1.688e+06   8.12e+05     -2.078      0.043   -3.32e+06   -5.63e+04
State[T.KS]  9.563e+05   1.16e+06      0.824      0.414   -1.37e+06    3.29e+06
State[T.KY]  1.145e+05   9.15e+05      0.125      0.901   -1.72e+06    1.95e+06
State[T.LA] -8.827e+05   8.41e+05     -1.050      0.299   -2.57e+06    8.06e+05
State[T.MA]  -9.71e+05   1.25e+06     -0.779      0.440   -3.48e+06    1.53e+06
State[T.MD] -5.403e+05   9.37e+05     -0.577      0.567   -2.42e+06    1.34e+06
State[T.ME] -3.224e-06   7.19e-06     -0.448      0.656   -1.77e-05    1.12e-05
State[T.MI] -8.719e+04   7.33e+05     -0.119      0.906   -1.56e+06    1.38e+06
State[T.MN]  1.524e-06    3.4e-06      0.448      0.656   -5.31e-06    8.36e-06
State[T.MO]  5.489e-09   1.22e-08      0.448      0.656   -1.91e-08    3.01e-08
State[T.MS] -7.809e+05   8.02e+05     -0.974      0.335   -2.39e+06    8.29e+05
State[T.MT]  4.571e-06   1.02e-05      0.448      0.656   -1.59e-05    2.51e-05
State[T.NC] -7.383e+05   7.04e+05     -1.048      0.300   -2.15e+06    6.77e+05
State[T.ND] -2.797e-08   6.36e-08     -0.440      0.662   -1.56e-07    9.97e-08
State[T.NE]  -3.05e-09      6e-09     -0.508      0.614   -1.51e-08    9.01e-09
State[T.NH] -6.276e-10   4.19e-10     -1.497      0.141   -1.47e-09    2.14e-10
State[T.NJ] -1.436e+06   1.24e+06     -1.156      0.253   -3.93e+06    1.06e+06
State[T.NM] -1.486e-09   3.61e-09     -0.412      0.682   -8.73e-09    5.76e-09
State[T.NV] -1.204e+06   9.28e+05     -1.298      0.200   -3.07e+06    6.59e+05
State[T.NY]  -1.13e+06   8.36e+05     -1.352      0.182   -2.81e+06    5.48e+05
State[T.OH] -3.944e+05   6.69e+05     -0.589      0.558   -1.74e+06     9.5e+05
State[T.OK] -3.423e-10   2.02e-10     -1.698      0.096   -7.47e-10    6.26e-11
State[T.OR] -7.126e+05   9.08e+05     -0.785      0.436   -2.54e+06    1.11e+06
State[T.PA] -1.674e+06    9.1e+05     -1.838      0.072    -3.5e+06    1.55e+05
State[T.RI] -3.488e-11   2.19e-11     -1.595      0.117   -7.88e-11    9.05e-12
State[T.SC]  2.014e+05   7.99e+05      0.252      0.802    -1.4e+06    1.81e+06
State[T.SD]          0          0        nan        nan           0           0
State[T.TN] -1.153e+06   7.53e+05     -1.532      0.132   -2.67e+06    3.59e+05
State[T.TX] -3.173e+05   6.75e+05     -0.470      0.640   -1.67e+06    1.04e+06
State[T.UT]  1.025e+06   1.18e+06      0.867      0.390   -1.35e+06     3.4e+06
State[T.VA]  4.709e+05   1.22e+06      0.385      0.702   -1.99e+06    2.93e+06
State[T.WA]  9.857e+05   9.03e+05      1.092      0.280   -8.27e+05     2.8e+06
State[T.WI] -1.286e+06   1.16e+06     -1.109      0.273   -3.61e+06    1.04e+06
State[T.WV] -1.258e+06   1.17e+06     -1.074      0.288   -3.61e+06    1.09e+06
State[T.WY]  9136.5108   1.18e+06      0.008      0.994   -2.36e+06    2.38e+06
Capacity       65.6679      5.839     11.247      0.000      53.941      77.395
WinPer2019   1.094e+05   6.92e+05      0.158      0.875   -1.28e+06     1.5e+06
GSR          3.772e+04   1.29e+04      2.928      0.005    1.18e+04    6.36e+04
==============================================================================
Omnibus:                       17.251   Durbin-Watson:                   2.446
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.275
Skew:                          -0.751   Prob(JB):                     1.62e-07
Kurtosis:                       5.544   Cond. No.                     3.47e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.22e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Analysis of Models

After running all 4 models I compared them in order to choose the model that had the best results. Model 3 had the highest r-squared values however it looks like the majority of the variables had high p-values and there was some correlation between the states and the conferences. I decided to go with Model 1 which has a lower r-square and more significant variables. With an R-squared value of .810, the model I selected should be 81% accurate.

# specify a simple model with Team Conference
conference_model = str('TotalPay ~ SEC + CUSA + SunBelt + Pac12 + Ind + ACC + AAC + Big10 + Big12 + MtWest + MAC + Capacity + WinPer2019 + GSR')

# fit the model to the training set
train_confmodel_fit = smf.ols(conference_model, data = finalcoaches_train).fit()

Using the SKlearn library, I trained and tested by data. Once our model was created we made predictins based on this model.

#linearRegression from SKlearn
lin_reg = LinearRegression()

# train
y_train = finalcoaches_train[['TotalPay']]
X_train = finalcoaches_train[['Capacity', 'GSR', 'WinPer2019', 'SEC', 'CUSA', 'SunBelt', 'Pac12', 'Ind', 'ACC',
                              'AAC', 'Big10', 'Big12', 'MtWest', 'MAC']]
lin_reg.fit(X_train, y_train)


#Predict
y_test = finalcoaches_test[['TotalPay']]
X_test = finalcoaches_test[['Capacity', 'GSR', 'WinPer2019', 'SEC', 'CUSA', 'SunBelt', 'Pac12', 'Ind', 'ACC',
                              'AAC', 'Big10', 'Big12', 'MtWest', 'MAC']]
y_pred = lin_reg.predict(X_test)

Expected Results

Lets look at the expected results

# Actual (Expected) Salary for Syracuse Coach
finalcoaches[finalcoaches['School'].str.match('Syracuse')]

	School	Conference	Coach	SchoolPay	TotalPay	Bonus	BonusPaid	AssistantPay	Buyout	Capacity	Cohort Year	State	GSR	FGR	Win-Loss Record2019	WinPercentage2019	WinPer2019	TotalPayThousands	Power5	MtWest	MAC	SEC	CUSA	SunBelt	Pac12	Ind	ACC	AAC	Big10	Big12	runiform
102	Syracuse	ACC	Dino Babers	2401206.0	2401206.0	0.0	0.0	0.0	0.0	49250.0	2006.0	NY	77.0	64.0	5-7-0	41.7%	0.417	2401.206	Y	0	0	0	0	0	0	0	1	0	0	0	0.480359

Predicted Syracuse Salary

Our first prediction was to look at what the predicted value should be for Syracuse University's football coach.

# predict syracuse salary
syr = finalcoaches[finalcoaches['School'] == 'Syracuse']
lin_reg.predict(syr[['Capacity', 'GSR', 'WinPer2019', 'SEC', 'CUSA', 'SunBelt', 'Pac12', 'Ind', 'ACC',
                              'AAC', 'Big10', 'Big12', 'MtWest', 'MAC']])

syrpred = lin_reg.predict(syr[['Capacity', 'GSR', 'WinPer2019', 'SEC', 'CUSA', 'SunBelt', 'Pac12', 'Ind', 'ACC',
                              'AAC', 'Big10', 'Big12', 'MtWest', 'MAC']])


# print amount
print("Syracuse Coach Pay should be ",(locale.currency( syrpred, grouping = True )))

Using the coefficients of the model, I came up with the new recommended TotalPay for the Syracuse University coach by plugging in the variables for the Syracuse data. Our model predicted $2,921.754.10 compared to the actual current pay of $2,401,206.

Syracuse Coach Pay should be  $2,921,754.10

Predicted Syracuse Salary (Big East Conference)

Next we predicted the coach's pay would if Syracuse was still part of the Big East (using No coefficiant). Using the coefficients of our model above, we came up with the new recommended TotalPay for the Syracuse University coach by plugging in the variables for the Syracuse data. Our model predicted $2,473,051.55 compared to the current $2,401,206.

# Predict Big East Pay (Using No Coefficiant)
print(train_confmodel_fit.params)
print(train_confmodel_fit.params[6]) #Coefficiant for ACC
print("*"*80)
acccoef = train_confmodel_fit.params[6]
print("Syracuse Coach Big East Pay should be ",(locale.currency( (syrpred -acccoef), grouping = True )))

Intercept    -1.591071e+06
SEC           1.175190e+06
CUSA         -8.396895e+05
SunBelt      -6.237114e+05
Pac12         3.165260e+05
Ind          -1.785427e+06
ACC           4.487025e+05
AAC          -4.176068e+05
Big10         6.661372e+05
Big12         1.087409e+06
MtWest       -9.012773e+05
MAC          -7.173237e+05
Capacity      4.161636e+01
WinPer2019    9.854898e+05
GSR           2.082555e+04
dtype: float64
448702.54375769565
********************************************************************************
Syracuse Coach Big East Pay should be  $2,473,051.55

Predicted Syracuse Salary (Big Ten Conference)

Lastly, we predicted what the coach's pay would be if Syracuse joined the Big Ten. In order to get this value, the big10 coefficient was added. The predicted total pay under this new Conference is $3,139,188.77

# Predict Big10 Pay
print(train_confmodel_fit.params)
print(train_confmodel_fit.params[8]) #Coefficiant for Big10
btencoef = train_confmodel_fit.params[8]
print("*"*80)
print("Syracuse Coach Big Ten Pay should be ",(locale.currency( ((syrpred -acccoef) + btencoef ), grouping = True )))

Intercept    -1.591071e+06
SEC           1.175190e+06
CUSA         -8.396895e+05
SunBelt      -6.237114e+05
Pac12         3.165260e+05
Ind          -1.785427e+06
ACC           4.487025e+05
AAC          -4.176068e+05
Big10         6.661372e+05
Big12         1.087409e+06
MtWest       -9.012773e+05
MAC          -7.173237e+05
Capacity      4.161636e+01
WinPer2019    9.854898e+05
GSR           2.082555e+04
dtype: float64
666137.2201025379
********************************************************************************
Syracuse Coach Big Ten Pay should be  $3,139,188.77

Results and Conclusions

The model produced in this experiment was supposed to have a high level of accuracy since it accounts for 83.3% of the variability. We were able to make a recommendation of a Total Pay of $2,921,754.10 for Syracuse University’s football coach based on the model created, which is significantly higher than his current Total Pay. I found it shocking to see that the recommended Big East Pay ($2,473,051.55) is higher than his current Total pay ($2,401,206), based on the fact that the Big East doesn’t even participate in football and is probably lacking the stadium capacity that the other schools have. Especially given the fact that stadium capacity has the largest effect on Total Pay (add $41.60 per seat). The recommended Big 10 Total Pay is $3,139,188.77 which is higher than the ACC pay as expected.

Due to the fact that there were significant differences in some of the conferences, I wonder if there should be a different model for each conference? Given that the conferences are small, there may not be enough data to support these models however we can see by the initial boxplot that there were definitely some differences.

Code

You can check out the full Python code using the following methods:

Github Page: Francisco’s Repository
Google Colab:

Share on

Twitter Facebook LinkedIn

Role

Skills Performed