ASD Screening Classification Project


CLASSIFICATION OF AUTISTIC SPECTRUM DISORDER SCREENING DATA

Autism Spectrum Disorder (ASD) refers to a range of conditions characterized by challenges with social skills, repetitive behaviors, and speech and non-verbal communication. According to the Centers for Disease Control and Prevention, autism affects an estimated 1 in 59 children in the United States. In 2013 the American Psychiatric Association merged four distinct diagnoses, previously referred to as Autistic Disorder, Childhood Disintegrative Disorder, Pervasive Developmental Disorder-Not Otherwise Specified (PDD-NOS), and Asperger Syndrome, into one under the name Autism Spectrum Disorder. For a long time, ASD was associated with a lengthy diagnostic process. With an increase in the number of ASD cases around the world, a faster screening tool was created: the Autism Spectrum Quotient (AQ), consisting of 50 questions. A condensed version, the AQ-10, was created as a faster self-screening tool to determine an individual's position on the autism spectrum. While the AQ-10 is NOT used for a definitive diagnosis, a score greater than 6/10 is a flag that you should seek a professional diagnosis.

Objective
While the tool is a great first step, there aren't many available datasets associated with clinical screenings and behavior. Per Fadi Fayez Thabtah, creator of the set, most available datasets on autism are genetic in nature. This set consists of the responses to the behavior questions on the AQ-10 tool along with the results. In addition, 10 individual characteristics used by behavioral science experts for ASD detection are part of this set. I intend to use classification to determine the effectiveness of both the AQ-10 tool and the individual traits in diagnosing ASD. I will be using Decision Trees, Naïve Bayes, k-NN, and Support Vector Machines.

Import Libraries & Data

Let's import the R packages we will be using.

#----- Install Packages ----#
	if (!require(C50)) install.packages('C50') #C5.0 Decision Trees and Rule-Based Models
	if (!require(caret)) install.packages('caret') #Classification and Regression Training
	if (!require(countrycode)) install.packages('countrycode') #Convert Country Names and Country Codes
	if (!require(class)) install.packages('class')  #Functions for Classification
	if (!require(dplyr)) install.packages('dplyr') #for Data Manipulation
	if (!require(e1071)) install.packages('e1071') #naive Bayes classifier
	if (!require(farff)) install.packages('farff') #Reads and writes 'ARFF' files
	if (!require(gmodels)) install.packages('gmodels') #for Model Fitting
	if (!require(kernlab)) install.packages('kernlab') #Support Vector Machines
	if (!require(sqldf)) install.packages('sqldf') #Manipulate R Data Frames Using SQL

Next I downloaded and imported the data, which consists of three separate datasets available from the UCI Machine Learning Repository. The sets are divided by age: children (4-11), adolescents (12-17), and adults (18 and over). The number of observations was 292 (children), 104 (adolescents), and 704 (adults), for 1,100 total observations. The attributes were: ASD diagnosis (our class label), age (in years), gender, ethnicity, born with jaundice (Y/N), family member with ASD (Y/N), relation of the person completing the test, country of residence, prior use of a screening application, age description, questions 1-10 of the AQ tool (each separate), and the screening score on the AQ-10. The sets were distributed as ARFF files.

The sets can be downloaded directly via the links below:
ASD Screening Children Data Set
ASD Screening Adolescent Data Set
ASD Screening Adult Data Set
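The import itself can be done with farff::readARFF, since the UCI files ship in ARFF format. The snippet below is a self-contained sketch: it writes and re-reads a temporary ARFF file in place of the real downloads, whose local filenames will vary.

```r
#----- Importing ARFF data (sketch) ----#
# The real project reads the three downloaded UCI .arff files; a
# temporary file stands in here so the snippet runs on its own
library(farff)

tmp <- tempfile(fileext = ".arff")
writeARFF(head(iris), tmp)   # write a small ARFF file
demo <- readARFF(tmp)        # read it back as a data.frame
str(demo)
```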

Data Preparation

A significant amount of preparation was needed before I could analyze the data. Because each of the three datasets (child, adolescent, and adult) was small, my first step was to merge them into one larger set. Several NA values existed in the age column as well as in some of the factor columns (for example, ethnicity). For the age NAs, I used the average age in each of the primary sets (child, adolescent, adult) to replace the NA values before merging. For factor columns I created an "Unknown" value. Several columns also needed to be recoded to numeric values in order to use them in the various algorithms.
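As a sketch, the imputation step described above looks like this on a toy frame (column names and values are illustrative, not the actual dataset):

```r
# Toy example of the NA handling described above (values illustrative)
df <- data.frame(
  age       = c(6, NA, 9, 10),
  ethnicity = factor(c("White-European", NA, "Asian", NA))
)

# Age NAs: replace with the average age of the (sub)set
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Factor NAs: add an explicit "Unknown" level and assign it
levels(df$ethnicity) <- c(levels(df$ethnicity), "Unknown")
df$ethnicity[is.na(df$ethnicity)] <- "Unknown"
```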

Once I had one dataset, I did some additional cleaning. The age column was eliminated since there was already a categorical "age description" column. Since research shows that ASD is not specific to country or ethnicity, I decided to remove those columns as well; country of residence alone had 89 discrete values, so dropping it made the data much easier to work with. The remaining categorical values were converted into binary indicators: I created a separate variable for each column's options and used 1 and 0 to indicate Yes or No. Had I kept the countries and ethnicities, this would have resulted in over 100 additional variables.
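The expansion into 1/0 indicator variables can be sketched like this (the column and level names are illustrative; the project's actual recoding is in the full code at the end):

```r
# Sketch of expanding factors into 0/1 indicator columns (names illustrative)
gender  <- factor(c("m", "f", "f", "m"))
jundice <- factor(c("yes", "no", "no", "yes"))

dummies <- data.frame(
  f        = as.integer(gender == "f"),
  m        = as.integer(gender == "m"),
  JundiceY = as.integer(jundice == "yes"),
  JundiceN = as.integer(jundice == "no")
)
```

model.matrix(~ gender - 1) is a more compact built-in route to the same indicator columns.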

# Sets for editing
	AutChild <- Autism_Child_Data
	AutTeen <- Autism_Adolescent_Data
	AutAdult <- Autism_Adult_Data

# Combining the three sets into one total set
	AutTotal <- rbind(AutChild,AutTeen,AutAdult)
	str(AutTotal)
'data.frame':	1100 obs. of  21 variables:

Questionnaire Data

Create Test and Training Sets

#----- Creating Training and Test Sets FOR ONLY Questionnaire -----#
	Total1 <- AutTotal[,c(1, 8:18)]

# Create sets using caret library
	inTraining <- createDataPartition(Total1$ClassASD, times = 1, p = .66, list = FALSE)
	trainAut <- Total1[ inTraining,]
	trainAut$ClassASD <- droplevels( trainAut)$ClassASD
	testAut  <- Total1[-inTraining,]
	testAut$ClassASD <- droplevels(testAut)$ClassASD

# Predicting factor is a score >6 (results)

Build Model Using C5.0 Algorithm (C50 library)

I created a decision tree using the C50 library's C5.0 algorithm. I first attempted to build a tree based on just the survey responses and final score. My intention was to see if there were any patterns in the answers; however, I was surprised by the results. The resulting decision tree split entirely on the survey score: <=6 meant no expected ASD diagnosis and >6 meant an expected Yes. When I ran the test data through the model, it predicted the diagnosis correctly 100% of the time. This raised a red flag, as we should never expect 100% accuracy. All subsequent classification algorithms also produced 100% prediction accuracy. Since the score deterministically encodes the class, I decided to set aside the questionnaire responses and instead look at all the other variables.
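The leakage is easy to reproduce in miniature: because the class label is itself derived from the score by the >6 rule, any model that recovers that rule scores perfectly (the scores below are made up):

```r
# Toy demonstration of label leakage (scores are made up)
score <- c(2, 7, 5, 9, 3, 8)
label <- ifelse(score > 6, "YES", "NO")   # class is derived from score
pred  <- ifelse(score > 6, "YES", "NO")   # a "model" that learns the rule
accuracy <- mean(pred == label)           # trivially perfect
```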

# building model using C5.0 algorithm (C50 library)
	dt_model <- C5.0(trainAut[-1], trainAut$ClassASD)

# evaluating performance
	pred_model <- predict(dt_model, testAut)
	CrossTable(testAut$ClassASD, pred_model, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual diagnosis', 'predicted diagnosis') )
Call:
C5.0.default(x = trainAut[-1], y = trainAut$ClassASD)

Classification Tree
Number of samples: 727
Number of predictors: 11    
Tree size: 2
Non-standard options: attempt to group attributes

Call:
C5.0.default(x = trainAut[-1], y = trainAut$ClassASD)

C5.0 [Release 2.07 GPL Edition]  	Tue Apr 28 16:49:55 2020
-------------------------------  
Class specified by attribute `outcome'
Read 727 cases (12 attributes) from undefined.data

Decision tree:
result <= 6: NO (467)
result > 6: YES (260)

Evaluation on training data (727 cases):   
	    Decision Tree   
	  ----------------  
	  Size      Errors      
	     2    0( 0.0%)   <<

	   (a)   (b)    <-classified as
	  ----  ----
	   467          (a): class NO
	         260    (b): class YES

	Attribute usage:
	100.00%	result         Time: 0.0 secs

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|  

Total Observations in Table:  373

                 | predicted diagnosis
actual diagnosis |        NO |       YES | Row Total |
-----------------|-----------|-----------|-----------|
              NO |       240 |         0 |       240 |
                 |     0.643 |     0.000 |           |
-----------------|-----------|-----------|-----------|
             YES |         0 |       133 |       133 |
                 |     0.000 |     0.357 |           |
-----------------|-----------|-----------|-----------|
    Column Total |       240 |       133 |       373 |
-----------------|-----------|-----------|-----------|

All Variables Data (Excluding Questionnaire Questions)

I decided to create a new set for testing using all the existing variables in the set, with the exception of the questionnaire responses.

Create Data Set

In order to do this I had to expand the existing categorical data and convert it into binary data. This was a lengthy process and is covered in detail in the code at the end of this project. The end result was the set AutExp2, consisting of 35 columns and 1,100 rows.

Create Test and Training Sets

#----- Creating Training and Test Sets FOR ALL FACTORS (No questionnaire) -----#
	Total2 <- AutExp2[,c(1,19:35)]

# create sets using caret library
	inTraining <- createDataPartition(Total2$ClassASD, times = 1, p = .66, list = FALSE)
	trainAut <- Total2[ inTraining,]
	trainAut$ClassASD <- droplevels( trainAut)$ClassASD
	testAut  <- Total2[-inTraining,]
	testAut$ClassASD <- droplevels(testAut)$ClassASD

Build Model Using C5.0 Algorithm (C50 library)

I created a new model using all the other variables, excluding the survey questions and score. The resulting decision tree was accurate roughly 67% of the time on the training data. On the test set, it correctly predicted the diagnosis 65.15% (243/373) of the time. Below is the resulting decision tree model. As you can see, the attributes that contributed the most were the age description (12-17 age group) and the relationship of the person who filled out the form. The algorithm also has problems correctly predicting "Yes ASD" outcomes: it identified 93.33% (224/240) of "No ASD" outcomes but only 14.29% (19/133) of "Yes ASD" outcomes. Boosting with 10 trials produced only a marginal improvement (training error of 32.5% versus 33.1%, and 248/373, or 66.49%, correct on the test set). For this data, the decision tree is not a suitable classification tool. There may be additional variables, not captured in this data, that would enable better classification via a decision tree.
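As a sanity check, overall and class-wise accuracy can be recomputed directly from the counts in the test-set CrossTable for this model:

```r
# Recompute accuracy and per-class recall from the CrossTable counts
cm <- matrix(c(224,  16,    # actual NO:  predicted NO / YES
               114,  19),   # actual YES: predicted NO / YES
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("NO", "YES"),
                             predicted = c("NO", "YES")))

accuracy   <- sum(diag(cm)) / sum(cm)              # overall
recall_no  <- cm["NO", "NO"] / sum(cm["NO", ])     # "No ASD" recall
recall_yes <- cm["YES", "YES"] / sum(cm["YES", ])  # "Yes ASD" recall
```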

# building model using C5.0 algorithm (C50 library)
	dt_model <- C5.0(trainAut[-1], trainAut$ClassASD)
	summary(dt_model)

# evaluating performance
	pred_model <- predict(dt_model, testAut)
	CrossTable(testAut$ClassASD, pred_model, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual diagnosis', 'predicted diagnosis') )

# improving model performance
	dt_boost10 <- C5.0(trainAut[-1], trainAut$ClassASD, trials= 10)
	dt_boost10 #use summary(dt_boost10) to get the % accuracy and tree
	summary(dt_boost10)

# evaluating boosted performance
	pred_model_boost  <- predict(dt_boost10, testAut)
	CrossTable(testAut$ClassASD, pred_model_boost, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual diagnosis', 'predicted diagnosis') )
Call:
C5.0.default(x = trainAut[-1], y = trainAut$ClassASD)

Classification Tree
Number of samples: 727
Number of predictors: 17
Tree size: 2

Non-standard options: attempt to group attributes

Call:
C5.0.default(x = trainAut[-1], y = trainAut$ClassASD)


C5.0 [Release 2.07 GPL Edition]  	Tue Apr 28 16:49:56 2020
-------------------------------    
Class specified by attribute `outcome'
Read 727 cases (18 attributes) from undefined.data

Decision tree:   
12to17 <= 0: NO (658/216)
12to17 > 0: YES (69/25)


Evaluation on training data (727 cases):   
	    Decision Tree   
	  ----------------  
	  Size      Errors     
	     2  241(33.1%)   <<  

	   (a)   (b)    <-classified as
	  ----  ----
	   442    25    (a): class NO
	   216    44    (b): class YES

	Attribute usage:
	100.00%	12to17       Time: 0.0 secs

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  373


                 | predicted diagnosis
actual diagnosis |        NO |       YES | Row Total |
-----------------|-----------|-----------|-----------|
              NO |       224 |        16 |       240 |
                 |     0.601 |     0.043 |           |
-----------------|-----------|-----------|-----------|
             YES |       114 |        19 |       133 |
                 |     0.306 |     0.051 |           |
-----------------|-----------|-----------|-----------|
    Column Total |       338 |        35 |       373 |
-----------------|-----------|-----------|-----------|


Call:
C5.0.default(x = trainAut[-1], y = trainAut$ClassASD, trials = 10)

Classification Tree
Number of samples: 727
Number of predictors: 17  
Number of boosting iterations: 10
Average tree size: 2.6

Non-standard options: attempt to group attributes


Call:
C5.0.default(x = trainAut[-1], y = trainAut$ClassASD, trials = 10)


C5.0 [Release 2.07 GPL Edition]  	Tue Apr 28 16:49:56 2020
-------------------------------  
Class specified by attribute `outcome'
Read 727 cases (18 attributes) from undefined.data

-----  Trial 0:  -----

Decision tree:

12to17 <= 0: NO (658/216)
12to17 > 0: YES (69/25)

-----  Trial 1:  -----

Decision tree:

RelationUnkown > 0: NO (99.8/22.6)
RelationUnkown <= 0:
:...18Plus <= 0: YES (241.2/103.6)
    18Plus > 0: NO (386.1/149.2)

-----  Trial 2:  -----

Decision tree:

RelationUnkown > 0: NO (95.4/25.8)
RelationUnkown <= 0:
:...FamAutismN <= 0: YES (108.8/46.1)
    FamAutismN > 0: NO (522.8/232)

-----  Trial 3:  -----

Decision tree:

RelationUnkown > 0: NO (92.6/27.9)
RelationUnkown <= 0:
:...JundiceN <= 0: YES (98.4/42.8)
    JundiceN > 0:
    :...f <= 0: NO (299.3/129.8)
        f > 0: YES (236.7/109.9)

-----  Trial 4:  -----

Decision tree:

RelationParent <= 0: NO (509.8/221.6)
RelationParent > 0: YES (217.2/99.8)

-----  Trial 5:  -----

Decision tree:

RelationUnkown > 0: NO (88.4/30.9)
RelationUnkown <= 0:
:...12to17 <= 0: NO (570/275.7)
    12to17 > 0: YES (68.6/29.5)

-----  Trial 6:  -----

Decision tree:

RelationUnkown <= 0: YES (653.7/315.5)
RelationUnkown > 0: NO (55.3)

-----  Trial 7:  -----

Decision tree:

RelationUnkown > 0: NO (52.6)
RelationUnkown <= 0:
:...JundiceN <= 0: YES (101.1/47)
    JundiceN > 0: NO (554.4/267.3)

-----  Trial 8:  -----

Decision tree:

RelationUnkown > 0: NO (49.8)
RelationUnkown <= 0:
:...18Plus <= 0: YES (291.3/113.5)
    18Plus > 0: NO (268.9/61.1)

-----  Trial 9:  -----

Decision tree:
 NO (559/173.2)


Evaluation on training data (727 cases):

Trial	    Decision Tree   
-----	  ----------------  
	  Size      Errors  

   0	     2  241(33.1%)
   1	     3  246(33.8%)
   2	     3  260(35.8%)
   3	     4  297(40.9%)
   4	     2  260(35.8%)
   5	     3  238(32.7%)
   6	     2  398(54.7%)
   7	     3  260(35.8%)
   8	     3  246(33.8%)
   9	     1  260(35.8%)
boost	        236(32.5%)   <<


	   (a)   (b)    <-classified as
	  ----  ----
	   425    42    (a): class NO
	   194    66    (b): class YES


	Attribute usage:

	100.00%	RelationParent
	100.00%	RelationUnkown
	100.00%	12to17
	 85.56%	JundiceN
	 85.56%	FamAutismN
	 85.56%	18Plus
	 73.45%	f

Time: 0.0 secs



   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  373


                 | predicted diagnosis
actual diagnosis |        NO |       YES | Row Total |
-----------------|-----------|-----------|-----------|
              NO |       219 |        21 |       240 |
                 |     0.587 |     0.056 |           |
-----------------|-----------|-----------|-----------|
             YES |       104 |        29 |       133 |
                 |     0.279 |     0.078 |           |
-----------------|-----------|-----------|-----------|
    Column Total |       323 |        50 |       373 |
-----------------|-----------|-----------|-----------|


Build Naive Bayes Model

Next, I applied the Naïve Bayes algorithm via the e1071 package. The Naïve Bayes model performed significantly better than the decision tree: since the classifier does not depend on interactions between predictors, it did a better job of classifying this data. I also ran a second model with a Laplace value of 3 to smooth away zero-count feature probabilities.
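The Laplace estimator matters because a feature level never observed with a class in training otherwise gets probability zero and vetoes that class outright. A toy calculation, using the training-set YES count of 260 and an illustrative zero cell:

```r
# Why laplace > 0 helps: the zero-frequency problem (toy numbers)
yes_count <- 0     # illustrative: a level never seen with class YES
total_yes <- 260   # YES cases in the training set
laplace   <- 3
n_levels  <- 2     # a binary indicator has two levels

p_unsmoothed <- yes_count / total_yes   # 0, which vetoes the class
p_smoothed   <- (yes_count + laplace) / (total_yes + laplace * n_levels)
```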

#----- Use Naïve Bayes Model -----#
# build the classifier (note: the full trainAut frame, including the
# ClassASD column, is passed as the predictor set here)
	Autclassifier <- naiveBayes(trainAut, trainAut$ClassASD)
	Autclassifier

# evaluate model classifier
	Auttestpredicter <- predict(Autclassifier, testAut)

# compare predictions to true values
	CrossTable(testAut$ClassASD, Auttestpredicter, prop.chisq = FALSE, prop.t = FALSE, dnn = c('actual diagnosis', 'predicted diagnosis') )

# create confusion matrix
	cMatrix <- table(Auttestpredicter, testAut$ClassASD)
	plot(cMatrix)
	confusionMatrix(cMatrix)
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = trainAut, y = trainAut$ClassASD)

A-priori probabilities:
trainAut$ClassASD
       NO       YES
0.6423659 0.3576341

Conditional probabilities:
                 ClassASD
trainAut$ClassASD NO YES
              NO   1   0
              YES  0   1

                 f
trainAut$ClassASD      [,1]      [,2]
              NO  0.4304069 0.4956641
              YES 0.4692308 0.5000148

                 m
trainAut$ClassASD      [,1]      [,2]
              NO  0.5695931 0.4956641
              YES 0.5307692 0.5000148

                 JundiceN
trainAut$ClassASD      [,1]      [,2]
              NO  0.8715203 0.3349820
              YES 0.8076923 0.3948736

                 JundiceY
trainAut$ClassASD      [,1]      [,2]
              NO  0.1284797 0.3349820
              YES 0.1923077 0.3948736

                 FamAutismN
trainAut$ClassASD      [,1]      [,2]
              NO  0.8822270 0.3226848
              YES 0.8038462 0.3978521

                 FamAutismY
trainAut$ClassASD      [,1]      [,2]
              NO  0.1177730 0.3226848
              YES 0.1961538 0.3978521

                 RelationHealthcarePro
trainAut$ClassASD       [,1]      [,2]
              NO  0.01070664 0.1030278
              YES 0.01923077 0.1376000

                 RelationOther
trainAut$ClassASD        [,1]       [,2]
              NO  0.004282655 0.06537174
              YES 0.007692308 0.08753632

                 RelationParent
trainAut$ClassASD      [,1]      [,2]
              NO  0.2141328 0.4106593
              YES 0.3846154 0.4874425

                 RelationRelative
trainAut$ClassASD       [,1]      [,2]
              NO  0.04710921 0.2120996
              YES 0.05384615 0.2261492

                 RelationSelf
trainAut$ClassASD      [,1]      [,2]
              NO  0.5374732 0.4991285
              YES 0.4653846 0.4997623

                 RelationUnkown
trainAut$ClassASD       [,1]      [,2]
              NO  0.18629550 0.3897625
              YES 0.06923077 0.2543357

                 UsedAppN
trainAut$ClassASD      [,1]      [,2]
              NO  0.9785867 0.1449128
              YES 0.9769231 0.1504374

                 UsedAppY
trainAut$ClassASD       [,1]      [,2]
              NO  0.02141328 0.1449128
              YES 0.02307692 0.1504374

                 12to17
trainAut$ClassASD       [,1]      [,2]
              NO  0.05353319 0.2253356
              YES 0.16923077 0.3756788

                 18Plus
trainAut$ClassASD      [,1]      [,2]
              NO  0.7194861 0.4497321
              YES 0.4807692 0.5005936

                 4to11
trainAut$ClassASD      [,1]      [,2]
              NO  0.2269807 0.4193292
              YES 0.3500000 0.4778895





   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|


Total Observations in Table:  373


                 | predicted diagnosis
actual diagnosis |        NO |       YES | Row Total |
-----------------|-----------|-----------|-----------|
              NO |       228 |        12 |       240 |
                 |     0.950 |     0.050 |     0.643 |
                 |     0.991 |     0.084 |           |
-----------------|-----------|-----------|-----------|
             YES |         2 |       131 |       133 |
                 |     0.015 |     0.985 |     0.357 |
                 |     0.009 |     0.916 |           |
-----------------|-----------|-----------|-----------|
    Column Total |       230 |       143 |       373 |
                 |     0.617 |     0.383 |           |
-----------------|-----------|-----------|-----------|


Confusion Matrix and Statistics


Auttestpredicter  NO YES
             NO  228   2
             YES  12 131

               Accuracy : 0.9625          
                 95% CI : (0.9378, 0.9793)
    No Information Rate : 0.6434          
    P-Value [Acc > NIR] : < 2e-16         

                  Kappa : 0.9196          

 Mcnemar's Test P-Value : 0.01616         

            Sensitivity : 0.9500          
            Specificity : 0.9850          
         Pos Pred Value : 0.9913          
         Neg Pred Value : 0.9161          
             Prevalence : 0.6434          
         Detection Rate : 0.6113          
   Detection Prevalence : 0.6166          
      Balanced Accuracy : 0.9675          

       'Positive' Class : NO              

(Figure: mosaic plot of the confusion matrix from plot(cMatrix).)

Improve Model

# build improved model
	Autclassifier2<- naiveBayes(trainAut, trainAut$ClassASD, laplace = 3)
	Autclassifier2

# evaluate improved model
	Auttestpredicter2 <- predict(Autclassifier2, testAut)

# compare improved predictions to true values
	CrossTable(testAut$ClassASD, Auttestpredicter2, prop.chisq = FALSE, prop.t = FALSE, dnn = c('actual diagnosis', 'predicted diagnosis') )
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = trainAut, y = trainAut$ClassASD, laplace = 3)

A-priori probabilities:
trainAut$ClassASD
       NO       YES
0.6423659 0.3576341

Conditional probabilities:
                 ClassASD
trainAut$ClassASD          NO         YES
              NO  0.993657505 0.006342495
              YES 0.011278195 0.988721805

                 f
trainAut$ClassASD      [,1]      [,2]
              NO  0.4304069 0.4956641
              YES 0.4692308 0.5000148

                 m
trainAut$ClassASD      [,1]      [,2]
              NO  0.5695931 0.4956641
              YES 0.5307692 0.5000148

                 JundiceN
trainAut$ClassASD      [,1]      [,2]
              NO  0.8715203 0.3349820
              YES 0.8076923 0.3948736

                 JundiceY
trainAut$ClassASD      [,1]      [,2]
              NO  0.1284797 0.3349820
              YES 0.1923077 0.3948736

                 FamAutismN
trainAut$ClassASD      [,1]      [,2]
              NO  0.8822270 0.3226848
              YES 0.8038462 0.3978521

                 FamAutismY
trainAut$ClassASD      [,1]      [,2]
              NO  0.1177730 0.3226848
              YES 0.1961538 0.3978521

                 RelationHealthcarePro
trainAut$ClassASD       [,1]      [,2]
              NO  0.01070664 0.1030278
              YES 0.01923077 0.1376000

                 RelationOther
trainAut$ClassASD        [,1]       [,2]
              NO  0.004282655 0.06537174
              YES 0.007692308 0.08753632

                 RelationParent
trainAut$ClassASD      [,1]      [,2]
              NO  0.2141328 0.4106593
              YES 0.3846154 0.4874425

                 RelationRelative
trainAut$ClassASD       [,1]      [,2]
              NO  0.04710921 0.2120996
              YES 0.05384615 0.2261492

                 RelationSelf
trainAut$ClassASD      [,1]      [,2]
              NO  0.5374732 0.4991285
              YES 0.4653846 0.4997623

                 RelationUnkown
trainAut$ClassASD       [,1]      [,2]
              NO  0.18629550 0.3897625
              YES 0.06923077 0.2543357

                 UsedAppN
trainAut$ClassASD      [,1]      [,2]
              NO  0.9785867 0.1449128
              YES 0.9769231 0.1504374

                 UsedAppY
trainAut$ClassASD       [,1]      [,2]
              NO  0.02141328 0.1449128
              YES 0.02307692 0.1504374

                 12to17
trainAut$ClassASD       [,1]      [,2]
              NO  0.05353319 0.2253356
              YES 0.16923077 0.3756788

                 18Plus
trainAut$ClassASD      [,1]      [,2]
              NO  0.7194861 0.4497321
              YES 0.4807692 0.5005936

                 4to11
trainAut$ClassASD      [,1]      [,2]
              NO  0.2269807 0.4193292
              YES 0.3500000 0.4778895

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|


Total Observations in Table:  373


                 | predicted diagnosis
actual diagnosis |        NO |       YES | Row Total |
-----------------|-----------|-----------|-----------|
              NO |       223 |        17 |       240 |
                 |     0.929 |     0.071 |     0.643 |
                 |     0.978 |     0.117 |           |
-----------------|-----------|-----------|-----------|
             YES |         5 |       128 |       133 |
                 |     0.038 |     0.962 |     0.357 |
                 |     0.022 |     0.883 |           |
-----------------|-----------|-----------|-----------|
    Column Total |       228 |       145 |       373 |
                 |     0.611 |     0.389 |           |
-----------------|-----------|-----------|-----------|

Build k-NN Model

Next, I used the class package to apply the k-NN classification algorithm. The original unscaled model produced only 64.34% (240/373) accuracy when tested. Like the decision tree model, it over-reports the "No ASD" classification. I attempted to improve the model by rescaling the features using z-score standardization. This improved the overall accuracy slightly, to 69.75% (256/367), but the misclassification of "Yes ASD" outcomes remains: only 9 of 94 were identified correctly.
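z-score standardization rescales each column to mean 0 and standard deviation 1, which keeps any one feature's raw scale from dominating the distance calculation. scale() implements exactly this:

```r
# scale() is equivalent to z = (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.vector(scale(x))
```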

#----- Use k-NN Model-----#
# model training
	trainAutlabel <- trainAut[,1]
	testAutlabel <- testAut[,1]
	knnAutSpread <- knn(train=trainAut[,-1], test=testAut[,-1], cl=trainAutlabel, k=7)
	summary(knnAutSpread)
	CrossTable(x=testAutlabel, y=knnAutSpread, prop.chisq = FALSE)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  373


             | knnAutSpread
testAutlabel |        NO |       YES | Row Total |
-------------|-----------|-----------|-----------|
          NO |       206 |        34 |       240 |
             |     0.858 |     0.142 |     0.643 |
             |     0.675 |     0.500 |           |
             |     0.552 |     0.091 |           |
-------------|-----------|-----------|-----------|
         YES |        99 |        34 |       133 |
             |     0.744 |     0.256 |     0.357 |
             |     0.325 |     0.500 |           |
             |     0.265 |     0.091 |           |
-------------|-----------|-----------|-----------|
Column Total |       305 |        68 |       373 |
             |     0.818 |     0.182 |           |
-------------|-----------|-----------|-----------|

Improve Performance

# improved performance using z-score standardization
# new model training
	Total2z <- as.data.frame(scale(Total2[,-1]))
	Total2z_train <- Total2z[1:733,]
	Total2z_test <- Total2z[734:1100,]
	Total2z_train_label <- Total2[1:733,1]
	Total2z_test_label <- Total2[734:1100,1]

# new performance
	knnAutPrediction <- knn(train=Total2z_train, Total2z_test, cl=Total2z_train_label, k=11)
	summary(knnAutPrediction)
	CrossTable(x=Total2z_test_label, y=knnAutPrediction, prop.chisq = FALSE)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  367


                   | knnAutPrediction
Total2z_test_label |        NO |       YES | Row Total |
-------------------|-----------|-----------|-----------|
                NO |       247 |        26 |       273 |
                   |     0.905 |     0.095 |     0.744 |
                   |     0.744 |     0.743 |           |
                   |     0.673 |     0.071 |           |
-------------------|-----------|-----------|-----------|
               YES |        85 |         9 |        94 |
                   |     0.904 |     0.096 |     0.256 |
                   |     0.256 |     0.257 |           |
                   |     0.232 |     0.025 |           |
-------------------|-----------|-----------|-----------|
      Column Total |       332 |        35 |       367 |
                   |     0.905 |     0.095 |           |
-------------------|-----------|-----------|-----------|
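The improvement step relies on base R’s scale(), which z-score standardizes each column so that no single feature dominates k-NN’s distance calculation. A standalone sketch on made-up numbers:

```r
# scale() centers each column to mean 0 and rescales it to sd 1:
# z = (x - mean(x)) / sd(x)
x <- c(10, 20, 30, 40, 50)
z <- as.vector(scale(x))
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```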

Build SVM Model

Lastly, I attempted to classify using the SVM algorithm from the kernlab package. When running the model with the vanilladot (basic linear) kernel, the model predicted the outcome with 65.15% accuracy (243 of 373 correct). I then tried the rbfdot (radial basis/Gaussian) kernel, which performed nearly the same (65.95%); both kernels still heavily over-predict the “No-ASD” class.
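The two kernels being compared can be written out as plain R functions to make the difference concrete. This is only an illustration of the underlying math, not kernlab’s internals; the sigma default here is the hyperparameter value reported in the rbfdot model output further down.

```r
# vanilladot: linear kernel, k(x, y) = <x, y>
vanilla_k <- function(x, y) sum(x * y)
# rbfdot: Gaussian kernel, k(x, y) = exp(-sigma * ||x - y||^2)
rbf_k <- function(x, y, sigma = 0.049) exp(-sigma * sum((x - y)^2))

x <- c(1, 0, 1); y <- c(1, 1, 0)
vanilla_k(x, y)  # 1: only the first component overlaps
rbf_k(x, x)      # 1: identical points have maximal similarity
rbf_k(x, y)      # less than 1, decaying with squared distance
```

The linear kernel can only draw flat decision boundaries; the Gaussian kernel measures local similarity, which lets the SVM fit curved boundaries at the cost of an extra hyperparameter (sigma).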

#----- Use SVM Model -----#
# build the model classifier
	ksvmmodelclassifier <- ksvm(ClassASD ~ ., data=trainAut, kernel ="vanilladot")
	ksvmmodelclassifier

# evaluate performance
	ksvmmodelpredictions <- predict(ksvmmodelclassifier, testAut)
	CrossTable(testAut$ClassASD, ksvmmodelpredictions, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual number', 'predicted number'))
	# model predicted 243 / 373 (65.15%)
 Setting default kernel parameters  

Support Vector Machine object of class "ksvm"

SV type: C-svc  (classification)
 parameter : cost C = 1

Linear (vanilla) kernel function.

Number of Support Vectors : 505

Objective Function Value : -482.1433
Training error : 0.331499

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  373


              | predicted number
actual number |        NO |       YES | Row Total |
--------------|-----------|-----------|-----------|
           NO |       224 |        16 |       240 |
              |     0.601 |     0.043 |           |
--------------|-----------|-----------|-----------|
          YES |       114 |        19 |       133 |
              |     0.306 |     0.051 |           |
--------------|-----------|-----------|-----------|
 Column Total |       338 |        35 |       373 |
--------------|-----------|-----------|-----------|

Improve Model

# improved performance
	ksvmmodelclassifierrbf <- ksvm(ClassASD ~ ., data=trainAut, kernel ="rbfdot")
	ksvmmodelclassifierrbf
	ksvmmodelpredictionsrbf <- predict(ksvmmodelclassifierrbf, testAut)
	CrossTable(testAut$ClassASD, ksvmmodelpredictionsrbf, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual number', 'predicted number'))

# model predicted 246 / 373 (65.95%)
Support Vector Machine object of class "ksvm"

SV type: C-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma =  0.0490338963559971

Number of Support Vectors : 506

Objective Function Value : -466.0458
Training error : 0.290234

   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  373


              | predicted number
actual number |        NO |       YES | Row Total |
--------------|-----------|-----------|-----------|
           NO |       210 |        30 |       240 |
              |     0.563 |     0.080 |           |
--------------|-----------|-----------|-----------|
          YES |        97 |        36 |       133 |
              |     0.260 |     0.097 |           |
--------------|-----------|-----------|-----------|
 Column Total |       307 |        66 |       373 |
--------------|-----------|-----------|-----------|

Results

I found the results very odd. When the AQ-10 responses and score were taken into account, every algorithm was able to classify the ASD outcome correctly 100% of the time. When only the behavioral factors were used, there was a steep decline in accuracy for every algorithm except Naïve Bayes.


Conclusion

So how is this possible? Is the AQ-10 really that good a predictor? Are there more factors at play than the variables in this data set? I looked up a few research articles to see what information was available regarding the AQ tool. A study in the Journal of Autism and Developmental Disorders concluded that there is little difference between the results of the AQ-10 and the AQ-50, and that the AQ-10 can be a potentially useful screening tool (Booth et al., 2013). A later study in Psychological Medicine concluded that not only were AQ scores not a significant predictor of an ASD diagnosis, but that 64% of those who scored below the cut-off and had a “No-ASD” outcome were actually false negatives who were later diagnosed with ASD (Ashwood et al., 2016). This led me to look at the provided data set more closely. Upon further inspection, I noticed that the data was biased: the class label was a deterministic function of the AQ-10 score, with every individual at or above the cut-off labeled “Yes” and every individual below it labeled “No”. There were no signs of any false positives or false negatives. This explains the 100% accurate predictions when using the AQ-10 responses and score.
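This kind of label leakage is easy to demonstrate: if the class label is a deterministic function of the AQ-10 score, any classifier that sees the score can reproduce the label perfectly. A toy sketch with hypothetical scores (not the actual dataset):

```r
# Hypothetical AQ-10 scores; the label is derived from the score with
# no noise, mirroring what this dataset turned out to contain.
set.seed(42)
score <- sample(0:10, 200, replace = TRUE)
label <- ifelse(score >= 6, "YES", "NO")   # label is a rule on the score
pred  <- ifelse(score >= 6, "YES", "NO")   # any model can learn that rule
mean(pred == label)                        # 1: perfect "accuracy"
```

The 100% accuracy says nothing about detecting ASD; it only shows the models recovered the scoring rule that generated the labels.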

With regards to the other factors, I believe there are two issues at play. First, more data is needed, as there isn’t enough variance in the results. Second, since ASD is still being studied, I believe there are several other variables at play that aren’t included in this data set and may prove to be better predictors for classifying ASD. More data should be collected, including an expansion of the variables being recorded.

References

Ashwood, K. L., Gillan, N., Horder, J., Hayward, H., Woodhouse, E., McEwen, F. S., . . . Murphy, D. G. (2016). Predicting the diagnosis of autism in adults using the autism-spectrum quotient (AQ) questionnaire. Psychological Medicine, 46(12), 2595-2604. doi:10.1017/S0033291716001082

Booth, T., Murray, A. L., McKenzie, K., Kuenssberg, R., O’Donnell, M., & Burnett, H. (2013). Brief report: An evaluation of the AQ-10 as a brief screening instrument for ASD in adults. Journal of Autism and Developmental Disorders, 43(12), 2997-3000. doi:10.1007/s10803-013-1844-5

Code

You can check out the full R code using the following methods:

  1. Github Page: Francisco’s Repository
  2. Google Colab: Open In Colab