Tuesday, April 14, 2015

Beginning R: selected multivariate statistics, part 1 of 3 (CCA)

Here's some more code from the University of Manitoba class in quantitative methods that I'm co-teaching with my post-doc mentor, Dr. Nicola Koper.  The class is now almost over, but I still have a few more code posts from it.  I gave the lecture for this class on multivariate statistics as well. 


#Start by setting your workspace and bringing in your data,
#and checking to make sure the variables are imported correctly
#with numbers as numeric, categories as factors, etc.
######################################################################################
setwd("C:\\Users\\YourUserName\\")
#Your directory here.
#Or use the menu...
#Session>Set Working Directory>To Source File Location [or you can choose the directory]

#Import your data.  (This has a few more variables but is otherwise the same as in previous posts.)
titmouse<- read.table(file="blog_banding_data_expanded.csv", header=TRUE, sep=",")
str(titmouse)
#shows the data types; always do this check.
titmouse$species<-titmouse$indexsum # create a new column with indexsum data copied.
#Create a categorical species variable.
titmouse$species<-sub(0,"Tufted",titmouse$species)
titmouse$species<-sub(6,"Black-crested",titmouse$species)
titmouse[titmouse$species>0&titmouse$species<6,]$species = "hybrid"
titmouse$species<-as.factor(titmouse$species)

#Our strategy for missing data will be to just eliminate NAs.
#so create a new dataframe with all the variables that we'll use from titmouse.
titmouse.na.omit<-data.frame(na.omit(titmouse[,c("wingchordr",
                                                 "taillength",
                                                 "mass",
                                                 "crestlength",
                                                 "billlength",
                                                 "billdepth",
                                                 "billwidth",
                                                 "indexsum",
                                                 "species",
                                                 "state",
                                                 "zone")]))


#Over in Environment we see that we have 87 rows of 11 variables.
#We could probably use up to 8 variables in our multivariate analysis.
######################################
#######Checking for assumptions#######
######################################

##Multivariate normality, collinearity, and homoscedasticity
#using scatterplots
#####
#scatterplots and density for univariate normality
pairs(~mass+wingchordr+taillength,
      data=titmouse.na.omit)

#another method that's a bit fancier.
library(car)
scatterplotMatrix(~mass+wingchordr+taillength|species,
#weird formula here, nothing as "response"/Y, and |indicates grouping variable.
                  data=titmouse.na.omit,
                  smoother=FALSE,
#default is TRUE, draws a nonparametric smoother.
                  reg.line=FALSE, #default is TRUE, draws a regression line.
                  diagonal="density", #puts a density kernel line on the diagonal for each group.
                  ellipse=TRUE,
                  levels=c(0.5,0.95),
#dataEllipse quartiles
                  pch=c(2,3,1), #rearrange the default points a bit
                  col=c(1,1,1)) #changed the colors to all black.

#All are oval-shaped data clouds with no huge skewness and no obvious increases in variance
#Do without groups as well.
scatterplotMatrix(~mass+wingchordr+taillength, #weird formula here, nothing as "response"/Y, and |indicates grouping variable.
                  data=titmouse.na.omit,
                  smoother=FALSE,
#default is TRUE, draws a nonparametric smoother.
                  reg.line=FALSE, #default is TRUE, draws a regression line.
                  diagonal="density", #puts a density kernel line on the diagonal for each variable.
                  # ellipse=TRUE,
                  # levels=c(0.5,0.95), #dataEllipse quartiles
                  pch=c(2,3,1), #rearrange the default points a bit
                  col=c(1,1,1)) #changed the colors to all black.



#Mahalanbois distances
#####
#Code adapted from various sources (and more information available at those links).

md.tcwm<-mahalanobis(titmouse.na.omit[,c("taillength","crestlength","wingchordr","mass")],
            colMeans(titmouse.na.omit[,c("taillength","crestlength","wingchordr","mass")]),
            cov(titmouse.na.omit[,c("taillength","crestlength","wingchordr","mass")]))

#pick out the variables we'll be using in some of the later analyses.
#You should do this for each analysis.

plot(density(md.tcwm),
     main="Squared Mahalanobis distances")
rug(md.tcwm)

#Next we'll get critical values, which are from the chi-squared distribution.
k<-4 #number of variables
critical<-qchisq(0.999, df = 4) #alpha=0.001 for this test (Tabachnick and Fidell 2007).

chi2<-qchisq(ppoints(length(titmouse.na.omit$taillength)), df = k)
#Generates the theoretical chi-squared values spaced along like the actual points.
#Do a qqplot.

qqplot.data<-qqplot(chi2,
       md.tcwm,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 *
                           " vs. quantiles of" * ~ chi[4]^2), 
#Expression allows the fancy subscripts and greek letters.
     
       ylab=expression("Mahalanobis " * ~D^2),
       xlab=expression("Theoretical " * ~ chi[4]^2 * " quantiles"))
abline(0, 1, col = 'gray')
abline(h=critical, col="black")
text(x=4,
     y=critical+1,
     cex=0.5,
     labels=c("Critical value: outliers above this line"))

md.tcwm.outliers<-md.tcwm>critical
#generate true/false column for multivariate outlier status.
titmouse.na.omit$md.tcwm.outliers<-md.tcwm.outliers #add column to dataframe
titmouse.na.omit[titmouse.na.omit$md.tcwm.outliers==TRUE,] #bring up the outliers
hist(titmouse.na.omit$wingchordr) #Hmm, no huge gap between the small wing measurement and the rest of the data.  Probably just a runt?

######################################
#Canonical correlation analysis (CCA)#
#with bonus extra exciting thrills: learn more about using functions and lists!#
###########################################################################
#The CCA code is only slightly modified from a previous post on writing functions and using lists,
#but I'm including it again here so you can have an example with a real dataset
#instead of a randomly generated one.

#First test out canonical correlation.
vars<-as.matrix(cbind(titmouse.na.omit[,c("taillength","crestlength")]))
withs<-as.matrix(cbind(titmouse.na.omit[,c("mass","wingchordr")]))
results<-cancor(vars, withs)
results #view the results

#Okay, we get some results.
#But what about when we want to run this with different variables?
#It might be easier, if we're going to need to do the same set of commands
#repeatedly, to make a function.  Functions automate a process.

#For example, let's say we always want to add two variables (a and b).
#Every time, we could just do this:

a<-1
b<-2

c<-a+b

#But what about next time when we need a=2 and b=5?
#We'd have to re-set the variables or make new ones.
#It'd be easier to create an adding function.

adding<-function(a,b){  #name the function ('adding')
  #function with the variables you can input in parentheses (here a and b)
  #followed by a curly bracket.
  a+b #This tells you what the function will do with your input variables.
} #Close the curly brackets.

adding(a=2,b=5)
adding(a=3,b=2)
#This is a very simple example, but the concept holds with added complexity.
#Take a process and automate it.  Now we'll automate our CCA code from above.

canonical.corr.simple<-function (var, with, dataframe) {
  #function name and the variables we can input.
  #The next four lines are the same thing we did earlier to get CCA results,
  #but with the standardized variable names to match what's in the parentheses after function ().
  var.original<-as.matrix(cbind(dataframe[,var]))
  with.original<-as.matrix(cbind(dataframe[,with]))
  results.cancor<-cancor(var.original,with.original)
  results.cancor
}
#Now you can run it using input variables's names.

var.variables<-c("taillength", "crestlength")
with.variables<-c("mass", "wingchordr")

function.results<-canonical.corr.simple(var=var.variables,
                                        with=with.variables,
                                        dataframe=titmouse.na.omit)
function.results

function.results$cor #You can access different pieces of the function results this way.
function.results$cor[1]
#Use this list notation to get very specific pieces of the results.

#This is useful because now you can run the same function
#more than once using different input objects.

#However, you'll notice that if you try to access any objects
#internal to the function, you get an error.
#This is annoying if we want to keep our results,
#especially once we make a more complicated function where we need results
#that are not in the final printout.

var.original

#Error: object 'var.original' not found

#This means that the internal objects haven't been created in our workspace.
#To access them for later analysis, you'll need to return these values as a list.

canonical.corr.listreturn<-function (var, with, dataframe) {
  var.original<-as.matrix(cbind(dataframe[,var]))
  with.original<-as.matrix(cbind(dataframe[,with]))
  results.cancor<-cancor(var.original,with.original)
  #Create the list using the list function with your internal function objects.
  results.list<-list(results.cancor,
                     var.original,
                     with.original)
  #Use return to get it out of the function into the environment.
  return(results.list)
}


#Test out the function!
results.yeah<-canonical.corr.listreturn(var=var.variables,
                                        with=with.variables,
                                        dataframe=titmouse.na.omit)
#If you do not assign your results to an object,
#it will just print them and you can't access the list.
#Sometimes I forget this and try to ask for "results.list"
#in the function, but that won't work if you
#don't assign your results to an object like I did here.
#So then type out your object name.  It is now a list with three main elements.
results.yeah

#Yep.  Lots of info.  But how to we get individual pieces of the list?
#First it helps to know the structure of the list.
str(results.yeah)

#So, how do we get stuff out of this structure?

results.yeah[1]
#This is the whole first element.
#You can see its structure, too:
str(results.yeah[1])
#It has five elements, and more within each of those.
#To go farther into the list, use the square brackets.
#You need two around each except for the last element.
results.yeah[[1]][1]
#We got $cor!  Woohoo!  To get only the first $cor, try this:
results.yeah[[1]][[1]][1]
#Now just test out which pieces of the list you want using the data from str().
results.yeah[[1]][2]
#For example, the second element at this level is $xcoef.
#Sometimes I have to experiment with it a little to find what I want, but just remember
#it's got a structure, so you will be able to find what you need.

#Once you've messed around with that a bit to see
#how you can get to the various pieces you want,
#let's move on to making a more complete CCA function to use.
#This one will have spots for transformed and untransformed variables
#in case you need to transform any of your data and
#will output additional statistics to include in your work.

titmouse.na.omit.transforms<-titmouse.na.omit
titmouse.na.omit.transforms$lntaillength<-log(titmouse.na.omit$taillength)

#Remember, R uses the natural log.
#Now we have example data with one transformed variable.  (It didn't need it, just an example here.)

#Select the columns you want.
var.variables<-c("lntaillength","crestlength") #note I've selected one transformed variable.
with.variables<-c("mass","wingchordr")
#These are your variables.  You are correlating the with and var variate
#for the with and var variables.  That is, the two var variables will be combined into a single
#variate.  The same thing is done to the with variables (they are combined into a with variate
#that extracts as much variance as possible from the variables).

#Select columns for untransformed variables.
var.variables.untransformed<-c("taillength","crestlength")
with.variables.untransformed<-c("mass","wingchordr")
#note I've selected both untransformed variables.
#You need to do this so that you can get loadings of the variables on the variates
#for interpretation of the variates.

#Select any subsetting.  Here I will use data from either place1 or place2
#(all the data), but you can use this to select data from particular sites or groups.
subsetting.all<-c(titmouse.na.omit.transforms$zone=='old'|titmouse.na.omit.transforms$zone=='young')
#That vertical line is "or". 

#Make some variable names for when we get to graphing just to keep track.
graphing.variable.names<-c("var-variate= size proxies", "with-variate= feathers!")


#Make sure you have downloaded and installed the packages 'yacca' and 'CCP'
#if you don't already have them.
canonical.corr<- function (var,
                           with,
                           var.untransformed,
                           with.untransformed,
                           variablenames,
                           dataframe,
                           subsetting) {
 
 
  var.original<-as.matrix(cbind(dataframe[subsetting,var]))  #subsetting selects rows
  with.original<-as.matrix(cbind(dataframe[subsetting,with]))
  results.cancor<-cancor(var.original,with.original)
  rho<-cancor(var.original,with.original)$cor
  N<-dim(var.original)[1]
  p<-dim(var.original)[2]
  q<-dim(with.original)[2]
  #These four lines are needed for the below two, to get a Wilks' lambda.
  library(CCP)
  cancorresults<-p.asym(rho,N,p,q,tstat="Wilks")
 
  library(yacca)
  results.yacca<-cca(var.original,with.original)
  #easiest version of cancor!  gives correlations.
  #Use CCP library for additional test stats if needed, because cca() will only give Rao's
  #(related to and same value as Wilks).
 
  #To see structural correlations (loadings) for how X and Y variables change with variates,
  #get redundancy.  Interpret loadings that are |r| equal to or greater than 0.33.
  #If input variables are transformed, then need to do a Pearson's correlation on original variable
  #and appropriate canonical variate.
  var_expl<-cbind('x.redundancy'=results.yacca$xvrd, 'y.redundancy'=results.yacca$yvrd)
  #Canonical redundancies for x variables
  #(i.e., total fraction of x variance accounted for by y variables, through each canonical variate)
  #yvrd = Canonical redundancies for y variables (i.e., total fraction of y variance accounted for by x variables, through each canonical variate).
  #y is with, so y.redundancy is amount of with variance accounted for by var variance.
  #x is var, so x.redundancy is amount of var variance accounted for by with variance.
  var_expl #prints
 
  cancor.variates<-cbind('canonical.correlation'=results.yacca$corr,
                         #report this
                         'overlappingvariance.corrsq'=results.yacca$corrsq,
                         #canonical correlation with THIS % overlapping variance
                         'variance.in.x.from.x.cc'=results.yacca$xcanvad,
                         #amount of x variate accounted for by x variables
                         'variance.in.y.from.y.cc'=results.yacca$ycanvad
                         # amount of y variate accounted for by y variables
  )
 
 
  CC1x<-as.matrix(results.yacca$canvarx)
  CC1x
  CC1y<-as.matrix(results.yacca$canvary)
  with.loadings<-cor(CC1y,dataframe[subsetting,with.untransformed],method="pearson")
  # this gives feather length (y, with) loadings for untransformed data.
  #Used in interpreting graph because of transformation.
  var.loadings<-cor(CC1x,dataframe[subsetting,var.untransformed],method="pearson")
  # this gives size proxy (x, var) loadings for untransformed data.
  #is same as loadings in the original redundancy analysis because there was no transformation.
 
 
  Yplot<-data.frame(CC1x[,1])
  Xplot<-data.frame(CC1y[,1])
 
  graphing<-cbind(Xplot, Yplot) #This gives you data you can plot.
  colnames(graphing)<-variablenames #This gives you names of your variables.
  results.list<-list(cancorresults,
                     var_expl,
                     var.loadings,
                     with.loadings,
                     graphing,
                     cancor.variates)
 
  return(results.list) #Return gets this out of your function and creates the list.
}

#Now use that complicated function to put in your data.

long.results<-canonical.corr(var=var.variables,
                             with=with.variables,
                             var.untransformed=var.variables.untransformed,
                             with.untransformed=with.variables.untransformed,
                             variablenames=graphing.variable.names,
                             dataframe=titmouse.na.omit.transforms,
                             subsetting=subsetting.all)
long.results #look at our results here.  You can see the range of results we extracted.
str(long.results) #and the structure.


#First, is it significant?
long.results[[1]]$p.value[1] #This gets the first variate. Yup.
long.results[[1]]$p.value[2] #gets the second. Nope.
long.results[[1]]$p.value #gets all the variates' p-values.
long.results[[1]] #shows you all the degrees of freedom and Wilks' lambda test statistic that you need to report.

long.results[[2]] #shows the redundancy analysis.  How much of X is explained by y?
#This is what I commented in the function, but I'll repeat here:
#total fraction of y or x variance accounted for by x or y variables,
#through each canonical variate).
#y is with, so y.redundancy is amount of with variance accounted for by var variance.
#x is var, so x.redundancy is amount of var variance accounted for by with variance.

#Loadings for each variate.  Interpret loadings higher than |0.33| for significant variates.
long.results[[3]] #Loadings for var variables on each canonical variate (CV1 and CV2 in this case)
long.results[[4]] #Loadings for with variables on each canonical variate (CV1 and CV2 in this case)

#How about graphing?  Here's a very simple plot.  It uses the variable names we made earlier
#and plots CCI of each variate.
plot(long.results[[5]],
     pch=21, bg="black")
#It's better to plot with labels describing what each axis says (use the loadings),
#so modify your x- and y-axis labels.
plot(long.results[[5]],
     xlab="Wing chord (-0.99) and mass (-0.42) decrease",
     ylab="Tail length (-0.98) decreases")

long.results[[6]] #Finally some more interpretative stats.
#The first two are the canonical correlation
#and overlapping variance (correlation squared=the eigenvalue)).
#The last two are the variance explained by the variables within its own variate.
#variance.in.x.from.x.cc is how much of the "var" variables (x) is explained by its variate.
#Same for y ("with") variance.in.y.from.y.cc.
#Only CV1 was significant from the Wilks' lambda test so we do not interpret CV2.
#Body size and feather length variates were significantly correlated
#(Wilks' lambda=0.45, n=87, p<0.001; Canonical correlation=0.74, 55.3% overlapping variance).
#Loadings show that tail length (-0.98) decreases
#as wing chord (-0.99) and mass (-0.42) decrease.
#(This seems a counterintuitive way to label the axes but it's an arbitrary orientation.)
#The crest length loading is below the |0.33| cutoff at 0.28 so we do not interpret it.
#Wing chord and mass together explain 32.2% of the feather length variation;
#Feather length variation explains 28.5% of body size variation
#(but you'd assume that's probably not a causal relationship so it might not be appropriate to report).
#Wing chord and mass explain 51.6% of the variance in the body size variate,
#and tail length and crest length explain 58.2% of the variance in the feather length variate.

#Use this last function (canonical.corr) one to get the complete analysis.  The previous functions
#were to demonstrate how to develop a function.
#If you decide you need additional statistics for your CCA,
#you can read the help files for the CCA functions within it
#if you need to extract additional statistics for your particular analysis;
#just write them into the function and add them to the results list that is returned.

#That's all for CCA; next time we'll do discriminant function analysis (DFA) and multivariate ANOVA (MANOVA).

No comments:

Post a Comment

Comments and suggestions welcome.