In financial settings, this statistical method can be applied to understand whether two stocks or portfolios behave similarly in terms of returns and risks. However, financial data are often fat-tailed and highly autocorrelated, as well as highly correlated with the other stocks being tested against, making the usual normal IID assumption hard to satisfy. To address these pitfalls, this report uses randomized paired methods and differencing to relax the normality assumption and account for the correlation. Overall, these methods are more flexible and robust.
          In business settings, this statistical procedure is often applied to understand whether two groups of potential customers behave similarly, as in A/B testing. With this information, marketing managers can decide why and how to tailor different marketing campaigns to better cater to potential customers. However, it is often not obvious which test to use when confronted with the various constraints posed by the data at hand, such as non-normal distributions, high dimensions, and small sample sizes. Therefore, this report proposes and tunes several tests and provides guidelines for each constraint.
          We consider four test statistics and use permutation tests to obtain the significance level of each resulting statistic. In this way, we do not have to resort to normality or other specific assumptions and their resulting sampling distributions. During the tuning phase, the Type 1 and Type 2 errors are each averaged over 100 decisions, and each decision is based on 500 permutations. In the application, 10,000 permutation replicates are used to reach a decision.
          The efficacy of a statistical test depends on many factors; listed below are the factors considered in this report. After understanding the advantages and disadvantages of the proposed tests, we apply them to the prostate cancer dataset. We consider four variables and are interested in whether there is an overall difference between people older than 65 and those younger than 65.
          We discover that the EBT and HTT tests are more powerful than the others in this particular setting. We therefore conclude that there is indeed an overall difference, even though NNT and GST point in a different direction.
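The permutation recipe shared by all four tests below can be sketched in a few lines: compute the statistic on the observed labeling, recompute it under random relabelings, and take as the p-value the proportion of the B + 1 statistics (observed one included) at least as extreme as the observed one. The statistic here is a hypothetical univariate difference of means, standing in only as a placeholder for NNT, EBT, HTT, or GST:

```r
# Toy two-sample permutation test; perm.stat and the simulated data
# are illustrative only, not part of the report's four tests.
set.seed(1)
x <- rnorm(30)            # group 1
y <- rnorm(30, mean = 1)  # group 2, location-shifted
z <- c(x, y)
perm.stat <- function(z, ix, n1) {
  z <- z[ix]                               # relabel by permuting indices
  abs(mean(z[1:n1]) - mean(z[-(1:n1)]))    # |difference of group means|
}
T0 <- perm.stat(z, seq_along(z), 30)       # observed statistic
Tperm <- replicate(500, perm.stat(z, sample(length(z)), 30))
p.value <- mean(c(T0, Tperm) >= T0)        # small p.value -> reject H0
```

The same skeleton appears in each `*.perm` wrapper below, with the decision returned as `p.value < alpha`.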
# NNT
library(RANN)        # nn2(): fast k-nearest-neighbor search
library(tibble)
library(dplyr)
library(ggplot2)
library(rbenchmark)

# Nearest-neighbor test statistic: the proportion of each point's R
# nearest neighbors that belong to the same group as the point itself.
NNT <- function(z, ix, sizes, R) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  n <- n1 + n2
  z <- z[ix, ]
  NN <- nn2(z, z, k = R + 1)            # k = R + 1: each point is its own nearest neighbor
  block1 <- NN$nn.idx[1:n1, -1]         # neighbors of group-1 points
  block2 <- NN$nn.idx[(n1 + 1):n, -1]   # neighbors of group-2 points
  i1 <- sum(block1 < n1 + .5)           # group-1 neighbors staying in group 1
  i2 <- sum(block2 > n1 + .5)           # group-2 neighbors staying in group 2
  return((i1 + i2) / (R * n))
}
# Permutation test wrapper: returns TRUE when H0 (equal distributions)
# is rejected at level alpha, based on B permutations.
NNT.perm <- function(z, sizes, R, alpha, B) {
  T0 <- NNT(z, 1:nrow(z), sizes, R)
  TPerm <- rep(0, B)
  for (b in 1:B) {
    permindex <- sample(1:nrow(z))
    TPerm[b] <- NNT(z, permindex, sizes, R)
  }
  P <- mean(c(T0, TPerm) >= T0)
  return(P < alpha)
}
# Energy test statistic computed from a precomputed distance matrix dst:
# large values indicate between-group distances exceed within-group ones.
EBT <- function(dst, ix, sizes) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  ii <- ix[1:n1]
  jj <- ix[(n1 + 1):(n1 + n2)]
  w <- n1 * n2 / (n1 + n2)
  m11 <- sum(dst[ii, ii]) / (n1 * n1)   # mean within-group-1 distance
  m22 <- sum(dst[jj, jj]) / (n2 * n2)   # mean within-group-2 distance
  m12 <- sum(dst[ii, jj]) / (n1 * n2)   # mean between-group distance
  e <- w * (2 * m12 - (m11 + m22))
  return(e)
}
EBT.perm <- function(dst, sizes, alpha, B) {
  T0 <- EBT(dst, 1:nrow(dst), sizes)
  TPerm <- rep(0, B)
  for (b in 1:B) {
    permindex <- sample(1:nrow(dst))
    TPerm[b] <- EBT(dst, permindex, sizes)
  }
  P <- mean(c(T0, TPerm) >= T0)
  return(P < alpha)
}
# Hotelling's T
# Two-sample Hotelling's T-squared statistic with a pooled covariance estimate.
HTT <- function(z, ix, sizes) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  z <- z[ix, , drop = FALSE]
  x <- z[1:n1, , drop = FALSE]
  y <- z[(n1 + 1):(n1 + n2), , drop = FALSE]
  diff.x <- sweep(x, 2, colMeans(x))    # center each group at its mean
  diff.y <- sweep(y, 2, colMeans(y))
  # pooled covariance: ((n1 - 1) * S_x + (n2 - 1) * S_y) / (n1 + n2 - 2)
  sigma.hat <- (t(diff.x) %*% diff.x + t(diff.y) %*% diff.y) / (n1 + n2 - 2)
  mean.diff <- matrix(colMeans(x) - colMeans(y), nrow = 1)
  T.square <- ((n1 * n2) / (n1 + n2)) * mean.diff %*% solve(sigma.hat) %*% t(mean.diff)
  return(T.square)
}
HTT.perm <- function(z, sizes, alpha, B) {
  T0 <- HTT(z, 1:nrow(z), sizes)
  Tperm <- rep(0, B)
  for (b in 1:B) {
    permindex <- sample(1:nrow(z))
    Tperm[b] <- HTT(z, permindex, sizes)
  }
  P <- mean(c(T0, Tperm) >= as.numeric(T0))
  return(P < alpha)
}
# Graph-based statistic: connect every pair of points whose distance is
# at most Q, then report the proportion of edges that stay within a group.
GST <- function(dst, ix, sizes, Q) {
  n1 <- sizes[1]
  n2 <- sizes[2]
  ii <- ix[1:n1]
  jj <- ix[(n1 + 1):(n1 + n2)]
  n.edges.x <- sum(dst[ii, ii] <= Q)       # within group 1 (each edge counted twice)
  n.edges.y <- sum(dst[jj, jj] <= Q)       # within group 2 (each edge counted twice)
  n.edges.across <- sum(dst[ii, jj] <= Q)  # between groups
  total.edges <- n.edges.across * 2 + n.edges.x + n.edges.y
  R <- (n.edges.x + n.edges.y) / total.edges
  return(R)
}
GST.perm <- function(dst, sizes, Q, alpha, B) {
  T0 <- GST(dst, 1:nrow(dst), sizes, Q)
  Tperm <- rep(0, B)
  for (b in 1:B) {
    permindex <- sample(1:nrow(dst))
    Tperm[b] <- GST(dst, permindex, sizes, Q)
  }
  P <- mean(c(T0, Tperm) >= T0)
  return(P < alpha)
}

Before comparing the tests directly for effectiveness, we note that the NNT, EBT, and GST tests are open to customization: we can adjust parameters within each test to better detect differences in the data. We therefore start by tuning these tests.
(1). For the following results, we use the same sample size for each group to simulate what we will encounter when testing for differences:
First, we check whether we can control the Type 1 error, which is the risk we take on when deciding that the effects of two treatments are not similar when in fact they are identical.
Second, in contrast, after controlling the Type 1 error, we observe how effective the tests are at detecting differences in treatment effects when they are in fact different. In this scenario another risk, the Type 2 error, is incurred: the risk of failing to report the differences. Note that we provide three cases to test these risks; ranging from hardest to easiest to detect, they are Locations 1, 2, and 3.
(2). From the graph below, the Type 1 error is successfully controlled under the dashed line for every test once the sample size of each group reaches 100. For GST, we report the best combination here. We conclude that the best customizations are NNT with K=3, EBT with P=2, and GST with P=2 and Q=3.
## nnt
n.neighbor<- c(1,2,3)
n1<- c(10,50,100)
n2<- c(10,50,100)
location<- c(0,0.15,0.35,0.65)
dimension<-c(2,3,4)
# d=2 & type 1 error
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d), m, d)
z <- rbind(x, y)
decision[i,t,j]=NNT.perm(z, c(n,m),R=n.neighbor[j],alpha=0.05,B=500)
}
}
}
type.1.error<-tibble(n1=c(10,50,100),n2=c(10,50,100),`k=1`=0,`k=2`=0,`k=3`=0)
for (i in 1:3){
for (j in 1:3){
type.1.error[i,j+2]<- mean(decision[i,,j])
}
}
# d=2 & Type2 error & location 0.15
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m = n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = 0.15), m, d)
z <- rbind(x, y)
decision[i,t,j]=NNT.perm(z, c(n,m),R=n.neighbor[j],alpha=0.05,B=500)
}
}
}
type.2.error.1<-tibble(n1=c(10,50,100),n2=c(10,50,100),`k=1`=0,`k=2`=0,`k=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.1[i,j+2]<- mean(1-decision[i,,j])
}
}
# d=2 & Type2 error & location 0.35
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = 0.35), m, d)
z <- rbind(x, y)
decision[i,t,j]=NNT.perm(z, c(n,m),R=n.neighbor[j],alpha=0.05,B=500)
}
}
}
type.2.error.2<-tibble(n1=c(10,50,100),n2=c(10,50,100),`k=1`=0,`k=2`=0,`k=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.2[i,j+2]<- mean(1-decision[i,,j])
}
}
# d=2 & Type2 error & location 0.65
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = 0.65), m, d)
z <- rbind(x, y)
decision[i,t,j]=NNT.perm(z, c(n,m),R=n.neighbor[j],alpha=0.05,B=500)
}
}
}
type.2.error.3<-tibble(n1=c(10,50,100),n2=c(10,50,100),`k=1`=0,`k=2`=0,`k=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.3[i,j+2]<- mean(1-decision[i,,j])
}
}
nnt.results <- cbind(type.1.error, type.2.error.1[, c(-1, -2)],
                     type.2.error.2[, c(-1, -2)], type.2.error.3[, c(-1, -2)])
## EBT
# Pairwise L-p (Minkowski) distance matrix, returned as a symmetric n x n matrix.
L.p.distance <- function(x, p) {
  x <- as.matrix(x)
  n <- dim(x)[1]
  results <- matrix(0, nrow = n, ncol = n)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      results[i, j] <- (sum(abs(x[i, ] - x[j, ])^p))^(1 / p)
    }
  }
  results <- results + t(results)   # fill the lower triangle by symmetry
  return(results)
}
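As a quick sanity check on this helper, for p = 2 it should reproduce base R's `dist()` with the Minkowski metric (this snippet assumes the `L.p.distance` function defined above is in scope):

```r
# Two points at (0, 0) and (3, 4): the Euclidean distance is 5.
x <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
d2 <- L.p.distance(x, 2)
d2[1, 2]                                           # 5
as.numeric(dist(x, method = "minkowski", p = 2))   # also 5
```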
# d=2 & type 1 error
p<- c(1,2,3)
n1<- c(10,50,100)
n2<- c(10,50,100)
location<- c(0,0.15,0.35,0.65)
dimension<-c(2,3,4)
set.seed(670486091)
times = 200
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p[j])
decision[i,t,j]=EBT.perm(dst, c(n,m),alpha=0.05,B=500)
}
}
}
type.1.error<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.1.error[i,j+2]<- mean(decision[i,,j])
}
}
# d=2 & type 2 error & location 0.15
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,0.15), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p[j])
decision[i,t,j]=EBT.perm(dst, c(n,m),alpha=0.05,B=500)
}
}
}
type.2.error.1<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.1[i,j+2]<- mean(1-decision[i,,j])
}
}
# d=2 & type 2 error & location 0.35
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,0.35), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p[j])
decision[i,t,j]=EBT.perm(dst, c(n,m),alpha=0.05,B=500)
}
}
}
type.2.error.2<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.2[i,j+2]<- mean(1-decision[i,,j])
}
}
# d=2 & type 2 error & location 0.65
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,0.65), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p[j])
decision[i,t,j]=EBT.perm(dst, c(n,m),alpha=0.05,B=500)
}
}
}
type.2.error.3<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.3[i,j+2]<- mean(1-decision[i,,j])
}
}
EBT.Lp.results <- cbind(type.1.error, type.2.error.1[, c(-1, -2)],
                        type.2.error.2[, c(-1, -2)], type.2.error.3[, c(-1, -2)])
## GST Lp distance Q=3 Type 1 error
P<- c(1,2,3)
n1<- c(10,50,100)
n2<- c(10,50,100)
location<- c(0,0.15,0.35,0.65)
dimension<-c(2,3,4)
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,P[j])
decision[i,t,j]=GST.perm(dst, c(n,m),Q=3,alpha=0.05,B=500)
}
}
}
type.1.error<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.1.error[i,j+2]<- mean(decision[i,,j])
}
}
## GST Lp distance Q=3 Type 2 error & location 0.15
P<- c(1,2,3)
n1<- c(10,50,100)
n2<- c(10,50,100)
location<- c(0,0.15,0.35,0.65)
dimension<-c(2,3,4)
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = 0.15), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,P[j])
decision[i,t,j]=GST.perm(dst, c(n,m),Q=3,alpha=0.05,B=500)
}
}
}
type.2.error.1<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.1[i,j+2]<- mean(1-decision[i,,j])
}
}
## GST Lp distance Q=3 Type 2 error & location 0.35
P<- c(1,2,3)
n1<- c(10,50,100)
n2<- c(10,50,100)
location<- c(0,0.15,0.35,0.65)
dimension<-c(2,3,4)
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = 0.35), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,P[j])
decision[i,t,j]=GST.perm(dst, c(n,m),Q=3,alpha=0.05,B=500)
}
}
}
type.2.error.2<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.2[i,j+2]<- mean(1-decision[i,,j])
}
}
## GST Lp distance Q=3 Type 2 error & location 0.65
P<- c(1,2,3)
n1<- c(10,50,100)
n2<- c(10,50,100)
location<- c(0,0.15,0.35,0.65)
dimension<-c(2,3,4)
set.seed(670486091)
times = 100
decision <- array(0,dim=c(3,times,3))
for (i in 1:3){
n = n1[i]
m=n2[i]
d=dimension[1] #bivariate case
for (j in 1:3){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = 0.65), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,P[j])
decision[i,t,j]=GST.perm(dst, c(n,m),Q=3,alpha=0.05,B=500)
}
}
}
type.2.error.3<-tibble(n1=c(10,50,100),n2=c(10,50,100),`P=1`=0,`P=2`=0,`P=3`=0)
for (i in 1:3){
for (j in 1:3){
type.2.error.3[i,j+2]<- mean(1-decision[i,,j])
}
}
GST.Q.3.results <- cbind(type.1.error, type.2.error.1[, c(-1, -2)],
                         type.2.error.2[, c(-1, -2)], type.2.error.3[, c(-1, -2)])

# NNT Plot
NNT.TYPE.1<- nnt.results %>% select(1,3:5)
ggplot(data = NNT.TYPE.1)+
geom_line(aes(as.factor(`n1`),`k=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`k=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`k=3`,group=1,color="royalblue4"))+
geom_hline(yintercept=0.05, linetype="dashed", color = "black")+
ggtitle("NNT Type 1 Error")+
labs(x="Sample Size of Each Group",y="")+
theme(axis.line = element_line(colour = "black"))+
theme(axis.title.x = element_text(size=8))+
scale_color_identity(name = "K-Neighbor",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("k=1", "k=2", "k=3"),
guide = "legend")
NNT.TYPE2.1<- nnt.results %>% select(1,6:8)
ggplot(data = NNT.TYPE2.1)+
geom_line(aes(as.factor(`n1`),`k=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`k=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`k=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "NNT Type 2 Error: Location 1")+
scale_color_identity(name = "K-Neighbor",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("k=1", "k=2", "k=3"),
guide = "legend")
NNT.TYPE2.2<- nnt.results %>% select(1,9:11)
ggplot(data = NNT.TYPE2.2)+
geom_line(aes(as.factor(`n1`),`k=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`k=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`k=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "NNT Type 2 Error: Location 2")+
scale_color_identity(name = "K-Neighbor",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("k=1", "k=2", "k=3"),
guide = "legend")
NNT.TYPE2.3<- nnt.results %>% select(1,12:14)
ggplot(data = NNT.TYPE2.3)+
geom_line(aes(as.factor(`n1`),`k=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`k=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`k=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "NNT Type 2 Error: Location 3")+
scale_color_identity(name = "K-Neighbor",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("k=1", "k=2", "k=3"),
guide = "legend")

EBT.TYPE.1<- EBT.Lp.results %>% select(1,3:5)
ggplot(data = EBT.TYPE.1)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
geom_hline(yintercept=0.05, linetype="dashed", color = "black")+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "EBT Type 1 Error")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")
EBT.TYPE2.1<- EBT.Lp.results %>% select(1,6:8)
ggplot(data = EBT.TYPE2.1)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "EBT Type 2 Error: Location 1")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")
EBT.TYPE2.2<- EBT.Lp.results %>% select(1,9:11)
ggplot(data = EBT.TYPE2.2)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "EBT Type 2 Error: Location 2")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")
EBT.TYPE2.3<- EBT.Lp.results %>% select(1,12:14)
ggplot(data = EBT.TYPE2.3)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "EBT Type 2 Error: Location 3")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")

GST.Q3.TYPE.1<- GST.Q.3.results %>% select(1,3:5)
ggplot(data = GST.Q3.TYPE.1)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
geom_hline(yintercept=0.05, linetype="dashed", color = "black")+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "GST-Q=3 Type 1 Error")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")
GST.Q3.TYPE2.1<- GST.Q.3.results %>% select(1,6:8)
ggplot(data = GST.Q3.TYPE2.1)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "GST-Q=3 Type 2 Error: Location 1")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")
GST.Q3.TYPE2.2<- GST.Q.3.results %>% select(1,9:11)
ggplot(data = GST.Q3.TYPE2.2)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "GST-Q=3 Type 2 Error: Location 2")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")
GST.Q3.TYPE2.3<- GST.Q.3.results %>% select(1,12:14)
ggplot(data = GST.Q3.TYPE2.3)+
geom_line(aes(as.factor(`n1`),`P=1`,group=1,color="firebrick3"))+
geom_line(aes(as.factor(`n1`),`P=2`,group=1,color="springgreen4"))+
geom_line(aes(as.factor(`n1`),`P=3`,group=1,color="royalblue4"))+
theme(axis.title.x = element_text(size=8))+
theme(axis.line = element_line(colour = "black"))+
labs(x="Sample Size of Each Group",y="",title = "GST-Q=3 Type 2 Error: Location 3")+
scale_color_identity(name = "Lp Distance",
breaks = c("firebrick3", "springgreen4", "royalblue4"),
labels = c("P=1", "P=2", "P=3"),
guide = "legend")

Dimension: we use 100 samples in each group. We discover that the Type 1 error cannot be controlled properly when the dimension is higher than 5 for most of the tests. For the Type 2 error, EBT and HTT are effective in all dimensions, while NNT and GST are suitable only for high dimensions.
Sensitivity: we repeat the comparison with a different type of data (Gamma) and discover that the Type 1 error can be controlled for every test except GST. For the Type 2 error, we find that EBT and HTT remain consistent, while NNT and GST are not effective even when the differences are obvious.
Efficiency: the less time a test requires, the more efficient it is. We compare the execution time of ten runs of each test for dimensions 2, 5, and 10. We find that EBT and HTT are more efficient than the others, while NNT is computationally costly, especially when the dimension is high.
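For reference, a single run of a test can be timed with `system.time` before resorting to a full benchmark; this snippet assumes the `HTT.perm` wrapper defined earlier is in scope, and its data are illustrative only:

```r
# Time one permutation test (B = 500) on two simulated bivariate groups.
set.seed(1)
z <- rbind(matrix(rnorm(200), 100, 2),
           matrix(rnorm(200, mean = 0.5), 100, 2))
elapsed <- system.time(HTT.perm(z, c(100, 100), alpha = 0.05, B = 500))["elapsed"]
elapsed   # wall-clock seconds for a single decision
```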
# NNT
n.neighbor<- 3
n1<- 100
n2<- 100
location<- c(0,0.15,0.35,0.65)
dimension<-c(1,2,5,10)
## for dimension=2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = location[j]), m, d)
z <- rbind(x, y)
decision[i,t,j]=NNT.perm(z, c(n,m),R=n.neighbor,alpha=0.05,B=500)
}
}
}
NNT.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
NNT.results[i,j]<- mean(decision[i,,j])
}else{
NNT.results[i,j]<- mean(1-decision[i,,j])
}
}
}
NNT.RESULTS<- as_tibble(NNT.results) %>% mutate(Dimension=c(1,2,5,10))
# EBT P=2
p<- 2
n1<- 100
n2<- 100
location<- c(0,0.15,0.35,0.65)
dimension<-c(1,2,5,10)
## for dimension=2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = location[j]), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p)
decision[i,t,j]=EBT.perm(dst, c(n,m),alpha=0.05,B=500)
}
}
}
EBT.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
EBT.results[i,j]<- mean(decision[i,,j])
}else{
EBT.results[i,j]<- mean(1-decision[i,,j])
}
}
}
EBT.RESULTS<- as_tibble(EBT.results) %>% mutate(Dimension=c(1,2,5,10))
# HTT
n1<- 100
n2<- 100
location<- c(0,0.15,0.35,0.65)
dimension<-c(1,2,5,10)
## for dimension=2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = location[j]), m, d)
z <- rbind(x, y)
decision[i,t,j]=HTT.perm(z, c(n,m),alpha=0.05,B=500)
}
}
}
HTT.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
HTT.results[i,j]<- mean(decision[i,,j])
}else{
HTT.results[i,j]<- mean(1-decision[i,,j])
}
}
}
HTT.RESULTS<- as_tibble(HTT.results) %>% mutate(Dimension=c(1,2,5,10))
# GST test P=2 Q=3
p<- 2
Q<-3
n1<- 100
n2<- 100
location<- c(0,0.15,0.35,0.65)
dimension<-c(1,2,5,10)
## for dimension=2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rnorm(n*d), n, d)
y <- matrix(rnorm(m*d,mean = location[j]), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p)
decision[i,t,j]=GST.perm(dst, c(n,m),Q,alpha=0.05,B=500)
}
}
}
GST.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
GST.results[i,j]<- mean(decision[i,,j])
}else{
GST.results[i,j]<- mean(1-decision[i,,j])
}
}
}
GST.RESULTS<- as_tibble(GST.results) %>% mutate(Dimension=c(1,2,5,10))

# Comparison
ggplot()+
geom_line(data=NNT.RESULTS,aes(as.factor(Dimension),V1,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS,aes(as.factor(Dimension),V1,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS,aes(as.factor(Dimension),V1,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS,aes(as.factor(Dimension),V1,group=1,color="goldenrod1"))+
geom_hline(yintercept=0.05, linetype="dashed", color = "black")+
theme(axis.line = element_line(colour = "black"))+
labs(x="Dimension",y="",title = "Type 1 Error - Normal(0,1)")+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
ggplot()+
geom_line(data=NNT.RESULTS,aes(as.factor(Dimension),V2,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS,aes(as.factor(Dimension),V2,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS,aes(as.factor(Dimension),V2,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS,aes(as.factor(Dimension),V2,group=1,color="goldenrod1"))+
labs(x="Dimension",y="",title = "Type 2 Error: Location 1")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
ggplot()+
geom_line(data=NNT.RESULTS,aes(as.factor(Dimension),V3,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS,aes(as.factor(Dimension),V3,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS,aes(as.factor(Dimension),V3,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS,aes(as.factor(Dimension),V3,group=1,color="goldenrod1"))+
labs(x="Dimension",y="",title = "Type 2 Error: Location 2")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
ggplot()+
geom_line(data=NNT.RESULTS,aes(as.factor(Dimension),V4,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS,aes(as.factor(Dimension),V4,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS,aes(as.factor(Dimension),V4,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS,aes(as.factor(Dimension),V4,group=1,color="goldenrod1"))+
labs(x="Dimension",y="",title = "Type 2 Error: Location 3")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")

## NNT-Gamma
n.neighbor<- 3
n1<- 100
n2<- 100
shape<- c(0.5,0.55,0.6,0.65)
dimension<-c(1,2,5,10)
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rgamma(n*d,shape = 0.5,scale = 0.5), n, d)
y <- matrix(rgamma(m*d,shape = shape[j],scale = 0.5), m, d)
z <- rbind(x, y)
decision[i,t,j]=NNT.perm(z, c(n,m),R=n.neighbor,alpha=0.05,B=500)
}
}
}
NNT.shape.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
NNT.shape.results[i,j]<- mean(decision[i,,j])
}else{
NNT.shape.results[i,j]<- mean(1-decision[i,,j])
}
}
}
NNT.RESULTS.gamma.shape<- as_tibble(NNT.shape.results) %>% mutate(Dimension=c(1,2,5,10))
# EBT-Gamma
p<- 2
n1<- 100
n2<- 100
shape<- c(0.5,0.55,0.6,0.65)
dimension<-c(1,2,5,10)
## for dimension=1,2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rgamma(n*d,shape = 0.5,scale = 0.5), n, d)
y <- matrix(rgamma(m*d,shape = shape[j],scale = 0.5), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p)
decision[i,t,j]=EBT.perm(dst, c(n,m),alpha=0.05,B=500)
}
}
}
EBT.shape.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
EBT.shape.results[i,j]<- mean(decision[i,,j])
}else{
EBT.shape.results[i,j]<- mean(1-decision[i,,j])
}
}
}
EBT.RESULTS.gamma.shape<- as_tibble(EBT.shape.results) %>% mutate(Dimension=c(1,2,5,10))
## HTT-Gamma
n1<- 100
n2<- 100
shape<- c(0.5,0.55,0.6,0.65)
dimension<-c(1,2,5,10)
## for dimension=1,2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rgamma(n*d,shape = 0.5,scale = 0.5), n, d)
y <- matrix(rgamma(m*d,shape = shape[j],scale = 0.5), m, d)
z <- rbind(x, y)
decision[i,t,j]=HTT.perm(z, c(n,m),alpha=0.05,B=500)
}
}
}
HTT.shape.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
HTT.shape.results[i,j]<- mean(decision[i,,j])
}else{
HTT.shape.results[i,j]<- mean(1-decision[i,,j])
}
}
}
HTT.RESULTS.gamma.shape<- as_tibble(HTT.shape.results) %>% mutate(Dimension=c(1,2,5,10))
## GST-Gamma
p<- 2
Q<- 3
n1<- 100
n2<- 100
shape<- c(0.5,0.55,0.6,0.65)
dimension<-c(1,2,5,10)
## for dimension=1,2,5,10/ sample=100
set.seed(670486091)
times = 100
decision <- array(0,dim=c(4,times,4))
for (i in 1:4){
n = n1
m = n2
d=dimension[i]
for (j in 1:4){
for (t in 1:times) {
x <- matrix(rgamma(n*d,shape = 0.5,scale = 0.5), n, d)
y <- matrix(rgamma(m*d,shape = shape[j],scale = 0.5), m, d)
z <- rbind(x, y)
dst<- L.p.distance(z,p)
decision[i,t,j]=GST.perm(dst, c(n,m),Q,alpha=0.05,B=500)
}
}
}
GST.shape.results<- matrix(0,nrow = 4, ncol = 4)
for (j in 1:4){
for (i in 1:4){
if (j<=1){
GST.shape.results[i,j]<- mean(decision[i,,j])
}else{
GST.shape.results[i,j]<- mean(1-decision[i,,j])
}
}
}
GST.RESULTS.gamma.shape<- as_tibble(GST.shape.results) %>% mutate(Dimension=c(1,2,5,10))

# Comparison-Gamma
ggplot()+
geom_line(data=NNT.RESULTS.gamma.shape,aes(as.factor(Dimension),V1,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS.gamma.shape,aes(as.factor(Dimension),V1,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS.gamma.shape,aes(as.factor(Dimension),V1,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS.gamma.shape,aes(as.factor(Dimension),V1,group=1,color="goldenrod1"))+
geom_hline(yintercept=0.05, linetype="dashed", color = "black")+
labs(x="Dimension",y="",title = "Type 1 Error - Gamma(0.5,0.5)")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
ggplot()+
geom_line(data=NNT.RESULTS.gamma.shape,aes(as.factor(Dimension),V2,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS.gamma.shape,aes(as.factor(Dimension),V2,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS.gamma.shape,aes(as.factor(Dimension),V2,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS.gamma.shape,aes(as.factor(Dimension),V2,group=1,color="goldenrod1"))+
labs(x="Dimension",y="",title = "Type 2 Error: Location 1")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
ggplot()+
geom_line(data=NNT.RESULTS.gamma.shape,aes(as.factor(Dimension),V3,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS.gamma.shape,aes(as.factor(Dimension),V3,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS.gamma.shape,aes(as.factor(Dimension),V3,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS.gamma.shape,aes(as.factor(Dimension),V3,group=1,color="goldenrod1"))+
labs(x="Dimension",y="",title = "Type 2 Error: Location 2")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
ggplot()+
geom_line(data=NNT.RESULTS.gamma.shape,aes(as.factor(Dimension),V4,group=1,color="firebrick3"))+
geom_line(data=EBT.RESULTS.gamma.shape,aes(as.factor(Dimension),V4,group=1,color="springgreen4"))+
geom_line(data=HTT.RESULTS.gamma.shape,aes(as.factor(Dimension),V4,group=1,color="royalblue4"))+
geom_line(data=GST.RESULTS.gamma.shape,aes(as.factor(Dimension),V4,group=1,color="goldenrod1"))+
labs(x="Dimension",y="",title = "Type 2 Error: Location 3")+
theme(axis.line = element_line(colour = "black"))+
scale_color_identity(name = "Method",
breaks = c("firebrick3", "springgreen4","royalblue4","goldenrod1"),
labels = c("NNT", "EBT","HTT","GST"),
guide = "legend")
# Efficiency
set.seed(670486091)
n1<- 100
n2<- 100
d<-2
x <- matrix(rnorm(n1*d), n1, d)
y <- matrix(rnorm(n2*d),n2, d)
z <- rbind(x, y)
dst<- L.p.distance(z,2)
results1<- benchmark(
columns = c("test", "replications", "elapsed", "relative"),
replications = 10,
Distance= dst<- L.p.distance(z,2),
EBT= EBT.perm(dst,c(n1,n2),alpha = 0.05,B=500),
HTT= HTT.perm(z,c(n1,n2),alpha = 0.05,B=500),
GST= GST.perm(dst,c(n1,n2),Q=3,alpha = 0.05,B=500),
NNT= NNT.perm(z,c(n1,n2),R=3,alpha = 0.05,B=500)
)
results1<- results1 %>% mutate(dimension=2)
results1[2,3]<- results1[2,3]+results1[1,3]
results1[3,3]<- results1[3,3]+results1[1,3]
results1<- results1[2:5,]
n1<- 100
n2<- 100
d<-5
x <- matrix(rnorm(n1*d), n1, d)
y <- matrix(rnorm(n2*d),n2, d)
z <- rbind(x, y)
dst<- L.p.distance(z,2)
results2<- benchmark(
columns = c("test", "replications", "elapsed", "relative"),
replications = 10,
Distance= dst<- L.p.distance(z,2),
EBT= EBT.perm(dst,c(n1,n2),alpha = 0.05,B=500),
HTT= HTT.perm(z,c(n1,n2),alpha = 0.05,B=500),
GST= GST.perm(dst,c(n1,n2),Q=3,alpha = 0.05,B=500),
NNT= NNT.perm(z,c(n1,n2),R=3,alpha = 0.05,B=500)
)
results2<- results2 %>% mutate(dimension=5)
results2[2,3]<- results2[2,3]+results2[1,3]
results2[3,3]<- results2[3,3]+results2[1,3]
results2<- results2[2:5,]
n1<- 100
n2<- 100
d<-10
x <- matrix(rnorm(n1*d), n1, d)
y <- matrix(rnorm(n2*d),n2, d)
z <- rbind(x, y)
dst<- L.p.distance(z,2)
results3<- benchmark(
columns = c("test", "replications", "elapsed", "relative"),
replications = 10,
Distance= dst<- L.p.distance(z,2),
EBT= EBT.perm(dst,c(n1,n2),alpha = 0.05,B=500),
HTT= HTT.perm(z,c(n1,n2),alpha = 0.05,B=500),
GST= GST.perm(dst,c(n1,n2),Q=3,alpha = 0.05,B=500),
NNT= NNT.perm(z,c(n1,n2),R=3,alpha = 0.05,B=500)
)
results3<- results3 %>% mutate(dimension=10)
results3[2,3]<- results3[2,3]+results3[1,3]
results3[3,3]<- results3[3,3]+results3[1,3]
results3<- results3[2:5,]
results<-rbind(results1,results2,results3)
ggplot(data = results)+
theme(axis.line = element_line(colour = "black"))+
geom_point(aes(as.factor(dimension),elapsed,shape=test,color=test),size=3)+
labs(title = "Execution Time for Running 10 Times",y="Seconds",x="Number of Dimensions")+
labs(color = "Method",shape="Method")+
scale_shape_discrete(limits = c("NNT", "EBT","HTT","GST"),labels = c("NNT", "EBT","HTT","GST")) +
scale_colour_discrete(limits = c("NNT", "EBT","HTT","GST"),labels = c("NNT", "EBT","HTT","GST"))+
scale_colour_manual(limits = c("NNT", "EBT","HTT","GST"), values=c("firebrick3","springgreen4","royalblue4","goldenrod1"))
          In this application, the goal is to evaluate two stock portfolios centered on diversified financial services banks. The overall returns and risks of the two portfolios are evaluated through both parametric and non-parametric tests in low and high dimensions. The first portfolio consists of Morningstar 4-star rated diversified financial services conglomerates, and the second consists of 3-star rated conglomerates.
          The data sets are from Yahoo Finance for the past year (251 trading days in 2021). Returns are characterized by the mean and median of the average 5-day return of each portfolio (50 data points each), while risks are characterized by the variance, skewness, and kurtosis of the average 5-day return. Note that the prices are first log-transformed and differenced to stabilize the series.
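The log-and-difference transformation described above can be sketched as follows; this is a minimal example on a synthetic price series, not the portfolio data:

```r
# Log returns: the first difference of log prices (synthetic example).
prices <- c(100, 102, 101, 105)
log_returns <- diff(log(prices))
# one fewer observation than prices; the first value is log(102/100)
length(log_returns)
```

This is exactly what the `diff(pull(...))` loops below do column by column for each stock's adjusted close.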
          Returns from the two portfolios are paired by trading date to account for the strong linear and nonlinear relationships that exist between stocks within the same industry. Hypothesis tests are executed through random shuffling of the paired returns to test whether the portfolios exhibit similar behavior regarding returns and risks. In addition, individual tests targeting each single factor are executed.
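The paired shuffling just described can be sketched as follows; this is a minimal example on synthetic pairs (the report's `paired_shuffle` function below applies the same idea to matrices of 50 date-matched rows per portfolio):

```r
# Paired shuffle: swap the two members of each pair with probability 1/2.
# Under the null that the two portfolios behave identically, the labels
# within each trading-date pair are exchangeable.
set.seed(1)
n <- 5
a <- rnorm(n)                   # e.g. portfolio 1 statistics
b <- rnorm(n)                   # e.g. portfolio 2 statistics, same dates
swap <- rbinom(n, 1, 0.5) == 1  # which pairs to swap
a_new <- ifelse(swap, b, a)
b_new <- ifelse(swap, a, b)
# each pair keeps the same two values, possibly in swapped order
all(sort(c(a_new, b_new)) == sort(c(a, b)))
```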
# Morningstar 4-star
citi<- read_csv("C:\\Users\\absur\\Desktop\\Morningstar 4 star\\C.csv")
credit.suisse<-read_csv("C:\\Users\\absur\\Desktop\\Morningstar 4 star\\CS.csv")
banco<- read_csv("C:\\Users\\absur\\Desktop\\Morningstar 4 star\\SAN.csv")
wbk<- read_csv("C:\\Users\\absur\\Desktop\\Morningstar 4 star\\WBK.csv")
morning4_log<- matrix(0,nrow = 251,ncol = 4)
name<-c("citi","credit.suisse","banco","wbk")
for (i in 1:4) {
morning4_log[,i]<-get(name[i]) %>% mutate(adj_close_log=log(`Adj Close`)) %>% pull(adj_close_log)
}
colnames(morning4_log)<-name
morning4_log<- as_tibble(morning4_log)
morning4_growth<- matrix(0,nrow = 250,ncol = 4)
for (i in 1:4) {
morning4_growth[,i]<- diff(pull(morning4_log,name[i]))
}
colnames(morning4_growth)<-name
morning4_growth<- as_tibble(morning4_growth)
morning4_growth<- morning4_growth %>% mutate(average_return=(citi+credit.suisse+banco+wbk)/4)
## Morningstar 3-star
jpm<- read_csv("C:\\Users\\absur\\Desktop\\Morningstar 3 star\\JPM.csv")
wfc<-read_csv("C:\\Users\\absur\\Desktop\\Morningstar 3 star\\WFC.csv")
ry<-read_csv("C:\\Users\\absur\\Desktop\\Morningstar 3 star\\RY.csv")
td<-read_csv("C:\\Users\\absur\\Desktop\\Morningstar 3 star\\TD.csv")
morning3_log<- matrix(0,nrow = 251,ncol = 4)
name<-c("jpm","wfc","ry","td")
for (i in 1:4) {
morning3_log[,i]<-get(name[i]) %>% mutate(adj_close_log=log(`Adj Close`)) %>% pull(adj_close_log)
}
colnames(morning3_log)<-name
morning3_log<- as_tibble(morning3_log)
morning3_growth<- matrix(0,nrow = 250,ncol = 4)
for (i in 1:4) {
morning3_growth[,i]<- diff(pull(morning3_log,name[i]))
}
colnames(morning3_growth)<-name
morning3_growth<- as_tibble(morning3_growth)
morning3_growth<- morning3_growth %>% mutate(average_return=(jpm+wfc+ry+td)/4)
# adding date
morning4_growth<- as_tibble(cbind(wfc$Date[2:251],morning4_growth)) %>% rename(Date=`wfc$Date[2:251]`)
morning3_growth<- as_tibble(cbind(wfc$Date[2:251],morning3_growth)) %>% rename(Date=`wfc$Date[2:251]`)
datebreaks <- seq(as.Date("2021-01-01"), as.Date("2021-12-30"), by = "3 month")
datebreaks<-c(datebreaks,as.Date("2021-12-31"))
ggplot()+
geom_line(data = morning4_growth, aes(Date,average_return,colour="firebrick3"))+
geom_point(data = morning4_growth, aes(Date,average_return),colour="firebrick3")+
geom_line(data = morning3_growth,aes(Date,average_return,colour="royalblue4"))+
geom_point(data = morning3_growth,aes(Date,average_return),colour="royalblue4")+
labs(title = "Daily Average Return in 2021",x="",y="Average Return")+
scale_x_date(breaks = datebreaks,labels = date_format("%b"))+
annotate("text", x = as.Date("2021-05-15"), y = 0.040, label = "Trace Each Other Closely: Correlation=0.8",
family = "serif", fontface = "italic", colour = "black", size = 5)+
scale_color_identity(name = "Portfolio",
breaks = c("firebrick3", "royalblue4"),
labels = c("4-star", "3-star"),
guide = "legend")+
theme(axis.line = element_line(colour = "black"))+
theme(legend.position = "top")
ggplot()+
geom_density(data = morning4_growth,aes(average_return,y = ..density..),size=0.8,colour="firebrick3")+
geom_density(data= morning3_growth,aes(average_return,y = ..density..),size=0.8,colour="royalblue4")+
geom_histogram(data = morning4_growth,bins = 50,position = "identity", size = .3,alpha = 0.2,aes(x=average_return,y = ..density..,fill="firebrick3"))+
geom_histogram(data = morning3_growth,bins = 50,position = "identity", size = .3,alpha = 0.2,aes(x=average_return,y = ..density..,fill="royalblue4"))+
scale_fill_identity(name = "Portfolio",
breaks = c("firebrick3", "royalblue4"),
labels = c("4-star", "3-star"),
guide = "legend")+
labs(title = "Density Approximation of Daily Average Return",x="Average Return",y="Density")+
theme(legend.position = "top")+
theme(axis.line = element_line(colour = "black"))+
annotate("text", x = -0.03, y = 32, label = "3-star Return Seems Better",
family = "serif", fontface = "italic", colour = "black", size = 5)
# 5-day Average & Corresponding Statistics
morning3_moving<-matrix(0,nrow = 50, ncol = 5)
morning4_moving<- matrix(0,nrow = 50, ncol = 5)
index<- seq(1,255,5)
for (i in 1:50){
morning3_moving[i,1]<- mean(morning3_growth$average_return[index[i]:(index[i+1]-1)])
morning3_moving[i,2]<- median(morning3_growth$average_return[index[i]:(index[i+1]-1)])
morning3_moving[i,3]<- var(morning3_growth$average_return[index[i]:(index[i+1]-1)])
morning3_moving[i,4]<- skewness(morning3_growth$average_return[index[i]:(index[i+1]-1)],method = "sample")
morning3_moving[i,5]<- kurtosis(morning3_growth$average_return[index[i]:(index[i+1]-1)],method = "moment")
}
for (i in 1:50){
morning4_moving[i,1]<- mean(morning4_growth$average_return[index[i]:(index[i+1]-1)])
morning4_moving[i,2]<- median(morning4_growth$average_return[index[i]:(index[i+1]-1)])
morning4_moving[i,3]<- var(morning4_growth$average_return[index[i]:(index[i+1]-1)])
morning4_moving[i,4]<- skewness(morning4_growth$average_return[index[i]:(index[i+1]-1)],method = "sample")
morning4_moving[i,5]<- kurtosis(morning4_growth$average_return[index[i]:(index[i+1]-1)],method = "moment")
}
name<- c("mean","median","variance","skewness","kurtosis")
colnames(morning3_moving)<- name
colnames(morning4_moving)<- name
# Paired Comparison-Testing for overall difference of distribution
paired_shuffle<- function(z){
# randomly swap the two members of each date-matched pair (rows i and i+50) with probability 1/2
temp<- matrix(0,nrow = 100, ncol = ncol(z))
perm_index<- rbinom(50,1,0.5)
for (i in 1:50) {
if (perm_index[i]==1){
temp[i,]<- z[i+50,]
temp[i+50,]<-z[i,]
}else{
temp[i,]<- z[i,]
temp[i+50,]<-z[i+50,]
}
}
return(temp)
}
## NNT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
T0<- NNT(z,1:nrow(z),c(50,50),3)
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z)
TPerm[b]<- NNT(temp,1:nrow(temp),c(50,50),3)
}
P_overall_NNT <- mean(c(T0,TPerm) >= T0) # permutation p-value, observed statistic included
## HTT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
T0<- as.numeric(HTT(z,1:nrow(z),c(50,50)))
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z)
TPerm[b]<- as.numeric(HTT(temp,1:nrow(temp),c(50,50)))
}
P_overall_HTT <- mean(c(T0,TPerm) >= T0)
## EBT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
dst<- L.p.distance(z,2)
T0<- EBT(dst,1:nrow(z),c(50,50))
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z)
dst<- L.p.distance(temp,2)
TPerm[b]<- EBT(dst,1:nrow(temp),c(50,50))
}
P_overall_EBT<- mean(c(T0,TPerm) >= T0)
## GST
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
dst<- L.p.distance(z,2)
T0<- GST(dst,1:nrow(z),c(50,50),3)
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z)
dst<- L.p.distance(temp,2)
TPerm[b]<- GST(dst,1:nrow(temp),c(50,50),3)
}
P_overall_GST<- mean(c(T0,TPerm) >= T0)
P_value<- round(c(P_overall_NNT,P_overall_EBT,P_overall_HTT,P_overall_GST),2)
Decision<- c("Fail to Reject","Fail to Reject","Reject","Fail to Reject")
results<-as.data.frame(rbind(P_value,Decision))
colnames(results)<-c("NNT","EBT","HTT","GST")
rownames(results)<-c("P-value","Decision")
results
# Testing for return
z<- rbind(morning4_moving,morning3_moving)
z_return<- z[,1:2]
dst1<-L.p.distance(z_return,2)
## NNT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_return<- z[,1:2]
T0<- NNT(z_return,1:nrow(z),c(50,50),3)
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_return)
TPerm[b]<- NNT(temp,1:nrow(temp),c(50,50),3)
}
P_return_NNT <- mean(c(T0,TPerm) >= T0)
## HTT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_return<- z[,1:2]
T0<- as.numeric(HTT(z_return,1:nrow(z),c(50,50)))
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_return)
TPerm[b]<- as.numeric(HTT(temp,1:nrow(temp),c(50,50)))
}
P_return_HTT<- mean(c(T0,TPerm) >= T0)
## EBT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_return<- z[,1:2]
dst1<- L.p.distance(z_return,2)
T0<- EBT(dst1,1:nrow(z),c(50,50))
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_return)
dst<- L.p.distance(temp,2)
TPerm[b]<- EBT(dst,1:nrow(temp),c(50,50))
}
P_return_EBT<- mean(c(T0,TPerm) >= T0)
## GST
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_return<- z[,1:2]
dst1<- L.p.distance(z_return,2)
T0<- GST(dst1,1:nrow(z),c(50,50),3)
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_return)
dst<- L.p.distance(temp,2)
TPerm[b]<- GST(dst,1:nrow(temp),c(50,50),3)
}
P_return_GST<- mean(c(T0,TPerm) >= T0)
P_value<- round(c(P_return_NNT,P_return_EBT,P_return_HTT,P_return_GST),2)
Decision<- c("Fail to Reject","Reject","Fail to Reject","Fail to Reject")
results<-as.data.frame(rbind(P_value,Decision))
colnames(results)<-c("NNT","EBT","HTT","GST")
rownames(results)<-c("P-value","Decision")
results
# Testing for risks
z<- rbind(morning4_moving,morning3_moving)
z_risk<- z[,3:5]
dst2<-L.p.distance(z_risk,2)
## NNT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_risk<- z[,3:5]
T0<- NNT(z_risk,1:nrow(z),c(50,50),3)
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_risk)
TPerm[b]<- NNT(temp,1:nrow(temp),c(50,50),3)
}
P_risk_NNT <- mean(c(T0,TPerm) >= T0)
## HTT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_risk<- z[,3:5]
T0<- as.numeric(HTT(z_risk,1:nrow(z),c(50,50)))
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_risk)
TPerm[b]<- as.numeric(HTT(temp,1:nrow(temp),c(50,50)))
}
P_risk_HTT<- mean(c(T0,TPerm) >= T0)
## EBT
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_risk<- z[,3:5]
dst2<- L.p.distance(z_risk,2)
T0<- EBT(dst2,1:nrow(z),c(50,50))
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_risk)
dst<- L.p.distance(temp,2)
TPerm[b]<- EBT(dst,1:nrow(temp),c(50,50))
}
P_risk_EBT<- mean(c(T0,TPerm) >= T0)
## GST
B<- 10000
z<- rbind(morning4_moving,morning3_moving)
z_risk<- z[,3:5]
dst2<- L.p.distance(z_risk,2)
T0<- GST(dst2,1:nrow(z),c(50,50),3)
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_risk)
dst<- L.p.distance(temp,2)
TPerm[b]<- GST(dst,1:nrow(temp),c(50,50),3)
}
P_risk_GST<- mean(c(T0,TPerm) >= T0)
P_value<- round(c(P_risk_NNT,P_risk_EBT,P_risk_HTT,P_risk_GST),2)
Decision<- c("Fail to Reject","Fail to Reject","Reject","Fail to Reject")
results<-as.data.frame(rbind(P_value,Decision))
colnames(results)<-c("NNT","EBT","HTT","GST")
rownames(results)<-c("P-value","Decision")
results
## Mean
z<- rbind(morning4_moving,morning3_moving)
z_mean<-z[,1,drop=F]
B<- 10000
T0<- median(z_mean[1:50]-z_mean[51:100])
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_mean)
TPerm[b]<- median(temp[1:50]-temp[51:100])
}
Diff1<-T0
P_single_mean <- mean(c(abs(T0), abs(TPerm)) >= abs(T0))
## Median
z<- rbind(morning4_moving,morning3_moving)
z_median<-z[,2,drop=F]
B<- 10000
T0<- mean(z_median[1:50,]-z_median[51:100,])
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_median)
TPerm[b]<- mean(temp[1:50]-temp[51:100])
}
Diff2<-T0
P_single_median <- mean(c(abs(T0), abs(TPerm)) >= abs(T0))
## variance
z<- rbind(morning4_moving,morning3_moving)
z_variance<-z[,3,drop=F]
B<- 10000
T0<- median(z_variance[1:50]-z_variance[51:100])
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_variance)
TPerm[b]<- median(temp[1:50]-temp[51:100])
}
Diff3<-T0
P_single_variance <- mean(c(abs(T0), abs(TPerm)) >= abs(T0))
## skewness
z<- rbind(morning4_moving,morning3_moving)
z_skewness<-z[,4,drop=F]
B<- 10000
T0<- median(z_skewness[1:50]-z_skewness[51:100]) #tail tend to be more on the right
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_skewness)
TPerm[b]<- median(temp[1:50]-temp[51:100])
}
Diff4<-T0
P_single_skewness <- mean(c(abs(T0), abs(TPerm)) >= abs(T0))
## kurtosis
z<- rbind(morning4_moving,morning3_moving)
z_kurtosis<-z[,5,drop=F]
B<- 10000
T0<- median(z_kurtosis[1:50]-z_kurtosis[51:100])
TPerm=rep(0,B)
for (b in 1:B){
temp<- paired_shuffle(z_kurtosis)
TPerm[b]<- median(temp[1:50]-temp[51:100])
}
Diff5<-T0
P_single_kurtosis<- mean(c(abs(T0), abs(TPerm)) >= abs(T0))
Difference<- round(c(Diff1,Diff2,Diff3,Diff4,Diff5),3)
P_value<- round(c(P_single_mean,P_single_median,P_single_variance,P_single_skewness,P_single_kurtosis),3)
Decision<-c("Reject","Fail to Reject","Fail to Reject","Fail to Reject","Fail to Reject")
results<- as.data.frame(rbind(Difference,P_value,Decision))
rownames(results)<-c("Difference","P-value","Decision")
colnames(results)<-c("Mean","Median","Variance","Skewness","Kurtosis")
results
For the overall behavior of the two portfolios, the results suggest that there is no significant difference. However, HTT suggests there is a difference. The HTT statistic is sensitive to the mean and variance, so it is important to evaluate these parameters individually as well.
Regarding returns, the mean and median of the 5-day average return are considered together. The mean is much more sensitive to extreme values, while the median is more robust. The test results are mixed, but EBT, the most reliable of the four tests, concludes there is an overall difference. Thus, the returns may behave differently when the factors are examined individually.
Regarding risks, the variance, skewness, and kurtosis are considered together. The results suggest there is no overall difference. Note again that HTT suggests a difference, but its p-value is close to the rejection boundary, indicating the decision may not be stable.
To be prudent, individual tests targeting single factors are carried out. The median of the differences between each factor from the two portfolios is used as the statistic, making the decisions more robust to extreme values. The results suggest that the means are indeed different, but the medians are not, which explains why the decisions were mixed when returns were tested as a whole. In addition, none of the individual risk-factor differences is significant, so it is safe to conclude that the overall risk factors do not differ.
In conclusion, the mean 5-day average return of the 3-star portfolio is higher than that of the 4-star portfolio by 0.2%, while the differences in risks are not significant. If the overall behavior/distribution of returns and risks is compared, the results suggest there is no significant difference.
          In this application, we consider 4 variables (4 dimensions). From the density approximation plots below, we see that the means (the white dots) are quite close in value and that the general shapes of the density curves are similar. To quantify the differences, we must perform statistical tests. With the alpha level set at 0.05, we conclude the following:
pros.data = read.table("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
X=pros.data[pros.data$age>65, c('lpsa','lweight','lcp','lbph')]
Y=pros.data[pros.data$age<=65, c('lpsa','lweight','lcp','lbph')]
z<-rbind(X,Y)
pros.data
## Violin Plot
pros.data = read.table("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
pros.data$group<-ifelse(pros.data$age>65,">65","<=65")
pros.data.plotting<- pros.data %>% select('lpsa','lweight','lcp','lbph','group')
pros.data.plotting<- pros.data.plotting %>% rename("Log PSA Score"=lpsa,
"Log Prostate Cancer Weight"=lweight,
"Log Capsular Penetration"=lcp,
"Log the Amount of Benign Prostatic Hyperplasia"=lbph)
pros.data.long<- pivot_longer(pros.data.plotting,c('Log PSA Score','Log Prostate Cancer Weight','Log Capsular Penetration','Log the Amount of Benign Prostatic Hyperplasia'),
names_to = "variable",values_to = "value")
ggplot(data = pros.data.long,aes(group,value)) +
geom_violin(trim = FALSE,adjust = .8,aes(colour=group))+
geom_boxplot(width = .03, fill = "black", outlier.colour = NA) +
stat_summary(fun = mean, geom = "point", fill = "white", shape = 21, size = 2)+
facet_wrap(~variable)+
labs(x="Age Group", y="")+
ggtitle("Men with Prostate Cancer - Comparisons of Key Variables")+
theme(strip.text = element_text(face = "bold",size = 8))+
theme(axis.text.x = element_text(colour = "black", size = rel(1.5)))+
scale_colour_manual(limits = c("<=65", ">65"), values=c("firebrick3","goldenrod1"))+
theme(legend.position = "none")
NNT.perm(z,c(43,54),R=3,alpha=0.05,B=1000)
## [1] FALSE
dst<- L.p.distance(z,2)
EBT.perm(dst,c(43,54),alpha = 0.05,B=1000)
## [1] TRUE
HTT.perm(as.matrix(z),c(43,54),alpha = 0.05,B=1000)
## [1] TRUE
GST.perm(dst,c(43,54),Q=3,alpha = 0.05,B=1000)
## [1] FALSE
test<- data.frame(NNT="Fail to Reject",EBT="Reject",HTT="Reject",GST="Fail to Reject")
test