Elements of Statistics

—An Introduction to R and RStudio

Yanxi Hou

School of Data Science, Fudan University

1 Introduction to R

1.1 Brief Introduction to R

  • R is a widely used open-source language and environment for statistical computing and graphics. It is part of the GNU project.

  • RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, and workspace management.

1.2 Example Demonstration

Let’s start with an example that uses the iris data set for a preliminary statistical analysis.

  • The library() function loads the datasets package, which contains the built-in iris data set.

  • The head() function shows the first six rows of the iris data.

library(datasets) # Load built-in data sets
head(iris)    # Show the first six lines of iris data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

As we can see, the iris data set has five variables (columns): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. We can also use the View() function to browse the whole data set.

  • The summary() function reports basic summary statistics for each variable:
summary(iris) # Summary statistics for iris data
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
  • We can also draw a scatterplot matrix of the iris data set with the plot() function.
plot(iris)    # Scatterplot matrix for iris data
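
As a small follow-up to the summary above, group-wise statistics can be computed with base R's tapply(). This is only a sketch that goes one step beyond the example; the variable name species_means is ours.

```r
library(datasets)  # provides the built-in iris data set

# Mean sepal length within each species: tapply() splits the first argument
# by the grouping factor and applies the function to each group
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means

# Number of rows and columns of the data set
dim(iris)
```

The result is a named vector with one mean per level of Species.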

2 Variables in R

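2.1 Variables in R

  • A variable is a named storage location whose data your program can manipulate;

  • A valid variable name consists of letters, digits, periods and underscores. It must start with a letter, or with a period that is not followed by a digit, and it cannot be a reserved word such as TRUE.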

Valid variables    Invalid variables
total              tot@l
sum_data           5um
.count             _count
.count.total       TRUE
Var                .0ar

The following assignments use valid variable names.

# character string
hello_string <- 'hello'
hello_string
[1] "hello"
animal <- 'tiger'
animal
[1] "tiger"
# numeric variable
.pairs <- 100
.pairs
[1] 100
first_name <- 1.23
first_name
[1] 1.23
# complex variable
comp <- 1 +2i
comp 
[1] 1+2i
# vector
vector0 <- c(1,2,3,7,6)
vector0
[1] 1 2 3 7 6
vector_1 <-  1:5
vector_1
[1] 1 2 3 4 5
# string vector
vector2 <- c('red','yellow','green')
vector2
[1] "red"    "yellow" "green" 

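The following assignments use invalid variable names, so each line raises a syntax error.

2pairs = 100     # starts with a digit
.2pairs = 100    # starts with a period followed by a digit
first num = 100  # contains a space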
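
The naming rules can be checked programmatically. The sketch below relies on base R's make.names(), which rewrites any string into a syntactically valid name; the helper is_valid_name() is a hypothetical function introduced here for illustration, not part of R.

```r
# Hypothetical helper: a name is syntactically valid if make.names()
# leaves it unchanged
is_valid_name <- function(name) name == make.names(name)

candidates <- c("total", "sum_data", ".count", "tot@l", "5um", "_count", "TRUE", ".0ar")
valid <- is_valid_name(candidates)
names(valid) <- candidates
valid

# Backticks allow an invalid name to be used anyway (rarely a good idea)
`first num` <- 100
`first num`
```

Only the first three candidates, which also appear in the "valid" column of the table, come back as TRUE.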

3 Data Types in R

3.1 Data Types in R

  • The class() function returns the class of an object, such as numeric, character, complex or integer.

  • Similarly, the typeof() function returns the internal storage type of an object.

Data Types    Examples
Logical       TRUE, FALSE
Numeric       10.5, 7, 845
Integer       3L, 40L, 4L
Complex       3+2i
Character     'a', 'hello', '13.5'
Raw           'Hello' is stored as 48 65 6c 6c 6f

For example:

x <-  100
typeof(x)
[1] "double"
class(x)
[1] "numeric"
y <- 100L
typeof(y)
[1] "integer"
class(y)
[1] "integer"
a <- TRUE
typeof(a)
[1] "logical"
class(a)
[1] "logical"
stringhello <- 'hello world'
typeof(stringhello)
[1] "character"
class(stringhello)
[1] "character"
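
The difference between double and integer storage, and the raw type from the table, can be explored with the is.*() predicates and charToRaw(). This is a minimal sketch; the variable names hello_raw and hello_back are ours.

```r
# Double versus integer: the L suffix requests integer storage
is.integer(100)    # FALSE, 100 is stored as a double
is.integer(100L)   # TRUE
is.numeric(100L)   # TRUE, integers also count as numeric

# Raw type: the bytes behind the string 'Hello'
hello_raw <- charToRaw("Hello")
hello_raw
class(hello_raw)

# Converting the bytes back gives the original string
hello_back <- rawToChar(hello_raw)
hello_back
```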

3.2 Arithmetic operators

R supports the basic arithmetic operators and mathematical functions, such as:

  • + — addition

  • - — subtraction

  • * — multiplication

  • / — division

  • %% — remainder (modulo)

  • %/% — integer quotient

  • abs() — absolute value

  • ^ — exponentiation

  • exp() — exponential function

  • sqrt() — square root

  • log() — natural logarithm

  • factorial() — factorial

  • sin(),cos(),tan() — trigonometric functions

  • choose() — binomial coefficient (number of combinations)

For example:

# add, subtract, multiply, divide and exponent
2+3*2;2^3;exp(2);2**3;(56-14)/6-4*7*10/(5^2-5)
[1] 8
[1] 8
[1] 7.389056
[1] 8
[1] -7
# remainder, integer quotient
7%%2;7%/%2;9.5 %% (-2.7);9.5 %/% (-2.7)
[1] 1
[1] 3
[1] -1.3
[1] -4
# square root, absolute value, trigonometric, logarithm
sqrt(2);sqrt(8);abs(2-4);cos(4*pi);log(0)
[1] 1.414214
[1] 2.828427
[1] 2
[1] 1
[1] -Inf
# factorial, choose
factorial(6);choose(5,2)
[1] 720
[1] 10
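
The remainder and integer quotient are linked by a simple identity, which also explains the results for negative divisors shown above. The following sketch checks it numerically and illustrates logarithms in other bases; the variable name rebuilt is ours.

```r
x <- 9.5
y <- -2.7

# The remainder and the integer quotient always satisfy
#   x == (x %/% y) * y + (x %% y)
rebuilt <- (x %/% y) * y + (x %% y)
all.equal(rebuilt, x)   # compare with a tolerance because of floating point

# log() is the natural logarithm; other bases are given explicitly
log(exp(1))
log(8, base = 2)
log10(1000)

# choose(n, k) equals n! / (k! (n - k)!)
choose(5, 2) == factorial(5) / (factorial(2) * factorial(3))
```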

3.3 Logical operators

Comparison and logical operators include:

  • > — greater than

  • < — less than

  • >= — greater than or equal to

  • <= — less than or equal to

  • == — equal to

  • != — not equal to

  • & is the AND operation, which returns TRUE only if both conditions are TRUE.

  • | is the OR operation, which returns TRUE if at least one of the conditions is TRUE.

  • ! is the NOT operation, which takes each element of a logical vector and returns the opposite logical value.

For example:

# numeric
x <- 100;y <- 200
y > x  # greater than
[1] TRUE
y < x  # less than
[1] FALSE
y >= x # greater than/equal
[1] TRUE
y <= x # less than/equal
[1] FALSE
x == y # equal to
[1] FALSE
x != y # not equal
[1] TRUE
# vector
x <- 1:5;y <- 2:6
y > x  # greater than
[1] TRUE TRUE TRUE TRUE TRUE
y < x  # less than
[1] FALSE FALSE FALSE FALSE FALSE
y >= x # greater than/equal
[1] TRUE TRUE TRUE TRUE TRUE
y <= x # less than/equal
[1] FALSE FALSE FALSE FALSE FALSE
x == y # equal to
[1] FALSE FALSE FALSE FALSE FALSE
x != y # not equal
[1] TRUE TRUE TRUE TRUE TRUE
10 > 20 & 10 < 20
[1] FALSE
20>20 & 10<20
[1] FALSE
20>=20 & 10<20
[1] TRUE
20>=20 & 10<20 & 20 <30
[1] TRUE
10 > 20 | 10 < 20
[1] TRUE
20>=20|10<20
[1] TRUE
10 > 20 | 10 < 2
[1] FALSE
10 > 20 | 10 < 2 | 10 > 15
[1] FALSE
!10==10
[1] FALSE
!(10==3)
[1] TRUE
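
Two related points are worth a short sketch: the element-wise operators & and | have scalar counterparts && and ||, and comparisons of doubles with == can fail because of rounding. The variable names below are ours.

```r
x <- c(1, 5, 10)

# & and | work element by element and return a vector
in_range <- x > 2 & x < 8
in_range

# any() and all() reduce a logical vector to a single value
any(in_range)
all(in_range)

# && and || compare single values and stop as soon as the answer is known,
# which is what an if () condition needs
n <- 0
safe <- n != 0 && (10 / n) > 1
safe

# Doubles should be compared with a tolerance, not with ==
0.1 + 0.2 == 0.3
isTRUE(all.equal(0.1 + 0.2, 0.3))
```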

These logical operations make data processing very convenient. For instance, suppose we want to extract the following three subsets of the student data set (Table 1.1-2 in the textbook):

  • Male students with height greater than or equal to 169;

  • Female students younger than 14;

  • Students aged 16 or older, or taller than 168.

We can perform the following operations:

student <- read.csv("./students.csv")
head(student)
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1
# Male students with height greater than/equal 169
x <- student[student$Gender == "M" & student$Height >= 169,]
x
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
# Female students with age less than 14
y <- student[student$Gender == "F" & student$Age < 14,]
y
      Name Age Height Gender Weight
7   JACLYN  12    162      F   65.8
28  LOUISE  12    149      F   55.8
29   ALICE  13    149      F   48.6
33 BARBARA  13    147      F   50.8
35   KATIE  12    145      F   43.1
37   SUSAN  13    137      F   30.4
38    JANE  12    135      F   33.6
39  LILLIE  12    127      F   29.1
# Students with age greater than/equal 16 or height greater than 168
z <- student[student$Age >= 16 | student$Height > 168,]
z
       Name Age Height Gender Weight
1  LAWRENCE  17    172      M   78.1
2   JEFFERY  14    169      M   51.3
4   PHILLIP  16    167      M   58.1
5      KIRK  17    167      M   60.8
14   MARTHA  16    159      F   50.8
23    LINDA  17    152      F   52.7
31   MARIAN  16    147      F   52.2

The complete student data set (40 observations):

Name Age Height Gender Weight
LAWRENCE 17 172 M 78.1
JEFFERY 14 169 M 51.3
EDWARD 14 167 M 50.8
PHILLIP 16 167 M 58.1
KIRK 17 167 M 60.8
ROBERT 15 164 M 58.1
JACLYN 12 162 F 65.8
DANNY 15 162 M 48.1
CLAY 15 162 M 47.7
HENRY 14 159 M 54.0
LESLIE 14 159 F 64.5
JOHN 13 159 M 44.5
WILLIAM 15 159 M 50.4
MARTHA 16 159 F 50.8
LEWIS 14 157 M 41.8
AMY 15 157 F 50.8
ALFRED 14 157 M 44.9
CHRIS 14 157 M 44.9
FREDRICK 14 154 M 42.2
CAROL 14 154 F 38.1
JOE 13 154 M 47.7
MARY 15 152 F 41.8
LINDA 17 152 F 52.7
MARK 15 152 M 47.2
PATTY 14 152 F 38.6
ELIZABET 14 152 F 41.3
JUDY 14 149 F 36.8
LOUISE 12 149 F 55.8
ALICE 13 149 F 48.6
JAMES 12 149 M 58.1
MARIAN 16 147 F 52.2
TIM 12 147 M 38.1
BARBARA 13 147 F 50.8
DAVID 13 145 M 35.9
KATIE 12 145 F 43.1
MICHAEL 13 142 M 43.1
SUSAN 13 137 F 30.4
JANE 12 135 F 33.6
LILLIE 12 127 F 29.1
ROBERT 12 125 M 35.9
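
The same row filters can also be written with base R's subset(), which avoids repeating the data frame name. The sketch below uses a handful of rows copied from the table above so that it runs without students.csv; the name student_small is ours.

```r
# A few rows copied from the student table, so the example runs without the CSV file
student_small <- data.frame(
  Name   = c("LAWRENCE", "JEFFERY", "EDWARD", "JACLYN", "MARTHA", "LOUISE"),
  Age    = c(17, 14, 14, 12, 16, 12),
  Height = c(172, 169, 167, 162, 159, 149),
  Gender = c("M", "M", "M", "F", "F", "F"),
  Weight = c(78.1, 51.3, 50.8, 65.8, 50.8, 55.8)
)

# subset() evaluates the condition inside the data frame,
# so columns can be named without the student_small$ prefix
tall_males    <- subset(student_small, Gender == "M" & Height >= 169)
young_females <- subset(student_small, Gender == "F" & Age < 14)
old_or_tall   <- subset(student_small, Age >= 16 | Height > 168)

tall_males$Name
young_females$Name
old_or_tall$Name
```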

5 R Objects

5.1 Vectors

A vector is a sequence of data elements of the same basic type. We usually use the following ways to create a vector:

  • c() — define a vector;

  • vector() — initialize vector;

  • seq() — generate regular sequences;

  • rep() — replicate elements of vectors.

For example: c()

x <- c(0.5,0.6)##numeric
x
[1] 0.5 0.6
x <- c(TRUE,FALSE)##logical
x
[1]  TRUE FALSE
x <- c(T,F)##logical
x
[1]  TRUE FALSE
x <- c('a','b','c')##character
x
[1] "a" "b" "c"
x <- c('red','green','yellow')##character
x
[1] "red"    "green"  "yellow"
x <- c(1+0i,2+4i)##complex
x
[1] 1+0i 2+4i
x <- c(1,2,3,4,5)##numeric (double); write c(1L,2L,3L,4L,5L) for integer
x
[1] 1 2 3 4 5
x <- 9:20##integer
x
 [1]  9 10 11 12 13 14 15 16 17 18 19 20

For example: vector()

x <- vector('numeric',length = 10)
x
 [1] 0 0 0 0 0 0 0 0 0 0
x <- vector('logical',length = 10)
x
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

For example: seq()

x <- seq(1,10, by = 1)
x
 [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(1,10, length.out = 10)
x
 [1]  1  2  3  4  5  6  7  8  9 10

For example: rep()

x <- rep(0, length.out = 10)
x
 [1] 0 0 0 0 0 0 0 0 0 0
x <- rep(NA, length.out = 10)
x
 [1] NA NA NA NA NA NA NA NA NA NA
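
seq() and rep() take a few more arguments that are often useful. The following sketch shows a non-integer step, the difference between times and each, and seq_len() as a safer way to write 1:n; the variable names are ours.

```r
# seq() with a non-integer step
quarters <- seq(0, 1, by = 0.25)
quarters

# rep() can repeat the whole vector (times) or each element (each)
whole <- rep(c(1, 2), times = 3)
whole
each_elem <- rep(c(1, 2), each = 2)
each_elem

# seq_len(n) is a safe replacement for 1:n, because seq_len(0) is empty
length(seq_len(0))
length(1:0)
```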
  • class() — identifies the class of a vector.

For example:

# Integer vector
num <- 1:10
num
 [1]  1  2  3  4  5  6  7  8  9 10
class(num)
[1] "integer"
# Numeric vector, it has a float, 10.5
num <- c(1:10,10.5)
num
 [1]  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 10.5
class(num)
[1] "numeric"
# Character vector
ltrs <- letters[1:10]
ltrs
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
class(ltrs)
[1] "character"
# Factor vector
fac <- as.factor(ltrs)
fac
 [1] a b c d e f g h i j
Levels: a b c d e f g h i j
class(fac)
[1] "factor"

If we combine two vectors of different types, all elements are coerced to a single common type. For example:

# Create a vector of numbers
numbers <- c(1,2,3,4,5,6)
class(numbers)
[1] "numeric"
# Create a vector of letters
ltrs <- c('a','b','c','d')
class(ltrs)
[1] "character"
# Concatenate the two vectors above
mixed_vec <- c(numbers,ltrs)
# c() has coerced the numbers in mixed_vec to character
print(mixed_vec)
 [1] "1" "2" "3" "4" "5" "6" "a" "b" "c" "d"
class(mixed_vec)
[1] "character"

As we can see, the numeric vector is coerced to a character vector.
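
Implicit coercion follows a fixed order from the least to the most general type. The sketch below walks through that order and shows one practical consequence, counting TRUE values with sum(); the variable names are ours.

```r
# Coercion always moves toward the more general type:
# logical -> integer -> double -> complex -> character
class(c(TRUE, 2L))
class(c(2L, 2.5))
class(c(2.5, 1+0i))
class(c(1+0i, "a"))

# A useful consequence: TRUE counts as 1 and FALSE as 0
passed <- c(TRUE, FALSE, TRUE, TRUE)
n_passed <- sum(passed)
n_passed
share_passed <- mean(passed)
share_passed
```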

  • Basic operations on vectors.

For example:

x <- c(1,2,3,4)
y <- c(5,6,2,1)
# addition
x+y
[1] 6 8 5 5
# subtraction
y-x
[1]  4  4 -1 -3
# multiplication
x*y
[1]  5 12  6  4
# division
y/x
[1] 5.0000000 3.0000000 0.6666667 0.2500000
# exponent
y^x
[1]  5 36  8  1
# sum
sum(x)
[1] 10
# Cumulative sums
cumsum(x)
[1]  1  3  6 10
# mean
mean(x)
[1] 2.5
# variance
var(x)
[1] 1.666667
# standard deviation
sd(x)
[1] 1.290994
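
The values of var() and sd() above can be reproduced from the definition of the sample variance, which divides by n - 1. The sketch also shows recycling, which R applies when the two operands differ in length; the variable names manual_var and recycled are ours.

```r
x <- c(1, 2, 3, 4)
n <- length(x)

# Sample variance computed from its definition (denominator n - 1)
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var
all.equal(manual_var, var(x))

# The standard deviation is the square root of the variance
all.equal(sqrt(manual_var), sd(x))

# Recycling: the shorter vector is repeated to match the longer one
recycled <- x + c(10, 20)
recycled
```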

Objects can be explicitly coerced from one class to another using the as.*() family of functions, such as as.numeric(), as.logical() and as.character().

x <- 0:6
class(x)
[1] "integer"
as.numeric(x)
[1] 0 1 2 3 4 5 6
as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"

If R cannot figure out how to coerce a value, the result is NA, usually accompanied by a warning.

x <- c('a','b','c')
class(x)
[1] "character"
as.numeric(x)
[1] NA NA NA
as.logical(x)
[1] NA NA NA
as.complex(x)
[1] NA NA NA
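
In practice, the NAs produced by a failed coercion have to be located and handled. A minimal sketch, assuming the input is a character vector in which some entries are not numbers; the data and variable names are made up for illustration.

```r
raw_values <- c("1.5", "2", "abc", "4")

# as.numeric() returns NA for "abc" and warns; suppressWarnings() silences
# the warning once we know it is expected
values <- suppressWarnings(as.numeric(raw_values))
values

# is.na() locates the failed conversions
failed <- is.na(values)
raw_values[failed]

# Most summary functions return NA unless missing values are removed
mean(values)
clean_mean <- mean(values, na.rm = TRUE)
clean_mean
```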

R objects can have attributes, which are metadata attached to the object. Examples of such metadata are names, dimnames, dimensions (for matrices and arrays), class (e.g. integer, numeric), length, and other user-defined attributes.

Not all R objects have attributes; for an object without any, the attributes() function returns NULL.

x <- c(1,2,3,5)
attributes(x)
NULL
y <- 1
attributes(y)
NULL
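
Attributes appear as soon as names or dimensions are assigned. The sketch below shows this with names(), dim() and attr(); the attribute called unit is a made-up example of user-defined metadata.

```r
x <- 1:6
is.null(attributes(x))   # TRUE, a plain vector has no attributes

# Naming the elements adds a 'names' attribute
names(x) <- c("a", "b", "c", "d", "e", "f")
attributes(x)

# Setting 'dim' turns a vector into a matrix
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)
attributes(m)

# attr() reads or sets a single attribute; 'unit' is a made-up example
attr(m, "unit") <- "cm"
attr(m, "unit")
```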

5.2 Matrix

  • matrix() — creates a matrix;

    • nrow and ncol set the dimensions of the matrix;

    • byrow = T fills the matrix row by row, while byrow = F (the default) fills it column by column.

For example:

  • Create a Matrix;
m <- matrix(1:9, nrow=3, ncol = 3, byrow = T)
m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
matrix(1:6, nrow = 2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
matrix(1:6, nrow = 2,byrow = T)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
# The elements of a matrix need not be numeric; here the 6 letters are recycled to fill 12 cells.
m1 <- matrix(LETTERS[1:6],nrow = 4,ncol = 3)
m1
     [,1] [,2] [,3]
[1,] "A"  "E"  "C" 
[2,] "B"  "F"  "D" 
[3,] "C"  "A"  "E" 
[4,] "D"  "B"  "F" 
m2 <- matrix(c("复","旦","大","学"),nrow = 4,ncol = 4)
m2
     [,1] [,2] [,3] [,4]
[1,] "复" "复" "复" "复"
[2,] "旦" "旦" "旦" "旦"
[3,] "大" "大" "大" "大"
[4,] "学" "学" "学" "学"
# We can name rows and cols by 'rownames' and 'colnames'.
n = matrix(1:6,byrow = T,nrow = 2)
rownames(n) = c('row1','row2')
n
     [,1] [,2] [,3]
row1    1    2    3
row2    4    5    6
colnames(n) = c('col1','col2','col3')
n
     col1 col2 col3
row1    1    2    3
row2    4    5    6
  • cbind() combines two or more vectors or matrices by columns into a new matrix;

  • rbind() combines two or more vectors or matrices by rows into a new matrix.

cbind(1:3,1:3)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
rbind(1:3,1:3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
rbind(n,7:9)
     col1 col2 col3
row1    1    2    3
row2    4    5    6
        7    8    9
cbind(n,c(10,11))
     col1 col2 col3   
row1    1    2    3 10
row2    4    5    6 11
  • Matrix mathematical operations.
A <- matrix(c(10,8,5,12), nrow = 2, byrow = TRUE)
A
     [,1] [,2]
[1,]   10    8
[2,]    5   12
B <- matrix(c(5,3,15,16), nrow = 2, byrow = TRUE)
B
     [,1] [,2]
[1,]    5    3
[2,]   15   16
# Dimension
dim(A)
[1] 2 2
# Addition
A+B
     [,1] [,2]
[1,]   15   11
[2,]   20   28
# Subtraction
A-B
     [,1] [,2]
[1,]    5    5
[2,]  -10   -4
# Element-wise product (not the matrix product, which is %*%)
A*B
     [,1] [,2]
[1,]   50   24
[2,]   75  192
# Inverse matrix
solve(A)
        [,1]   [,2]
[1,]  0.1500 -0.100
[2,] -0.0625  0.125
# Diagonal element
diag(A)
[1] 10 12
x <- matrix(1:6, nrow = 2)
x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
y <- matrix(10:15, nrow = 3)
y
     [,1] [,2]
[1,]   10   13
[2,]   11   14
[3,]   12   15
# Transpose
t(x)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
# Matrix multiplication
x %*% y
     [,1] [,2]
[1,]  103  130
[2,]  136  172
z <- y %*% x
z
     [,1] [,2] [,3]
[1,]   36   82  128
[2,]   39   89  139
[3,]   42   96  150
# Determinant
det(z)
[1] 0
# Eigenvalues and eigenvectors
eigen(z)
eigen() decomposition
$values
[1] 2.748690e+02 1.309715e-01 4.458832e-15

$vectors
           [,1]       [,2]       [,3]
[1,] -0.5308239 -0.8834752  0.4082483
[2,] -0.5761623 -0.2408149 -0.8164966
[3,] -0.6215006  0.4018454  0.4082483
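
The inverse computed by solve() can be verified, and solve() can also solve a linear system directly. A minimal sketch using the matrix A from above; the right-hand side b is a made-up example chosen so that the solution is (1, 1).

```r
A <- matrix(c(10, 8, 5, 12), nrow = 2, byrow = TRUE)

# A matrix times its inverse gives the identity matrix (up to rounding)
identity_check <- A %*% solve(A)
all.equal(identity_check, diag(2))

# solve(A, b) solves the linear system A %*% v = b without forming the inverse
b <- c(18, 17)
v <- solve(A, b)
v

# The determinant of a 2 x 2 matrix is ad - bc, here 10 * 12 - 8 * 5
det(A)
```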
  • The apply() function applies a function over the rows or columns of a matrix.

    • MARGIN = 1 indicates calculation by row;

    • MARGIN = 2 indicates calculation by column.

X <- matrix(1:30, nrow = 5)
X
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6   11   16   21   26
[2,]    2    7   12   17   22   27
[3,]    3    8   13   18   23   28
[4,]    4    9   14   19   24   29
[5,]    5   10   15   20   25   30
# Calculate the sum of each row
apply(X,1,sum)
[1]  81  87  93  99 105
# Calculate the length of each col
apply(X,2,length)
[1] 5 5 5 5 5 5
# Calculate the mean of each col
apply(X,2,mean)
[1]  3  8 13 18 23 28
# Calculate the sd of each row
apply(X,1,sd)
[1] 9.354143 9.354143 9.354143 9.354143 9.354143
# User-defined function: standard error of each row
apply(X,1, function(s) sd(s)/sqrt(length(s)))
[1] 3.818813 3.818813 3.818813 3.818813 3.818813

5.3 Data frame

  • data.frame() creates a data frame, which stores data in the form of a table.

For example:

  • Create a data frame.
BMI <- data.frame(gender = c('Male','Male','Female'),height = c(152,171.5,165),weight = c(81,93,78),Age = c(42,38,26))
print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
str(BMI)
'data.frame':   3 obs. of  4 variables:
 $ gender: chr  "Male" "Male" "Female"
 $ height: num  152 172 165
 $ weight: num  81 93 78
 $ Age   : num  42 38 26
# We also can create a data frame as follows:
name <- c('john','peter','patrick','julie','bob')
age <- c(28,30,31,38,35)
children <- c(FALSE,TRUE,TRUE,FALSE,TRUE)
df <- data.frame(Name = name, Age = age, Children = children)
df
     Name Age Children
1    john  28    FALSE
2   peter  30     TRUE
3 patrick  31     TRUE
4   julie  38    FALSE
5     bob  35     TRUE
  • Extract specified elements.
df[3,2]
[1] 31
df[3,]
     Name Age Children
3 patrick  31     TRUE
df[2]
  Age
1  28
2  30
3  31
4  38
5  35
df$Age
[1] 28 30 31 38 35
df[3,'Age']
[1] 31
df['Age']
  Age
1  28
2  30
3  31
4  38
5  35
df[c(3,5),c('Age','Children')]
  Age Children
3  31     TRUE
5  35     TRUE
  • Add new elements.
# We can define a new col and add it to original data frame.
height <- c(163,177,163,162,157)
df$Height <- height
df
     Name Age Children Height
1    john  28    FALSE    163
2   peter  30     TRUE    177
3 patrick  31     TRUE    163
4   julie  38    FALSE    162
5     bob  35     TRUE    157
# We can also use 'cbind' to add a new column (this returns a new data frame; df itself is unchanged).
weight <- c(75,65,54,34,78)
cbind(df,weight)
     Name Age Children Height weight
1    john  28    FALSE    163     75
2   peter  30     TRUE    177     65
3 patrick  31     TRUE    163     54
4   julie  38    FALSE    162     34
5     bob  35     TRUE    157     78
# Adding a new row is similar.
tom = data.frame(Name='Tom',Age = 36, Children = FALSE, Height = 182)
rbind(df,tom)
     Name Age Children Height
1    john  28    FALSE    163
2   peter  30     TRUE    177
3 patrick  31     TRUE    163
4   julie  38    FALSE    162
5     bob  35     TRUE    157
6     Tom  36    FALSE    182
  • Sort by one specific column.
# We can sort the data frame by a specified column.
sort(df$Age) # Sort only the Age values in ascending order.
[1] 28 30 31 35 38
ranks = order(df$Age) # Return the row indices that sort Age in ascending order.
ranks
[1] 1 2 3 5 4
df[ranks,] # Return the data frame sorted by Age in ascending order.
     Name Age Children Height
1    john  28    FALSE    163
2   peter  30     TRUE    177
3 patrick  31     TRUE    163
5     bob  35     TRUE    157
4   julie  38    FALSE    162
# Descending order
df[order(df$Age,decreasing = TRUE),] 
     Name Age Children Height
4   julie  38    FALSE    162
5     bob  35     TRUE    157
3 patrick  31     TRUE    163
2   peter  30     TRUE    177
1    john  28    FALSE    163
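
order() also accepts several keys, which allows sorting by one column and breaking ties with another. A minimal sketch that rebuilds the first three columns of df so that it runs on its own; the variable names idx and older are ours.

```r
df <- data.frame(
  Name     = c("john", "peter", "patrick", "julie", "bob"),
  Age      = c(28, 30, 31, 38, 35),
  Children = c(FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Sort by Children first (FALSE before TRUE), then by Age from old to young;
# the minus sign reverses the order of a numeric key
idx <- order(df$Children, -df$Age)
idx
df[idx, ]

# Logical conditions select rows of a data frame as well
older <- df[df$Age > 30, "Name"]
older
```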

5.4 List

  • list() — creates a list.

    • A list can contain objects of different types;

    • A list differs from a vector, whose elements must all have the same type.

For example:

  • Create a list
# We create a list containing a numeric vector, a character vector and a logical vector.
list1 <- list(seq(10,30,10), c('a','b','c'), c(TRUE,FALSE))
list1
[[1]]
[1] 10 20 30

[[2]]
[1] "a" "b" "c"

[[3]]
[1]  TRUE FALSE
names(list1) <- c("number","letter","logical")
list1
$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE
# We can extract elements of a list
list1[[1]][2] # Extract the second value of the first element
[1] 20
list1[[2]] # Extract the second element
[1] "a" "b" "c"
# Another way to create a list with name
list2 <- list(number = c(10,20,30), letter= c("a","b","c"), logical=c(TRUE, FALSE))
list2
$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE
# View the structure of list
str(list2) 
List of 3
 $ number : num [1:3] 10 20 30
 $ letter : chr [1:3] "a" "b" "c"
 $ logical: logi [1:2] TRUE FALSE
# An element of a list can itself be a list
list3 <- list(number = c(10,20,30), letter= c("a","b","c"), logical=c(TRUE, FALSE),list=list2)
list3
$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE

$list
$list$number
[1] 10 20 30

$list$letter
[1] "a" "b" "c"

$list$logical
[1]  TRUE FALSE
  • Extract elements of a list
# We can also extract an element of a list by using [[]]
list3[[2]]
[1] "a" "b" "c"
list3[[4]]
$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE
# We also can extract element as follows.
list3[['number']]
[1] 10 20 30
list3[c(FALSE,TRUE,FALSE,TRUE)]#TRUE for selecting, FALSE for not selecting
$letter
[1] "a" "b" "c"

$list
$list$number
[1] 10 20 30

$list$letter
[1] "a" "b" "c"

$list$logical
[1]  TRUE FALSE
list3$number
[1] 10 20 30
  • Comparison between vector and list
list1 <- list('music tracks',100,5)
list1
[[1]]
[1] "music tracks"

[[2]]
[1] 100

[[3]]
[1] 5
class(list1)
[1] "list"
is.list(list1)
[1] TRUE
vec <- c('music tracks',100,5)
vec
[1] "music tracks" "100"          "5"           
class(vec)
[1] "character"
is.list(vec)
[1] FALSE
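
Two further list operations come up often: the difference between [ ] and [[ ]], and applying a function to every element. The sketch below rebuilds list2 from above so that it runs on its own; the element name extra is a made-up example.

```r
list2 <- list(number = c(10, 20, 30), letter = c("a", "b", "c"), logical = c(TRUE, FALSE))

# [ ] keeps the list wrapper, [[ ]] returns the element itself
class(list2["number"])
class(list2[["number"]])

# sapply() applies a function to every element and simplifies the result
lengths_by_name <- sapply(list2, length)
lengths_by_name

# Elements can be added or removed by name
list2$extra <- "new"
list2$logical <- NULL
names(list2)
```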

6 Flow Control

6.1 Conditional statement

  • An if statement consists of a Boolean expression followed by one or more statements.

The format is as follows:

if (boolean_expression) {
  # statements executed if the Boolean expression is TRUE
}

For example: if x is an integer we print ‘X is an Integer’, and if x is a character string we print ‘X is a character’.

x = 30L
typeof(x)
[1] "integer"
if (is.integer(x)){
  print('X is an Integer')
}
[1] "X is an Integer"
if (is.character(x)){
  print('X is a character')
}
  • An else statement is executed when the condition in the if statement evaluates to FALSE.

The format is as follows:

if (boolean_expression) {
  # statements executed if the Boolean expression is TRUE
} else {
  # statements executed if the Boolean expression is FALSE
}

For example, we divide a score into multiple levels:

score = 60
if (score >= 80) {
  print('Good Score!')
} else if (score >= 60 && score < 80) {
  print('Decent Score!')
} else if (score < 60 && score >= 33) {
  print('Average Score!')
} else {
  print('Poor!')
}
[1] "Decent Score!"
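
An if statement tests a single value. For a whole vector of scores, the same grading can be sketched with cut(), which assigns each value to an interval, or with ifelse() for a two-way split. The score values below are made up; the thresholds are those of the example above.

```r
scores <- c(95, 60, 45, 20, 80, 33)

# Same thresholds as the if / else if chain above; right = FALSE makes each
# interval include its lower bound, e.g. [60, 80)
grades <- cut(
  scores,
  breaks = c(-Inf, 33, 60, 80, Inf),
  labels = c("Poor!", "Average Score!", "Decent Score!", "Good Score!"),
  right  = FALSE
)
as.character(grades)

# ifelse() is the vectorised form of a single if / else
pass <- ifelse(scores >= 60, "pass", "fail")
pass
```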

6.2 Loop Statement

  • while loop: a while loop repeats a block of statements as long as the test expression evaluates to TRUE.

The format is as follows:

while (test_expression) {
  statement
}

For example:

  • Print ‘Hello World’ five times.
v = c('Hello World')
count = 2
while (count < 7) {
  print(v)
  count = count + 1
}
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
  • Calculate the sum from 1 to 100.
i=1
sum = 0
while (i<101) {
  sum = sum+i
  i = i+1
}
print(sum)
[1] 5050
  • for loop: a for loop is used to iterate over the elements of a vector or list, or over a range of numbers.

The format is as follows:

for (value in vector){
  statements
}

For example:

  • Print fruits.
fruit = c('Apple','Orange','Passion fruit','Banana')
for (i in fruit){
  print(i)
}
[1] "Apple"
[1] "Orange"
[1] "Passion fruit"
[1] "Banana"
  • Calculate sum from 1 to 100.
sum = 0 
for (i in 1:100) {
  sum = sum + i
}
print(sum)
[1] 5050
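
Loops can be steered with next (skip to the next iteration) and break (leave the loop), and many numeric loops can be replaced by a single vectorised call. A minimal sketch; the variable names total and loop_free are ours.

```r
# Sum the odd numbers below 10 using next and break
total <- 0
for (i in 1:100) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 9) break        # leave the loop at the first odd number above 9
  total <- total + i
}
total

# Loops over numeric ranges can often be replaced by vectorised code
loop_free <- sum(1:100)
loop_free
loop_free == 100 * 101 / 2   # closed form n(n + 1) / 2
```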

7 Functions

7.1 Custom Functions

A function is a set of statements organized to perform a specific task. R has a large number of built-in functions, and users can also define their own.

Its format is as follows:

function_name <- function(arg_1, arg_2, ...){
  # function body
}

For example:

  • Print the squares of the integers from 1 to 4.
squares <- function(a){
  for (i in 1:a){
    b = i^2
    print(b)
  }
}
squares(4)
[1] 1
[1] 4
[1] 9
[1] 16
  • Calculate sum from 1 to 100.
Sum <- function(n){
  sum <- 0
  for (i in 1:n) {
    sum = sum + i
  }
  print(sum)
}
Sum(100)
[1] 5050
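
The functions above print their results. A function usually becomes more reusable when it returns a value instead. The following sketch shows a return value, a default argument and an early return(); sum_to() and std_error() are hypothetical names introduced here.

```r
# Hypothetical variant of Sum(): returns the result and has a default argument
sum_to <- function(n = 100) {
  if (n < 1) {
    return(0)          # early exit for empty ranges
  }
  sum(seq_len(n))      # the value of the last expression is returned
}

result <- sum_to()     # uses the default n = 100
result
sum_to(4) * 2          # a returned value can be reused in further calculations

# Standard error of the mean as a second small example
std_error <- function(x) sd(x) / sqrt(length(x))
std_error(c(1, 2, 3, 4))
```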

7.2 Built-in R functions

R supports a lot of built-in functions to work with data structures. For example:

  • seq() – create sequences.
seq(1,10,by=2)
[1] 1 3 5 7 9
  • append() – combine objects.
v <- c(11,4,5,7,3,10,2)
v2 <- c(1,2,3,4,5)
append(v, v2)
 [1] 11  4  5  7  3 10  2  1  2  3  4  5
  • sort() – sort sequences.
v <- c(11,4,5,7,3,10,2)
sort(v) 
[1]  2  3  4  5  7 10 11
sort(v,decreasing = T)
[1] 11 10  7  5  4  3  2
  • order() – return the indices that sort a vector
ranks <- order(v)
ranks
[1] 7 5 2 3 4 6 1
v[ranks]
[1]  2  3  4  5  7 10 11
  • rank() – return ranking of elements
v
[1] 11  4  5  7  3 10  2
rank(v)
[1] 7 3 4 5 2 6 1
  • rev() – reverse elements in R objects
v2 <- c(1,2,3,4,5)
rev(v2)
[1] 5 4 3 2 1

R provides various mathematical functions to perform mathematical calculations. For example:

  • abs() – absolute value of a number.
abs(-5.99)
[1] 5.99
abs(-0.002)
[1] 0.002
  • sqrt() – square root of a number.
sqrt(121)
[1] 11
  • sum() – sum of a sequence.
sum(1:100)
[1] 5050
  • floor() – round down.
floor(5.99)
[1] 5
  • ceiling()– round up.
ceiling(5.99)
[1] 6
  • round() – rounding
round(5.5)
[1] 6
round(5.1)
[1] 5
  • sign() – sign function
sign(-0.2)
[1] -1
sign(2)
[1] 1
  • max() – maximum value
vec1 <- rnorm(10,0,1) # random draws, so your values will differ from those shown
vec2 <- runif(10,0,1)
max(vec1)
[1] 2.082767
  • min() – minimum value
min(vec2)
[1] 0.03556776
  • pmax() – element-wise maximum of two vectors
pmax(vec1,vec2)
 [1] 0.4684160 0.3739669 0.8203838 0.8803938 2.0827673 0.7684691 1.6771109
 [8] 0.2103584 0.5292005 0.6926519
  • pmin() – element-wise minimum of two vectors
pmin(vec1,vec2)
 [1]  0.4452633597  0.3703759535  0.6707883861  0.5288044549  0.0355677570
 [6] -0.6783278153  0.6712036373 -0.8646459158  0.3169878945  0.0001665582
  • which.max() – index of maximum
which.max(vec1)
[1] 5
  • which.min() – index of minimum
which.min(vec1)
[1] 8
  • which() – find index
which(vec1>0)
[1]  1  2  3  4  5  7  9 10
  • sample() – random sampling
sample(1:10, 3, replace = TRUE)
[1] 8 2 1
sample(LETTERS[1:6], 3, replace = FALSE)
[1] "A" "F" "B"
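
Functions such as rnorm(), runif() and sample() give different results on every run. Fixing the seed of the random number generator with set.seed() makes them reproducible. A minimal sketch; the seed value 2024 is arbitrary.

```r
# Fixing the seed makes random results reproducible
set.seed(2024)               # the seed value itself is arbitrary
first_draw <- sample(1:10, 3)

set.seed(2024)
second_draw <- sample(1:10, 3)

identical(first_draw, second_draw)   # same seed, same sample

# Without replacement, a sample cannot contain duplicates
anyDuplicated(sample(LETTERS[1:6], 6, replace = FALSE)) == 0
```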

8 Data Manipulation

8.1 Data Manipulation with dplyr

  • dplyr – a package used to transform and summarize tabular data with rows and columns.

    • install.packages("dplyr")

    • library(dplyr)

Some frequently used functions in dplyr are:

  • select()

  • filter()

  • arrange()

  • summarise()

  • mutate()

  • transmute()

  • group_by()

We will use student data set mentioned above as an example to illustrate the convenience of dplyr. The structure of this data set is as follows:

library(dplyr)
student <- read.csv("./students.csv") # read data set
head(student) # Just show first 6 rows.
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1
  • select() – Selects column variables based on their names.
# We select column 'Height' in student.
s1 <- select(student, Height)
head(s1)
  Height
1    172
2    169
3    167
4    167
5    167
6    164
# We select column 'Weight' in student.
s2 <- select(student, Weight)
head(s2)
  Weight
1   78.1
2   51.3
3   50.8
4   58.1
5   60.8
6   58.1
  • filter() – Filters rows based on their values.
# We extract all male students.
f1 <- filter(student,Gender=="M")
head(f1)
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1
# We extract female students with height > 155.
f2 <- filter(student,Gender=="F", Height>155)
head(f2)
    Name Age Height Gender Weight
1 JACLYN  12    162      F   65.8
2 LESLIE  14    159      F   64.5
3 MARTHA  16    159      F   50.8
4    AMY  15    157      F   50.8
  • arrange() – Changes the ordering of the rows.
# We will arrange the students in ascending order of Height.
a1 <- arrange(student,Height)
head(a1)
     Name Age Height Gender Weight
1  ROBERT  12    125      M   35.9
2  LILLIE  12    127      F   29.1
3    JANE  12    135      F   33.6
4   SUSAN  13    137      F   30.4
5 MICHAEL  13    142      M   43.1
6   DAVID  13    145      M   35.9
# We will arrange the students in descending order of Weight.
a2 <- arrange(student,desc(Weight))
head(a2)
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2   JACLYN  12    162      F   65.8
3   LESLIE  14    159      F   64.5
4     KIRK  17    167      M   60.8
5  PHILLIP  16    167      M   58.1
6   ROBERT  15    164      M   58.1
  • summarise() – Reduces multiple values down to a single summary.
# Obtain the average of Height
summarise(student,avg_Height=mean(Height,na.rm=T))
  avg_Height
1     153.25
# Obtain the sum of Height
summarise(student,tot_Height=sum(Height,na.rm=T))
  tot_Height
1       6130
# Obtain the standard deviation of Weight
summarise(student,stdev_Weight=sd(Weight,na.rm=T))
  stdev_Weight
1     10.07415
# Obtain the average and the sum of Weight simultaneously
summarise(student,avg_Weight=mean(Weight,na.rm=T),tot_Weight=sum(Weight,na.rm=T))
  avg_Weight tot_Weight
1    47.6625     1906.5
  • mutate() – Creates columns that are functions of existing variables.
# We add a new column named 'BMI' to the original data set,
# where BMI = Weight / Height^2 with Height converted by the factor 0.01 (cm to m).
student1 <- mutate(student, BMI = Weight / (0.01*Height)^2)
head(student1)
      Name Age Height Gender Weight      BMI
1 LAWRENCE  17    172      M   78.1 26.39941
2  JEFFERY  14    169      M   51.3 17.96156
3   EDWARD  14    167      M   50.8 18.21507
4  PHILLIP  16    167      M   58.1 20.83259
5     KIRK  17    167      M   60.8 21.80071
6   ROBERT  15    164      M   58.1 21.60173
  • transmute() – Creates new columns and keeps only them.
# Same example as above.
student2 <- transmute(student, BMI = Weight / (0.01*Height)^2)
head(student2)
       BMI
1 26.39941
2 17.96156
3 18.21507
4 20.83259
5 21.80071
6 21.60173
  • group_by() – Groups the data set by one or more columns. It is usually combined with other functions such as summarise().
# We group by Gender in student and calculate the sum and mean of Height for each group.
by_type <- group_by(student, Gender)
summarise(by_type,Height_sum=sum(Height),Height_mean=mean(Height))
# A tibble: 2 × 3
  Gender Height_sum Height_mean
  <chr>       <int>       <dbl>
1 F            2684        149.
2 M            3446        157.

As we can see, the student data set is divided into 2 groups by Gender.
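
For comparison, a grouped summary of this kind can also be sketched in base R with aggregate(), which is handy for cross-checking a dplyr result. The data frame student_small holds a few rows copied from the student table so that the example runs without students.csv; its name is ours.

```r
# A few rows copied from the student table, so the example needs no CSV file
student_small <- data.frame(
  Gender = c("M", "M", "M", "F", "F", "F"),
  Height = c(172, 169, 167, 162, 159, 149)
)

# Formula interface: summarise Height separately for each Gender
height_mean <- aggregate(Height ~ Gender, data = student_small, FUN = mean)
height_sum  <- aggregate(Height ~ Gender, data = student_small, FUN = sum)
height_mean
height_sum
```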

8.2 Usage of pipe operator %>%

The pipe operator %>%, which dplyr makes available (it originates in the magrittr package), passes the output of one function as the first argument of the next function, which makes chained operations more convenient. Let’s see some examples.

  • Example 1: we group the student data by Gender and calculate the sum and mean of Height for each group.
# Without pipe %>%.
by_type <- group_by(student, Gender)
summarise(by_type,Height_sum=sum(Height),Height_mean=mean(Height))
# A tibble: 2 × 3
  Gender Height_sum Height_mean
  <chr>       <int>       <dbl>
1 F            2684        149.
2 M            3446        157.
# With pipe %>%.
student %>% group_by(Gender) %>% summarise(Height_sum=sum(Height),Height_mean=mean(Height))
# A tibble: 2 × 3
  Gender Height_sum Height_mean
  <chr>       <int>       <dbl>
1 F            2684        149.
2 M            3446        157.
  • Example 2: we select all male students, draw a random sample of 5 of them, and arrange the sample in descending order of Height.
# Without pipe %>%.
f <- filter(student,Gender == "M")
s <- sample_n(f,size = 5)
a <- arrange(s,desc(Height))
a
     Name Age Height Gender Weight
1  EDWARD  14    167      M   50.8
2 PHILLIP  16    167      M   58.1
3   HENRY  14    159      M   54.0
4  ALFRED  14    157      M   44.9
5     JOE  13    154      M   47.7
# With pipe %>% (the rows differ from those above because sample_n() draws a new random sample).
a <- student %>% filter(Gender == "M") %>% sample_n(size = 5) %>% arrange(desc(Height))
a
    Name Age Height Gender Weight
1  CHRIS  14    157      M   44.9
2  LEWIS  14    157      M   41.8
3   MARK  15    152      M   47.2
4    TIM  12    147      M   38.1
5 ROBERT  12    125      M   35.9
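
Base R has its own pipe operator, |>, which works without loading any package. The sketch below assumes R 4.1 or later, where this operator was introduced; the data and variable names are made up for illustration.

```r
# Native pipe: the left-hand side becomes the first argument of the call on the right
piped  <- c(1, 4, 9) |> sqrt() |> sum()
nested <- sum(sqrt(c(1, 4, 9)))
piped
identical(piped, nested)

# It also works with data frames and base-R functions
n_tall <- data.frame(Height = c(172, 149, 167)) |>
  subset(Height > 150) |>
  nrow()
n_tall
```

Reading the piped version from left to right follows the order in which the steps are executed, which is the main reason to prefer it over nested calls.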

8.3 Data Manipulation with tidyr

  • tidyr – a package that helps us create tidy data. Tidy data is easy to visualize and model.

    • install.packages("tidyr")

    • library(tidyr)

Some frequently used functions in tidyr are:

  • gather()

  • separate()

  • unite()

  • spread()

We first create a data frame:

library(tidyr)
data <- data.frame(
  ID = c(1:10),
  Face_1 = c(411,723,325,456,579,612,709,513,527,379),
  Face_2 = c(123,300,400,500,600,654,789,906,413,567),
  Face_3 = c(1457,1000,569,896,956,2345,780,599,1023,678)
)
print(data)
   ID Face_1 Face_2 Face_3
1   1    411    123   1457
2   2    723    300   1000
3   3    325    400    569
4   4    456    500    896
5   5    579    600    956
6   6    612    654   2345
7   7    709    789    780
8   8    513    906    599
9   9    527    413   1023
10 10    379    567    678
  • gather() – Reshape data from wide format to long format.
# We gather the columns 'Face_1' through 'Face_3' (second to fourth column)
# into a key column 'Face' and a value column 'ResponseTime'.
long <- data %>% gather(Face, ResponseTime, Face_1:Face_3)
long
   ID   Face ResponseTime
1   1 Face_1          411
2   2 Face_1          723
3   3 Face_1          325
4   4 Face_1          456
5   5 Face_1          579
6   6 Face_1          612
7   7 Face_1          709
8   8 Face_1          513
9   9 Face_1          527
10 10 Face_1          379
11  1 Face_2          123
12  2 Face_2          300
13  3 Face_2          400
14  4 Face_2          500
15  5 Face_2          600
16  6 Face_2          654
17  7 Face_2          789
18  8 Face_2          906
19  9 Face_2          413
20 10 Face_2          567
21  1 Face_3         1457
22  2 Face_3         1000
23  3 Face_3          569
24  4 Face_3          896
25  5 Face_3          956
26  6 Face_3         2345
27  7 Face_3          780
28  8 Face_3          599
29  9 Face_3         1023
30 10 Face_3          678
  • separate() – Split a single column into multiple columns.
# We divide the column 'Face' of 'long' data set into two columns,
# and name them 'Target' and 'Number'.
long_separate <- long %>% separate(Face, c("Target", "Number"), sep = "_")
long_separate
   ID Target Number ResponseTime
1   1   Face      1          411
2   2   Face      1          723
3   3   Face      1          325
4   4   Face      1          456
5   5   Face      1          579
6   6   Face      1          612
7   7   Face      1          709
8   8   Face      1          513
9   9   Face      1          527
10 10   Face      1          379
11  1   Face      2          123
12  2   Face      2          300
13  3   Face      2          400
14  4   Face      2          500
15  5   Face      2          600
16  6   Face      2          654
17  7   Face      2          789
18  8   Face      2          906
19  9   Face      2          413
20 10   Face      2          567
21  1   Face      3         1457
22  2   Face      3         1000
23  3   Face      3          569
24  4   Face      3          896
25  5   Face      3          956
26  6   Face      3         2345
27  7   Face      3          780
28  8   Face      3          599
29  9   Face      3         1023
30 10   Face      3          678
  • unite() – Combine multiple columns into a single column.
# We combine the columns 'Target' and 'Number' of the 'long_separate' data set 
# into a single column named 'Face'.
long_unite <- long_separate %>% unite(Face, Target, Number, sep = "_")
long_unite
   ID   Face ResponseTime
1   1 Face_1          411
2   2 Face_1          723
3   3 Face_1          325
4   4 Face_1          456
5   5 Face_1          579
6   6 Face_1          612
7   7 Face_1          709
8   8 Face_1          513
9   9 Face_1          527
10 10 Face_1          379
11  1 Face_2          123
12  2 Face_2          300
13  3 Face_2          400
14  4 Face_2          500
15  5 Face_2          600
16  6 Face_2          654
17  7 Face_2          789
18  8 Face_2          906
19  9 Face_2          413
20 10 Face_2          567
21  1 Face_3         1457
22  2 Face_3         1000
23  3 Face_3          569
24  4 Face_3          896
25  5 Face_3          956
26  6 Face_3         2345
27  7 Face_3          780
28  8 Face_3          599
29  9 Face_3         1023
30 10 Face_3          678
  • spread() – Take a key column and a value column and spread them into multiple columns.
# We split the column 'Face' of 'long_unite' data set back into three 
# columns 'Face_1', 'Face_2' and 'Face_3'.
back_to_data <- long_unite %>% spread(Face, ResponseTime)
back_to_data
   ID Face_1 Face_2 Face_3
1   1    411    123   1457
2   2    723    300   1000
3   3    325    400    569
4   4    456    500    896
5   5    579    600    956
6   6    612    654   2345
7   7    709    789    780
8   8    513    906    599
9   9    527    413   1023
10 10    379    567    678

9 Data Visualization

9.1 Data Visualization in R

R provides a variety of functions and packages for visualizing data in exploratory data analysis.

  • plot() – A generic function for plotting R objects.

  • barplot() – Plots data using rectangular bars.

  • hist() – Creates histograms.

  • boxplot() – Summarizes data by its quartiles.

  • ggplot() – From the ggplot2 package; it enables users to create sophisticated visualizations with little code using the Grammar of Graphics.

  • plot_ly() – From the plotly package; it creates interactive web-based graphs via the open source JavaScript library plotly.js.

Next, we mainly introduce the usage of plot(); barplot(), hist() and boxplot() are introduced in the next section. The ggplot2 and plotly packages are left for students to explore.

We again use the student data set to illustrate the plot() function.

student <- read.csv("./students.csv")
# We first draw a scatter-plot for Height vs Weight.
plot(student$Height,
     student$Weight,
     type = "p", # type of plot: points 
     pch = 16, # shape of scattered points
     xlab = "Height",
     ylab = "Weight",
     xlim = c(120,175),
     ylim = c(20,80),
     main = "Height vs Weight",
     col = "red")

# We next draw a Line-plot for Height.
plot(student$Height,
     type = "l", # type of plot: line 
     lty = 1, # type of line
     lwd = 3, # width of line
     xlab = "Student",
     ylab = "Height",
     main = "Height of Student",
     ylim = c(120,175),
     col = "blue")

# We draw three series for Height, Weight and Age.
# Height and Weight are standardized; Age is only centered.
height <- (student$Height - mean(student$Height))/sd(student$Height)
weight <- (student$Weight - mean(student$Weight))/sd(student$Weight)
age <- student$Age - mean(student$Age)
plot(height,
     type = "l", # type of plot: line 
     lty = 2, # type of line: dashed
     lwd = 3, # width of line
     xlab = "Student",
     ylab = "Values",
     main = "Height, Weight and Age of Student",
     col = "blue")
lines(weight,
     type = "l", # type of plot: line 
     lty = 1, # type of line: solid
     lwd = 3, # width of line
     col = "red")
points(age, pch = 16, col = "brown")
legend('topright',
       legend = c('Height','Weight','Age'),
       col = c('blue','red','brown'),
       lty = c(2,1,0), # same line types as in plot() and lines()
       pch = c(NA,NA,16),
       lwd = c(3,3,2),
       ncol = 1)

10 Data Description

10.1 Data Set

10.1.1 Data Table

In statistical analysis, the data are often a description of several individuals (objects), and each individual is described by a number of indicators (characteristics) of interest. For example, suppose we survey the students in a class and denote the \(n\) indicators of the \(i\)-th student by \[ x_{i1},x_{i2},\cdots ,x_{in}. \] The records of \(m\) students then contain \(m \times n\) values in total, which can be arranged in a table:

Number variable1 variable2 \(\cdots\) variable\(n\)
1 \(x_{11}\) \(x_{12}\) \(\cdots\) \(x_{1n}\)
2 \(x_{21}\) \(x_{22}\) \(\cdots\) \(x_{2n}\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\ddots\) \(\vdots\)
\(m\) \(x_{m1}\) \(x_{m2}\) \(\cdots\) \(x_{mn}\)

The following table lists the Name, Age, Height, Gender and Weight of a class of 40 students (Table 1.1-2 in the textbook):

student <- read.csv("./students.csv")
head(student)
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1
Name Age Height Gender Weight
LAWRENCE 17 172 M 78.1
JEFFERY 14 169 M 51.3
EDWARD 14 167 M 50.8
PHILLIP 16 167 M 58.1
KIRK 17 167 M 60.8
ROBERT 15 164 M 58.1
JACLYN 12 162 F 65.8
DANNY 15 162 M 48.1
CLAY 15 162 M 47.7
HENRY 14 159 M 54.0
LESLIE 14 159 F 64.5
JOHN 13 159 M 44.5
WILLIAM 15 159 M 50.4
MARTHA 16 159 F 50.8
LEWIS 14 157 M 41.8
AMY 15 157 F 50.8
ALFRED 14 157 M 44.9
CHRIS 14 157 M 44.9
FREDRICK 14 154 M 42.2
CAROL 14 154 F 38.1
JOE 13 154 M 47.7
MARY 15 152 F 41.8
LINDA 17 152 F 52.7
MARK 15 152 M 47.2
PATTY 14 152 F 38.6
ELIZABET 14 152 F 41.3
JUDY 14 149 F 36.8
LOUISE 12 149 F 55.8
ALICE 13 149 F 48.6
JAMES 12 149 M 58.1
MARIAN 16 147 F 52.2
TIM 12 147 M 38.1
BARBARA 13 147 F 50.8
DAVID 13 145 M 35.9
KATIE 12 145 F 43.1
MICHAEL 13 142 M 43.1
SUSAN 13 137 F 30.4
JANE 12 135 F 33.6
LILLIE 12 127 F 29.1
ROBERT 12 125 M 35.9

10.1.2 Data File

The data recording each attribute of each observation are stored on a computer in a data file with a specific format. Such a file goes by different names in different fields.

  • In mathematics, it is a matrix;
  • In databases, it is a table;
  • In statistics, it is a set of records of observed values;
  • In R, it is a data frame.

Each row of this matrix is called a record or observation; it records the values of the various characteristics of one observed object.

Each column of the matrix is called a field or variable; it records the value of one characteristic for all observed objects.

Generally, data in the same column are required to have the same type. For example, the Gender column may contain only the characters M or F, while the Height column may contain only numbers.
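
The row and column view of a data frame can be sketched as follows. The object name student_head is introduced here only for illustration; its values are the first three records of the student table.

```r
# First three records of the student table
student_head <- data.frame(
  Name   = c("LAWRENCE", "JEFFERY", "EDWARD"),
  Age    = c(17, 14, 14),
  Height = c(172, 169, 167),
  Gender = c("M", "M", "M"),
  Weight = c(78.1, 51.3, 50.8),
  stringsAsFactors = FALSE      # keep text columns as character vectors
)

dim(student_head)               # number of records (rows) and variables (columns)
col_types <- sapply(student_head, class)
col_types                       # every column has a single type
student_head[2, ]               # a row is one record
student_head$Height             # a column is one variable
```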

10.1.3 Table Merging

In statistical analysis, it is often necessary to collect the individual characteristics of the research objects into one table. In R, there are three commonly used functions for merging data frames.

  • rbind() – Merge data frames with the same variables vertically (stack the rows).
data1 <- student[1:5,]
data1
      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
data2 <- student[9:14,]
data2
      Name Age Height Gender Weight
9     CLAY  15    162      M   47.7
10   HENRY  14    159      M   54.0
11  LESLIE  14    159      F   64.5
12    JOHN  13    159      M   44.5
13 WILLIAM  15    159      M   50.4
14  MARTHA  16    159      F   50.8
data3 <- rbind(data1,data2)
data3
       Name Age Height Gender Weight
1  LAWRENCE  17    172      M   78.1
2   JEFFERY  14    169      M   51.3
3    EDWARD  14    167      M   50.8
4   PHILLIP  16    167      M   58.1
5      KIRK  17    167      M   60.8
9      CLAY  15    162      M   47.7
10    HENRY  14    159      M   54.0
11   LESLIE  14    159      F   64.5
12     JOHN  13    159      M   44.5
13  WILLIAM  15    159      M   50.4
14   MARTHA  16    159      F   50.8
  • cbind() – Merge data frames with the same number of rows horizontally (place the columns side by side).
Grades <- c(88,78,90,65,73)
Level <- c("A","B","A","C","B")
data4 <- data.frame(Grades,Level)
data4
  Grades Level
1     88     A
2     78     B
3     90     A
4     65     C
5     73     B
data5 <- cbind(data1,data4)
data5
      Name Age Height Gender Weight Grades Level
1 LAWRENCE  17    172      M   78.1     88     A
2  JEFFERY  14    169      M   51.3     78     B
3   EDWARD  14    167      M   50.8     90     A
4  PHILLIP  16    167      M   58.1     65     C
5     KIRK  17    167      M   60.8     73     B
  • merge() – Merge data frames by matching the values of common variables.
Name <- c("LAWRENCE","JEFFERY", "EDWARD", "PHILLIP", "KIRK")
Grades <- c(88,78,90,65,73)
Level <- c("A","B","A","C","B")
data6 <- data.frame(Name,Grades,Level)
data6
      Name Grades Level
1 LAWRENCE     88     A
2  JEFFERY     78     B
3   EDWARD     90     A
4  PHILLIP     65     C
5     KIRK     73     B
data7 <- merge(data1, data6, by = "Name")
data7
      Name Age Height Gender Weight Grades Level
1   EDWARD  14    167      M   50.8     90     A
2  JEFFERY  14    169      M   51.3     78     B
3     KIRK  17    167      M   60.8     73     B
4 LAWRENCE  17    172      M   78.1     88     A
5  PHILLIP  16    167      M   58.1     65     C
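
In the example above every name appears in both data frames. The following sketch shows what merge() does when the keys only partly overlap. The object names left, right, inner and outer are hypothetical; the values are taken from the tables above.

```r
# Two small tables that share the key column Name
left  <- data.frame(Name = c("LAWRENCE", "JEFFERY", "EDWARD"),
                    Age  = c(17, 14, 14))
right <- data.frame(Name   = c("EDWARD", "LAWRENCE", "KIRK"),
                    Grades = c(90, 88, 73))

# Default (all = FALSE): keep only the names present in both tables
inner <- merge(left, right, by = "Name")
inner

# all = TRUE: keep every name and fill the gaps with NA
outer <- merge(left, right, by = "Name", all = TRUE)
outer
```

Unlike cbind(), merge() matches rows by key rather than by position, and by default it sorts the result by the key column.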

10.1.4 Special Value

During data collection, special values sometimes occur, for example when a measurement is missing or a calculation has no defined result. In R, there are three special values.

  • Inf — Infinity;

  • NaN — Not a Number;

  • NA — Not Available (Missing value).

Some calculations involving these special values:
2/0;2/Inf;exp(-Inf);Inf-Inf;Inf/Inf;0/0;
[1] Inf
[1] 0
[1] 0
[1] NaN
[1] NaN
[1] NaN
  • is.na(), is.nan() – Identify NA and NaN in a data set.
x <- c(5,6,8,NA,23,NA,9,2)
is.na(x) # Identify missing values
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
which(is.na(x)) # Identify the index of missing value
[1] 4 6
x[! is.na(x)] # Remove missing values
[1]  5  6  8 23  9  2
sum(is.na(x)) # Count missing values
[1] 2
y <- c(6,6,NaN,0,23,3,9,NaN)
is.nan(y) # Identify NaN
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
which(is.nan(y)) # Identify the index of NaN
[1] 3 8
y[! is.nan(y)] # Remove NaN
[1]  6  6  0 23  3  9
sum(is.nan(y)) # Count NaN
[1] 2
  • na.omit() and the argument na.rm = TRUE – Remove missing values (NA and NaN) before a calculation.
mean(x, na.rm = TRUE)
[1] 8.833333
mean(y, na.rm = TRUE)
[1] 7.833333
mean(na.omit(x))
[1] 8.833333
mean(na.omit(y))
[1] 7.833333
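
One detail worth checking is how the tests treat each special value. A minimal sketch with a hypothetical vector z that contains all three:

```r
z <- c(1, NA, NaN, Inf)

na_flags     <- is.na(z)      # TRUE for both NA and NaN
nan_flags    <- is.nan(z)     # TRUE only for NaN
finite_flags <- is.finite(z)  # FALSE for NA, NaN and Inf
rbind(na_flags, nan_flags, finite_flags)

mean(z, na.rm = TRUE)         # NA and NaN are dropped but Inf remains
mean(z[is.finite(z)])         # keep only ordinary numbers
```

This is why mean(y, na.rm = TRUE) in the example above also removes the NaN entries of y.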

10.1.5 Type of Variables

According to their meaning and scale of measurement, variables can be divided into quantitative and qualitative: \[ \text{Quantitative}= \begin{cases} \text{Continuous} \\ \text{Discrete} \end{cases} \qquad \text{Qualitative} = \begin{cases} \text{Ordinal} \\ \text{Nominal} \end{cases} \]

  • Quantitative: Height, weight, age, etc. The value of such a variable is a quantity, and it is meaningful to perform arithmetic operations on its values.

    • Continuous: The possible values of such a variable can fill an entire interval, such as height and weight. Such variables are also called interval or real variables.

    • Discrete: Such a variable can take only a limited (finite or countable) number of values, such as age in years.

  • Qualitative: Gender, variety, province, etc. Different values of such a variable represent different categories, and arithmetic operations on them are either impossible or meaningless.

    • Nominal: There is no natural order among the values of such a variable, such as gender or province.

    • Ordinal: The values have a natural order, such as size (extra large, large, medium, small).

Discrete, Nominal and Ordinal variables are collectively referred to as categorical.
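
In R, qualitative variables are usually stored as factors. A minimal sketch, with made-up values for gender and size, of the difference between a nominal and an ordinal factor:

```r
# Nominal: the levels have no order
gender <- factor(c("M", "F", "F", "M"))

# Ordinal: the levels are listed from smallest to largest
size <- factor(c("small", "large", "medium", "small"),
               levels  = c("small", "medium", "large", "extra large"),
               ordered = TRUE)

levels(gender)
is.ordered(gender)
below_large <- size < "large"   # comparisons are meaningful for ordered factors
below_large
table(size)                     # frequencies, including the unused level
```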

10.2 Frequency

10.2.1 Frequency Statistics

The table contains observations of multiple variables; we start with the analysis of a single variable. Let the results of \(n\) observations of a variable be \[x_1, x_2, \cdots, x_n.\] We first need to know which values are taken and in what proportions, that is, the distribution of these observations.

Gender  Frequency  Percent  Cumulative Frequency  Cumulative Percent
M              22      55%                    22                 55%
F              18      45%                    40                100%

Age  Frequency  Percent  Cumulative Frequency  Cumulative Percent
12           8      20%                     8                 20%
13           7    17.5%                    15               37.5%
14          12      30%                    27               67.5%
15           7    17.5%                    34                 85%
16           3     7.5%                    37               92.5%
17           3     7.5%                    40                100%

Besides this, we can also compute frequencies for combinations of the values of several variables.
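
Both kinds of frequency table can be sketched with table(). The ages are rebuilt from the Age frequency table above, and the two-way table uses only the first eight students of the student table. The object names freq_table and two_way are chosen here for illustration.

```r
# Ages rebuilt from the frequency table (8 students aged 12, 7 aged 13, ...)
age <- rep(12:17, times = c(8, 7, 12, 7, 3, 3))

freq <- table(age)
freq_table <- data.frame(
  Age        = as.integer(names(freq)),
  Frequency  = as.vector(freq),
  Percent    = 100 * as.vector(prop.table(freq)),
  CumFreq    = cumsum(as.vector(freq)),
  CumPercent = 100 * cumsum(as.vector(prop.table(freq)))
)
freq_table

# Two-way frequencies of Gender and Age for the first eight students
gender8 <- c("M", "M", "M", "M", "M", "M", "F", "M")
age8    <- c(17, 14, 14, 16, 17, 15, 12, 15)
two_way <- table(Gender = gender8, Age = age8)
two_way
```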

10.2.2 Bar Plot

In addition to recording frequencies in a table, we can display them in a bar plot: columns of the same width are placed side by side, and the height of each column represents the frequency.

  • barplot()-Bar plot.

  • pie()-Pie plot.

# Frequency table of Age
stat <- table(student$Age)
Age <- as.numeric(names(stat))
Freq <- as.numeric(stat)
age <- data.frame(Age, Freq)
# Vertical bar plot of age
col <- rainbow(6)
barplot(Freq~Age,
        data = age,
        xlab = "Age",
        ylab = "Frequency",
        col = col,
        border = "black",
        main = "Vertical bar plot of Age distribution")

# Horizontal bar plot of age
col <- rainbow(6)
barplot(Freq,
        names.arg = Age,
        xlab = "Frequency", # with horiz = TRUE the bars extend along the x-axis
        ylab = "Age",
        col = col,
        border = "black",
        horiz = TRUE,
        main = "Horizontal bar plot of Age distribution")

# pie plot
Age <- as.character(Age) # Turn number to character
age <- data.frame(Age, Freq)
col <- rainbow(6)
pie(age$Freq,
    labels = age$Freq,
    radius = 0.9,
    main = "Pie plot of Age distribution",
    col = col)
legend("topright", age$Age, cex = 0.8, fill = col)

10.2.3 Histogram

If the weights are appropriately grouped by value and the frequency of each group is counted, we obtain the distribution of the variable over the different ranges.

Weight  Frequency  Percent  Cumulative Frequency  Cumulative Percent
24~32           2       5%                     2                  5%
32~40           7    17.5%                     9               22.5%
40~48          12      30%                    21               52.5%
48~56          12      30%                    33               82.5%
56~64           4      10%                    37               92.5%
64~72           2       5%                    39               97.5%
72~80           1     2.5%                    40                100%

For an interval variable, the most commonly used plot of the grouped frequencies is the histogram:

  • The width of each column in the histogram is the width of the corresponding group;

  • The area of each column in the histogram is the proportion of observations falling into the group, so the height of the column is this proportion divided by the width.
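
The column heights can be worked out directly from the frequency table above. A minimal sketch; the variable names are chosen here for illustration.

```r
breaks <- c(24, 32, 40, 48, 56, 64, 72, 80)
counts <- c(2, 7, 12, 12, 4, 2, 1)     # frequencies from the table above

width      <- diff(breaks)              # group widths (all equal to 8 here)
proportion <- counts / sum(counts)      # area of each column
height     <- proportion / width        # density = proportion / width

data.frame(lower = head(breaks, -1), upper = tail(breaks, -1),
           proportion, height)
total_area <- sum(height * width)       # the areas add up to 1
total_area
```

For the group 40~48, for instance, the proportion is 12/40 = 0.3 and the height is 0.3/8 = 0.0375. These are the values that hist() draws on the Density axis when freq = FALSE.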

# Histogram of weight (bins of width 8, as in the frequency table)
break1 <- c(24,32,40,48,56,64,72,80)
col <- rainbow(length(break1))
hist(student$Weight,
     breaks = break1,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

# Histogram of weight (bins of width 5)
break2 <- seq(20,80,5)
col <- rainbow(length(break2))
hist(student$Weight,
     breaks = break2,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

# Histogram of weight (bins of width 10)
break3 <- seq(20,80,10)
col <- rainbow(length(break3))
hist(student$Weight,
     breaks = break3,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

# Histogram of weight (bins of width 2.5)
break4 <- seq(29,79,2.5)
col <- rainbow(length(break4))
hist(student$Weight,
     breaks = break4,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

10.3 Moments Type

10.3.1 Summary of common statistics

  • Classified by how they are constructed:

    • Based on moments of the observations: mean, variance, etc.;

    • Based on order statistics of the observations: median, range, quantiles, etc.

  • Classified by the characteristic they describe:

    • The center of the distribution: mean, median, etc.;

    • The degree of dispersion: variance, range, etc.;

    • Other statistics describing the distribution and its shape.

10.3.2 Mean

Given a set of observations \(x_1,x_2, \cdots, x_n\), where \(n\) is the sample size, the mean is \[ \bar x = \frac{1}{n}\sum_{i=1}^{n}x_i, \] which describes the central position of this set of observations.

10.3.3 Variance and Standard Deviation

The variance comes in two forms, \[ s^{*2} = s^{*2}(x) = s^{*2}_n(x)= \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2, \] \[ s^{2} = s^{2}(x) = s^{2}_n(x)= \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2, \] both of which are commonly used to describe the dispersion of the observations. \(s^{2}_n(x)\) is the biased variance and \(s^{*2}_n(x)\) is the unbiased variance.

Moreover:

  • Standard deviation: \(s^{*}_n(x) = \sqrt{s^{*2}_n(x)} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2}\);

  • Standard error of mean: \(\frac{s}{\sqrt{n}}\);

  • Coefficient of variation: \(cv = \frac{s}{\bar x} \cdot 100\%\).
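
The two forms of the variance can be compared numerically. In this minimal sketch the vector x is a hypothetical sample, and the standard error and coefficient of variation are computed from sd(), which is an assumption about which form of s is intended.

```r
x <- c(2, 4, 4, 5, 7, 8)   # small illustrative sample (hypothetical values)
n <- length(x)

var_unbiased <- sum((x - mean(x))^2) / (n - 1)   # divisor n - 1, what var() returns
var_biased   <- sum((x - mean(x))^2) / n         # divisor n

c(var(x), var_unbiased, var_biased)
c(sd(x), sqrt(var_unbiased))   # sd() is the square root of var()
sd(x) / sqrt(n)                # standard error of the mean, computed from sd()
100 * sd(x) / mean(x)          # coefficient of variation in percent, computed from sd()
```

The biased form can always be recovered from var() as var(x) * (n - 1) / n.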

If \(y_i = a x_i +b\), \(1\leq i \leq n\), then \[ \bar y = a \bar x + b, \quad s^{*2}(y) = a^2 s^{*2}(x), \quad s^{*}(y) = |a|s^{*}(x). \] For \(x = (x_1,x_2, \cdots, x_n)\), the transformation \[ y_i = \frac{x_i-\bar x}{s^{*}(x)} \] is called the standardization of \(x\), and \(y_i\) is also called the standard score. After standardization, the mean is 0 and the variance is 1.
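
A quick numerical check of how the mean, variance and standard deviation behave under a linear transformation, and of the effect of standardization. The sample x and the constants a and b are arbitrary illustrative choices.

```r
x <- c(2, 4, 4, 5, 7, 8)    # hypothetical values
a <- -3                     # negative on purpose, to show the role of |a|
b <- 10
y <- a * x + b

all.equal(mean(y), a * mean(x) + b)   # the mean is scaled and shifted
all.equal(var(y), a^2 * var(x))       # the variance is scaled by a^2
all.equal(sd(y), abs(a) * sd(x))      # the standard deviation by |a|

z <- (x - mean(x)) / sd(x)            # standardization
c(mean(z), var(z))                    # 0 and 1, up to rounding error
```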

10.3.4 Coefficient of skewness and Kurtosis

Another two statistics related to moments are:

  • Coefficient of skewness or skewness: \[ g_1 = \frac{1}{ns^3}\sum_{i=1}^n(x_i-\bar x)^3, \] more precisely, \[ g_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^n \left( \frac{x_i-\bar x}{s} \right)^3. \]
  • Coefficient of kurtosis or kurtosis: \[ g_2 = \frac{1}{ns^4}\sum_{i=1}^n(x_i-\bar x)^4-3, \] more precisely, \[ g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^n \left( \frac{x_i-\bar x}{s} \right)^4-3\frac{(n-1)^2}{(n-2)(n-3)}. \]

Notice that:

  • The skewness describes the symmetry of the data about its center, the mean;

  • A skewness close to 0 indicates that the distribution is roughly symmetric about its mean;

  • A positive (negative) skewness indicates that the data have a longer tail on the right (left);

  • The kurtosis describes the tails of the data distribution relative to the normal distribution;

  • A kurtosis close to 0 indicates that the shape of the distribution is close to normal;

  • A positive (negative) kurtosis indicates that the distribution has thicker (thinner) tails on both sides than the normal distribution.
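
The first (moment) form of \(g_1\) and \(g_2\) can be computed in base R without extra packages. This is a minimal sketch: the helper central_moment() is a hypothetical name introduced here, \(s\) is taken as the biased standard deviation, and the data are the Weight column of the student table.

```r
# Weight column of the student table (40 students)
weight <- c(78.1, 51.3, 50.8, 58.1, 60.8, 58.1, 65.8, 48.1, 47.7, 54.0,
            64.5, 44.5, 50.4, 50.8, 41.8, 50.8, 44.9, 44.9, 42.2, 38.1,
            47.7, 41.8, 52.7, 47.2, 38.6, 41.3, 36.8, 55.8, 48.6, 58.1,
            52.2, 38.1, 50.8, 35.9, 43.1, 43.1, 30.4, 33.6, 29.1, 35.9)

# k-th central moment with divisor n
central_moment <- function(x, k) mean((x - mean(x))^k)

g1 <- central_moment(weight, 3) / central_moment(weight, 2)^(3 / 2)  # skewness
g2 <- central_moment(weight, 4) / central_moment(weight, 2)^2 - 3    # kurtosis
c(skewness = g1, kurtosis = g2)
```

Both values are positive for these data, which is consistent with a right tail stretched by the heaviest student.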

10.3.5 Calculation with R

We use the Weight data to calculate the above statistics with R:

weight <- student$Weight
# mean
mean(weight)
[1] 47.6625
# variance (unbiased, divisor n - 1)
var(weight)
[1] 101.4886
# standard deviation
sd(weight)
[1] 10.07415
# coefficient of variation
sd(weight)/mean(weight)
[1] 0.2113643
library(moments) # skewness and kurtosis
library(plotrix) # standard error of mean
# standard error of mean
std.error(weight)
[1] 1.592863
# skewness
skewness(weight)
[1] 0.5939948
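# kurtosis: moments::kurtosis() returns the fourth standardized moment without
# subtracting 3, so g_2 as defined above is kurtosis(weight) - 3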
kurtosis(weight)
[1] 3.697771

10.4 Order Statistic Type

10.4.1 Order Statistic

Let \(x_1,x_2, \cdots, x_n\) be a set of observations. Arranging them from smallest to largest gives \[ x_{1,n} \leq x_{2,n} \leq \cdots \leq x_{n,n}, \] that is, \[ \begin{cases} x_{(1)} = x_{1,n} = \min\{ x_1, x_2, \cdots, x_n\} \\ x_{(2)} = x_{2,n} = \mathop{\min}\limits_{1\leq i < j \leq n}\max\{ x_i, x_j\}=\mathop{\max}\limits_{1\leq i_1 < \cdots < i_{n-1} \leq n}\min\{ x_{i_1}, \cdots, x_{i_{n-1}}\}\\ \cdots \cdots \\ x_{(n)} = x_{n,n} = \max\{ x_1, x_2, \cdots, x_n\} \end{cases} \] \((x_{(1)}, x_{(2)}, \cdots, x_{(n)})\) are called the order statistics of these observations.

10.4.2 Median

For a set of observations \(x_1,x_2, \cdots, x_n\), the median is defined as \[ m= \begin{cases} x_{\frac{n+1}{2},n}, ~n~odd, \\ \frac{1}{2}\left(x_{\frac{n}{2},n}+x_{\frac{n}{2}+1,n}\right), ~n~even, \end{cases} \] which also describes the center of the distribution: roughly half of the observations lie above the median and roughly half below it. It is easy to calculate and is not affected by extreme values (robustness).

10.4.3 Range

For observations \(x_1,x_2, \cdots, x_n\), the range is defined as \[ r = x_{n,n}-x_{1,n} = \mathop{\max}\limits_{1\leq i \leq n} x_i - \mathop{\min}\limits_{1\leq i \leq n} x_i, \] which is often used to describe the dispersion.
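
The order statistics, median and range can be computed step by step from the sorted sample. A minimal sketch with a hypothetical sample of even size:

```r
x <- c(7, 2, 9, 4, 4, 1)          # hypothetical observations, n = 6 (even)
n <- length(x)
x_sorted <- sort(x)               # order statistics from smallest to largest

x_sorted[2]                       # second order statistic

# Median by definition: middle value (n odd) or mean of the two middle values (n even)
if (n %% 2 == 1) {
  m <- x_sorted[(n + 1) / 2]
} else {
  m <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
}
c(m, median(x))                   # the definition and the built-in function agree

r <- x_sorted[n] - x_sorted[1]    # range as max - min
c(r, diff(range(x)))              # range() returns c(min, max), so diff() gives r
```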

10.4.4 Empirical Distribution

For observations \(x_1,x_2, \cdots, x_n\), the discrete distribution \[ \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \end{pmatrix} \] is called the empirical distribution. Its distribution function, the empirical distribution function, is \[ \widehat F_n(x)= \begin{cases} 0,~ x < x_{1,n}\\ \frac{k}{n}, ~x_{k,n} \leq x < x_{k+1,n},~ 1\leq k \leq n-1\\ 1,~x\ge x_{n,n} \end{cases} =\frac{1}{n}\sum_{i=1}^n I_{\{ x \ge x_i \}}, \] where \(I_{\{ x \ge x_i \}} = I_{[x_i, \infty)}(x)\) is the indicator function of the event \((x \ge x_i)\), that is, \[ I_{\{ x \ge x_i \}} = I_{[x_i, \infty)}(x)= \begin{cases} 1,~x\ge x_i \\ 0,~ \text{otherwise}. \end{cases} \] Notice that:

  • The mean and the (biased) variance are the mean and variance of the empirical distribution, respectively;

  • All moment-type statistics are built from the moments of the empirical distribution.

The following plot shows the empirical distribution of the Weight data set.

The empirical distribution completely describes the distribution of the observations and is also an important tool in statistical inference and large-sample theory.
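
In R the empirical distribution function is available through ecdf(). The sketch below uses a small hypothetical sample rather than the Weight data, so it is not the code behind the plot mentioned above.

```r
x  <- c(3, 1, 4, 1, 5)            # hypothetical observations
Fn <- ecdf(x)                     # empirical distribution function (a step function)

Fn(c(0, 1, 3.5, 5))               # proportion of observations <= each argument
mean(x <= 3.5)                    # the same value from the indicator form

# Mean and biased variance are the moments of the empirical distribution
n <- length(x)
emp_mean <- sum(x * (1 / n))
emp_var  <- sum((x - emp_mean)^2 * (1 / n))
c(emp_mean, emp_var)
```

Calling plot(Fn) draws the step function, with a jump of height 1/n at each observation and a larger jump where values are tied.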

10.4.5 Quantile

For \(p \in (0,1)\), the \(p\)-quantile of a distribution \(F\) is a value \(\zeta_p\) satisfying \(F(\zeta_p)=p\). For the empirical distribution, a solution of \(\widehat F_n(x) = p\) may not exist, or may not be unique if it exists, because \(\widehat F_n\) is a step function and is neither strictly increasing nor continuous.

For observations \(x_1,x_2, \cdots, x_n\), the sample \(p\)-quantile is therefore chosen as a value such that

  • the number of observations less than it is approximately \(np\);

  • the number of observations larger than it is approximately \(n(1-p)\);

  • it is close to \(x_{[np],n}\) and to the \(p\)-quantile of the empirical distribution \(\widehat F_n\).

In practice, there are several ways to calculate the \(p\)-quantile, but their values are close when the sample size is large. By default (type = 7), R's quantile() uses \[ z_p = (1-g)x_{j,n} + gx_{j+1,n}, \] where \(j = [(n-1)p]+1\) and \(g = (n-1)p-j+1\). (More methods are given in textbook 1.1.17.)
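
The interpolation formula can be written out by hand and compared with quantile(). In this sketch the function name quantile_by_formula is hypothetical, and the sample consists of seven weights taken from the student table.

```r
# Sample p-quantile by linear interpolation between two order statistics
quantile_by_formula <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  h  <- (n - 1) * p
  j  <- floor(h) + 1              # j = [(n - 1) p] + 1
  g  <- h - j + 1                 # g = (n - 1) p - j + 1
  if (j >= n) return(xs[n])       # p = 1: there is no next order statistic
  (1 - g) * xs[j] + g * xs[j + 1]
}

x <- c(41.3, 29.1, 52.7, 47.7, 38.1, 65.8, 50.8)   # seven weights from the student table
p <- 0.25
by_hand  <- quantile_by_formula(x, p)
built_in <- quantile(x, p, names = FALSE)          # default type = 7
c(by_hand, built_in)
```

Here \((n-1)p = 1.5\), so \(j = 2\) and \(g = 0.5\), and the quantile lies halfway between the second and third order statistics, 38.1 and 41.3.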

10.4.6 Quartile and Interquartile Range

Quartiles and percentiles are the most commonly used among the various quantiles.

  • The \(\frac{i}{4}\)-quantile is called the \(i\)-th quartile, denoted as \(q_i\);

    • The 0.25-quantile \(q_1\) is called the lower quartile;

    • The 0.75-quantile \(q_3\) is called the upper quartile;

  • The \(\frac{i}{100}\)-quantile is called the \(i\)-th percentile.

Moreover, the IQR (interquartile range), defined as \[ IQR = q_3-q_1, \] is also used to describe the dispersion of the distribution.

10.4.7 Box plot

Box plots are often used to display quantiles and thereby describe the distribution.

  • A box plot consists of a rectangular box with whiskers on both sides;

  • The upper (lower) side of the box is the upper (lower) quartile, so the height of the box is the interquartile range;

  • The horizontal line inside the box corresponds to the median;

  • Each whisker extends from the edge of the box to the farthest data point lying within 1.5 times the IQR of that edge;

  • The line at the end of the upper (lower) whisker therefore corresponds to the max (min) value when no observation lies farther out;

  • Extreme data lie outside this range and are drawn as individual points.
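
The numbers behind a box plot can be inspected with boxplot.stats(), which boxplot() uses internally. A minimal sketch on the Weight column of the student table:

```r
# Weight column of the student table (40 students)
weight <- c(78.1, 51.3, 50.8, 58.1, 60.8, 58.1, 65.8, 48.1, 47.7, 54.0,
            64.5, 44.5, 50.4, 50.8, 41.8, 50.8, 44.9, 44.9, 42.2, 38.1,
            47.7, 41.8, 52.7, 47.2, 38.6, 41.3, 36.8, 55.8, 48.6, 58.1,
            52.2, 38.1, 50.8, 35.9, 43.1, 43.1, 30.4, 33.6, 29.1, 35.9)

bs <- boxplot.stats(weight, coef = 1.5)   # coef = 1.5 is the default whisker length
bs$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$out     # observations beyond the whiskers, drawn as individual points
max(weight)
```

The upper whisker ends below the maximum because the heaviest student lies more than 1.5 times the box height above the box. Note that boxplot.stats() uses hinges, as computed by fivenum(), which can differ slightly from the quartiles returned by quantile().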

We again use the student data set to draw some box plots:

# Weight and Height
Weight <- student$Weight
Height <- student$Height
# Standardization
weight <- (Weight-mean(Weight))/sd(Weight)
height <- (Height-mean(Height))/sd(Height)
col <- rainbow(2)
boxplot(weight, 
        height,
        names = c("Weight","Height"),
        col = col,
        main = "Weight and Height for students")

# Weight for different gender
col <- rainbow(2)
boxplot(Weight~Gender,
        data = student,
        col = col,
        main = "Weight for different gender")

# Weight and Height for different gender
gender <- student$Gender
boxplot(weight~gender,
        col = "red",
        boxwex = 0.3, # set width of box
        at = 1:2+0.2, # set position of box on x-axis
        ylab = "Values",
        main = "Weight and Height for different gender")
boxplot(height~gender,
        col = "blue",
        boxwex = 0.3,
        at = 1:2 - 0.2,
        add = TRUE)
legend("topright", legend = c("Height","Weight"),col = c("blue","red"),pch = c(15,15))

10.4.8 Calculation with R

We again use the Weight data to calculate the above statistics with R:

weight <- student$Weight
# min
min(weight)
[1] 29.1
# max
max(weight)
[1] 78.1
# median
median(weight)
[1] 47.7
# range
max(weight) - min(weight)
[1] 49
# quantiles
prob1 <- seq(0,1,by= 0.2)
quantile(weight, prob1)
   0%   20%   40%   60%   80%  100% 
29.10 38.50 44.74 50.56 54.36 78.10 
# quartiles
prob2 <- c(0.25, 0.5, 0.75, 1)
quantile(weight, prob2)
   25%    50%    75%   100% 
41.675 47.700 52.325 78.100 
# interquartile range
IQR(weight)
[1] 10.65

10.5 Other Statistics

10.5.1 Mode

For a set of observations \(x_1,x_2,...,x_n\) whose distinct values and frequencies are \[ \begin{pmatrix} x_1 & x_2 & \cdots & x_k \\ n_1 & n_2 & \cdots & n_k \end{pmatrix} \] we call \(x_{j_0}\) the mode if \(n_{j_0} = \max\{n_1, n_2, \cdots, n_k\}\); it is the value that occurs most often among the observations.

10.5.2 Coefficient of Correlation

In statistical analysis, we often need to analyze the relationship between different variables. For continuous variables, the Pearson correlation coefficient is the most commonly used measure of the linear correlation between two variables.

For a set of observations \((x_1,y_1),(x_2,y_2),...,(x_n,y_n)\) of random variables \(X,Y\) with sample size \(n\), the Pearson correlation coefficient \(r\) is \[ r = r(x,y) = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2 \sum_{i=1}^n(y_i-\bar y)^2}}. \]

  • The value of the correlation coefficient lies in \([-1,1]\);

  • If the correlation coefficient is positive (negative), \(x\) and \(y\) are said to be positively (negatively) correlated;

  • The larger the absolute value of the correlation coefficient, the stronger the linear relationship between \(x\) and \(y\);

  • The correlation coefficient is unchanged by linear transformations that preserve direction: if \[ \tilde x_i = a x_i +b, \quad \tilde y_i = c y_i +d, \quad 1 \leq i \leq n, \] with \(ac > 0\), then \[ r(\tilde x, \tilde y) = r(x,y); \] if \(ac < 0\), only the sign changes.

The related sample covariance is \[ c(x,y) = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y). \]
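
The formula and the behavior under linear transformations can be checked numerically. This sketch uses the heights and weights of six students taken from the student table; the constants in the transformations are arbitrary.

```r
x <- c(172, 169, 167, 164, 162, 159)         # heights of six students
y <- c(78.1, 51.3, 50.8, 58.1, 65.8, 54.0)   # their weights

# Pearson correlation coefficient from its definition
r_formula <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(r_formula, cor(x, y))                      # the formula and cor() agree

cov(x, y) / (sd(x) * sd(y))                  # r is the covariance rescaled by both sd's

r_same    <- cor(2 * x + 1, 3 * y - 5)       # ac > 0: unchanged
r_flipped <- cor(-2 * x + 1, 3 * y - 5)      # ac < 0: the sign flips
c(r_same, r_flipped)
```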

10.5.3 Calculation with R

We again use the student data to calculate the above statistics with R:

weight <- student$Weight
height <- student$Height
age <- student$Age
# mode
freq <- table(age)
freq
age
12 13 14 15 16 17 
 8  7 12  7  3  3 
as.numeric(names(freq)[which.max(freq)])
[1] 14
# correlation coefficient
cor(weight, height, method = "pearson")
[1] 0.7084378
# covariance
cov(weight, height)
[1] 74.7609