Elements of Statistics

—An Introduction to R and RStudio

Yanxi Hou

School of Data Science, Fudan University

1 Introduction to R

1.1 Brief Introduction to R

R is a widely used open-source language and environment for statistical computing and graphics, distributed as part of the GNU project.
RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, and workspace management.

1.2 Example Demonstration

Let’s start with an example that uses the iris dataset for a preliminary statistical analysis.

The library() function loads the datasets package, which contains the built-in iris dataset.
The head() function shows the first six rows of the iris data.

library(datasets) # Load built-in data sets
head(iris)    # Show the first six lines of iris data

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

As we can see, the iris data set has five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. We can also use the View() function to inspect the whole iris data set.

The summary() function reports basic summary statistics for each variable:

summary(iris) # Summary statistics for iris data

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

We can also draw a scatterplot matrix of the iris dataset with the plot() function.

plot(iris)    # Scatterplot matrix for iris data

2 Variables in R

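2.1 Variables in R

Variables are named storage locations that your programs can manipulate.
A variable name can combine letters, digits, periods and underscores. It must start with a letter or a period; if it starts with a period, the next character cannot be a digit. Reserved words such as TRUE cannot be used as names.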

Valid variables	Invalid variables
total	tot@l
sum_data	5um
.count	_count
.count.total	TRUE
Var	.0ar
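One way to check the table above is to compare each candidate with the output of make.names(), which rewrites syntactically invalid names into valid ones. The helper is_valid_name is a hypothetical name introduced only for this sketch.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

candidates <- c("total", "sum_data", ".count", ".count.total", "Var",
                "tot@l", "5um", "_count", "TRUE", ".0ar")
data.frame(name = candidates, valid = is_valid_name(candidates))
```

The first five names come back as valid and the last five as invalid, matching the two columns of the table.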

Here are some variables with valid names.

# character string
hello_string <- 'hello'
hello_string

[1] "hello"

animal <- 'tiger'
animal

[1] "tiger"

# numeric variable
.pairs <- 100
.pairs

[1] 100

first_name <- 1.23
first_name

[1] 1.23

# complex variable
comp <- 1 +2i
comp

[1] 1+2i

# vector
vector0 <- c(1,2,3,7,6)
vector0

[1] 1 2 3 7 6

vector_1 <-  1:5
vector_1

[1] 1 2 3 4 5

# string vector
vector2 <- c('red','yellow','green')
vector2

[1] "red"    "yellow" "green"

Here are some invalid variable names. Each of the following lines raises a syntax error:

2pairs = 100
.2pairs = 100
first num = 100

3 Data Types in R

3.1 Data Types in R

We can use the class() function to obtain the class of an object, such as numeric, character, complex or integer.
Similarly, we can use the typeof() function to return the internal storage type of an object.

Data Types	Examples
Logical	TRUE,FALSE
Numeric	10.5,7,845
Integer	3L,40L,4L
Complex	3+2i
Character	‘a’,‘hello’,‘13.5’
Raw	‘Hello’ is stored as 48 65 6c 6c 6f

For example:

x <-  100
typeof(x)

[1] "double"

class(x)

[1] "numeric"

y <- 100L
typeof(y)

[1] "integer"

class(y)

[1] "integer"

a <- TRUE
typeof(a)

[1] "logical"

class(a)

[1] "logical"

stringhello <- 'hello world'
typeof(stringhello)

[1] "character"

class(stringhello)

[1] "character"
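In the examples above class() and typeof() mostly agree. A small sketch with a factor and a data frame shows where they differ: class() describes how an object behaves, while typeof() describes how it is stored. The object names f and d are arbitrary.

```r
# A factor behaves as categorical data but is stored as integer codes
f <- factor(c("low", "high", "low"))
class(f)    # "factor"
typeof(f)   # "integer"

# A data frame behaves as a table but is stored as a list of columns
d <- data.frame(a = 1:2)
class(d)    # "data.frame"
typeof(d)   # "list"
```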

3.2 Arithmetic operators

R supports the usual arithmetic operators and mathematical functions, such as:

+ — addition
- — subtraction
* — multiplication
/ — division
%% — remainder (modulo)
%/% — integer division
abs() — absolute value

^ — exponentiation
exp() — exponential function
sqrt() — square root
log() — natural logarithm
factorial() — factorial
sin(),cos(),tan() — trigonometric functions
choose() — binomial coefficient

For example:

Addition, subtraction, multiplication, division and exponentiation (** is accepted as a synonym for ^):

2+3*2;2^3;exp(2);2**3;(56-14)/6-4*7*10/(5^2-5);

[1] 8

[1] 8

[1] 7.389056

[1] 8

[1] -7

Remainder and integer division:

7%%2;7%/%2;9.5 %% (-2.7);9.5 %/% (-2.7)

[1] 1

[1] 3

[1] -1.3

[1] -4
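The results for negative divisors follow from how the two operators are linked: %/% rounds the quotient down, and the remainder takes the sign of the divisor. The sketch below checks the identity x = (x %/% y) * y + x %% y for the values used above, using all.equal() because the operands are floating-point numbers.

```r
x <- 9.5
y <- -2.7
q <- x %/% y   # -4: the quotient 9.5 / -2.7 = -3.52 rounded down
r <- x %% y    # -1.3: the remainder has the sign of the divisor y
isTRUE(all.equal(q * y + r, x))   # TRUE
```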

Square root, absolute value, trigonometric functions and logarithm:

sqrt(2);sqrt(8);abs(2-4);cos(4*pi);log(0)

[1] 1.414214

[1] 2.828427

[1] 2

[1] 1

[1] -Inf

Factorial and binomial coefficient:

factorial(6);choose(5,2)

[1] 720

[1] 10

3.3 Logical operators

The comparison and logical operators mainly include:

> — greater than
< — less than
>= — greater than or equal to
<= — less than or equal to
== — equal to
!= — not equal to

& is the AND operation, which returns TRUE only if both conditions are TRUE.
| is the OR operation, which returns TRUE if at least one of the conditions is TRUE.
! is the NOT operation, which flips each element of a logical vector to the opposite value.

For example:

# numeric
x <- 100;y <- 200
y > x  # greater than

[1] TRUE

y < x  # less than

[1] FALSE

y >= x # greater than/equal

[1] TRUE

y <= x # less than/equal

[1] FALSE

x == y # equal to

[1] FALSE

x != y # not equal

[1] TRUE

# vector
x <- 1:5;y <- 2:6
y > x  # greater than

[1] TRUE TRUE TRUE TRUE TRUE

y < x  # less than

[1] FALSE FALSE FALSE FALSE FALSE

y >= x # greater than/equal

[1] TRUE TRUE TRUE TRUE TRUE

y <= x # less than/equal

[1] FALSE FALSE FALSE FALSE FALSE

x == y # equal to

[1] FALSE FALSE FALSE FALSE FALSE

x != y # not equal

[1] TRUE TRUE TRUE TRUE TRUE

10 > 20 & 10 < 20

[1] FALSE

20>20 & 10<20

[1] FALSE

20>=20 & 10<20

[1] TRUE

20>=20 & 10<20 & 20 <30

[1] TRUE

10 > 20 | 10 < 20

[1] TRUE

20>=20|10<20

[1] TRUE

10 > 20 | 10 < 2

[1] FALSE

10 > 20 | 10 < 2 | 10 > 15

[1] FALSE

!10==10

[1] FALSE

!(10==3)

[1] TRUE
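Two related points are worth a short sketch. First, & and | work element by element on vectors, while && and || expect single values and are the usual choice inside if conditions. Second, comparing floating-point numbers with == can give surprising results, and all.equal() is a common alternative. The variable names are arbitrary.

```r
# & and | work element by element on vectors
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b          # TRUE FALSE FALSE
a | b          # TRUE TRUE TRUE

# && and || expect single values and stop as soon as the result is known
x <- 5
(x > 0) && (x < 10)   # TRUE

# Comparing doubles with == can surprise; all.equal() allows a small tolerance
0.1 + 0.2 == 0.3                   # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```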

These logical operations make data processing very convenient. For instance, suppose we want to extract the following three subsets from the student data set (Table 1.1-2 in the textbook):

Male students with height greater than or equal to 169;
Female students younger than 14;
Students aged 16 or older, or taller than 168.

We can perform the following operations:

Load Data
Data selection
Data

student <- read.csv("./students.csv")
head(student)

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1

# Male students with height greater than/equal 169
x <- student[student$Gender == "M" & student$Height >= 169,]
x

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3

# Female students with age less than 14
y <- student[student$Gender == "F" & student$Age < 14,]
y

      Name Age Height Gender Weight
7   JACLYN  12    162      F   65.8
28  LOUISE  12    149      F   55.8
29   ALICE  13    149      F   48.6
33 BARBARA  13    147      F   50.8
35   KATIE  12    145      F   43.1
37   SUSAN  13    137      F   30.4
38    JANE  12    135      F   33.6
39  LILLIE  12    127      F   29.1

# Students with age greater than/equal 16 or height greater than 168
z <- student[student$Age >= 16 | student$Height > 168,]
z

       Name Age Height Gender Weight
1  LAWRENCE  17    172      M   78.1
2   JEFFERY  14    169      M   51.3
4   PHILLIP  16    167      M   58.1
5      KIRK  17    167      M   60.8
14   MARTHA  16    159      F   50.8
23    LINDA  17    152      F   52.7
31   MARIAN  16    147      F   52.2
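The same selections can also be written with the base R function subset(), which lets the condition refer to columns by name. The sketch below is an alternative, not the method used above; to stay runnable without students.csv it retypes the first six rows of the data by hand under the assumed name student_small.

```r
# First six rows of the student data, typed in by hand for this sketch
student_small <- data.frame(
  Name   = c("LAWRENCE", "JEFFERY", "EDWARD", "PHILLIP", "KIRK", "ROBERT"),
  Age    = c(17, 14, 14, 16, 17, 15),
  Height = c(172, 169, 167, 167, 167, 164),
  Gender = c("M", "M", "M", "M", "M", "M"),
  Weight = c(78.1, 51.3, 50.8, 58.1, 60.8, 58.1)
)

# Male students with height greater than or equal to 169
tall_males <- subset(student_small, Gender == "M" & Height >= 169)
tall_males
```

One difference to keep in mind: subset() drops rows where the condition is NA, whereas bracket indexing returns them as rows of NA.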

Name	Age	Height	Gender	Weight
LAWRENCE	17	172	M	78.1
JEFFERY	14	169	M	51.3
EDWARD	14	167	M	50.8
PHILLIP	16	167	M	58.1
KIRK	17	167	M	60.8
ROBERT	15	164	M	58.1
JACLYN	12	162	F	65.8
DANNY	15	162	M	48.1
CLAY	15	162	M	47.7
HENRY	14	159	M	54.0
LESLIE	14	159	F	64.5
JOHN	13	159	M	44.5
WILLIAM	15	159	M	50.4
MARTHA	16	159	F	50.8
LEWIS	14	157	M	41.8
AMY	15	157	F	50.8
ALFRED	14	157	M	44.9
CHRIS	14	157	M	44.9
FREDRICK	14	154	M	42.2
CAROL	14	154	F	38.1
JOE	13	154	M	47.7
MARY	15	152	F	41.8
LINDA	17	152	F	52.7
MARK	15	152	M	47.2
PATTY	14	152	F	38.6
ELIZABET	14	152	F	41.3
JUDY	14	149	F	36.8
LOUISE	12	149	F	55.8
ALICE	13	149	F	48.6
JAMES	12	149	M	58.1
MARIAN	16	147	F	52.2
TIM	12	147	M	38.1
BARBARA	13	147	F	50.8
DAVID	13	145	M	35.9
KATIE	12	145	F	43.1
MICHAEL	13	142	M	43.1
SUSAN	13	137	F	30.4
JANE	12	135	F	33.6
LILLIE	12	127	F	29.1
ROBERT	12	125	M	35.9

4 Print Formatting

4.1 Print Formatting

R uses the print() function to display the value of a variable.

For example:

x <- 10
print(x)

[1] 10

y = 'Hello world'
print(y)

[1] "Hello world"

The paste() and paste0() functions concatenate strings and variables for printing. paste() separates the pieces with a space by default (or with the string given in sep), while paste0() uses no separator.

For example:

print(paste('hello','world'))

[1] "hello world"

print(paste("hello", "world", sep = "-"))

[1] "hello-world"

paste("hello","world", sep = ",")

[1] "hello,world"

paste(1:12, c("st", "nd", "rd", rep("th", 9)))

 [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
[10] "10 th" "11 th" "12 th"

paste('welcome','to','R')

[1] "welcome to R"

print(paste0('hello','world'))

[1] "helloworld"

print(paste0(1,'+',1,'=',2))

[1] "1+1=2"

paste0(1:12)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

paste0(1:12, c("st", "nd", "rd", rep("th", 9)))

 [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
[11] "11th" "12th"

paste0('welcome','to','R')

[1] "welcometoR"
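Two further formatting tools are often used alongside the calls above: the collapse argument of paste(), which joins the elements of one vector into a single string, and sprintf(), which fills values into a template. The strings and numbers below are made up for illustration.

```r
# collapse joins the elements of one vector into a single string
joined <- paste(c("a", "b", "c"), collapse = "+")
joined      # "a+b+c"

# sprintf() fills a template: %d is an integer, %.2f a number with 2 decimals
msg <- sprintf("Student %d has average %.2f", 3L, 87.456)
msg         # "Student 3 has average 87.46"
```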

5 R Objects

5.1 Vectors

A vector is a sequence of data elements of the same basic type. We usually use the following ways to create a vector:

c() — define a vector;
vector() — initialize vector;
seq() — generate regular sequences;
rep() — replicate elements of vectors.

For example: c()

x <- c(0.5,0.6)##numeric
x

[1] 0.5 0.6

x <- c(TRUE,FALSE)##logical
x

[1]  TRUE FALSE

x <- c(T,F)##logical
x

[1]  TRUE FALSE

x <- c('a','b','c')##character
x

[1] "a" "b" "c"

x <- c('red','green','yellow')##character
x

[1] "red"    "green"  "yellow"

x <- c(1+0i,2+4i)##complex
x

[1] 1+0i 2+4i

x <- c(1,2,3,4,5)##numeric (stored as double; use 1:5 or c(1L,2L) for integers)
x

[1] 1 2 3 4 5

x <- 9:20##integer
x

 [1]  9 10 11 12 13 14 15 16 17 18 19 20

For example: vector()

x <- vector('numeric',length = 10)
x

 [1] 0 0 0 0 0 0 0 0 0 0

x <- vector('logical',length = 10)
x

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

For example: seq()

x <- seq(1,10, by = 1)
x

 [1]  1  2  3  4  5  6  7  8  9 10

x <- seq(1,10, length.out = 10)
x

 [1]  1  2  3  4  5  6  7  8  9 10

For example: rep()

x <- rep(0, length.out = 10)
x

 [1] 0 0 0 0 0 0 0 0 0 0

x <- rep(NA, length.out = 10)
x

 [1] NA NA NA NA NA NA NA NA NA NA

class() — Identify what class this vector belongs to.

For example:

# Integer vector
num <- 1:10
num

 [1]  1  2  3  4  5  6  7  8  9 10

class(num)

[1] "integer"

# Numeric vector: adding the non-integer value 10.5 makes the whole vector numeric
num <- c(1:10,10.5)
num

 [1]  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 10.5

class(num)

[1] "numeric"

# Character vector
ltrs <- letters[1:10]
ltrs

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

class(ltrs)

[1] "character"

# Factor vector
fac <- as.factor(ltrs)
fac

 [1] a b c d e f g h i j
Levels: a b c d e f g h i j

class(fac)

[1] "factor"

If we combine two vectors of different types, all elements are coerced to a single common type. For example:

#Create a vector of number
numbers <- c(1,2,3,4,5,6)
class(numbers)

[1] "numeric"

#Create a vector of letters
ltrs <- c('a','b','c','d')
class(ltrs)

[1] "character"

# Concatenate the two vectors above
mixed_vec <- c(numbers,ltrs)
# vector mixed_vec has coerced the numbers to character
print(mixed_vec)

 [1] "1" "2" "3" "4" "5" "6" "a" "b" "c" "d"

class(mixed_vec)

[1] "character"

As we can see, the numeric values have been converted to character strings.
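The coercion follows a fixed order: logical, then integer, then double, then complex, then character. Combining two types always yields the later one. A minimal sketch, with arbitrary variable names:

```r
t1 <- typeof(c(TRUE, 2L))     # "integer":   TRUE becomes 1L
t2 <- typeof(c(2L, 3.5))      # "double":    2L becomes 2
t3 <- typeof(c(3.5, 1+2i))    # "complex"
t4 <- typeof(c(3.5, "a"))     # "character": 3.5 becomes "3.5"
c(t1, t2, t3, t4)
```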

Basic operations on vectors are applied element by element.

For example:

x <- c(1,2,3,4)
y <- c(5,6,2,1)
# addition
x+y

[1] 6 8 5 5

# subtraction
y-x

[1]  4  4 -1 -3

# multiplication
x*y

[1]  5 12  6  4

# division
y/x

[1] 5.0000000 3.0000000 0.6666667 0.2500000

# exponent
y^x

[1]  5 36  8  1

# sum
sum(x)

[1] 10

# Cumulative sums
cumsum(x)

[1]  1  3  6 10

# mean
mean(x)

[1] 2.5

# variance
var(x)

[1] 1.666667

# standard deviation
sd(x)

[1] 1.290994
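Note that var() and sd() compute the sample versions, with divisor n - 1 rather than n. The following sketch recomputes both by hand for the same x; the names var_manual and sd_manual are introduced only here.

```r
x <- c(1, 2, 3, 4)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)
sd_manual  <- sqrt(var_manual)

c(var_manual, var(x))   # both 1.666667
c(sd_manual, sd(x))     # both 1.290994
```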

Objects can be explicitly coerced from one class to another using the as.*() family of functions, such as as.numeric(), as.logical() and as.character().

x <- 0:6
class(x)

[1] "integer"

as.numeric(x)

[1] 0 1 2 3 4 5 6

as.logical(x)

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

as.character(x)

[1] "0" "1" "2" "3" "4" "5" "6"

If R cannot figure out how to coerce a value, it produces NA instead, often together with a warning.

x <- c('a','b','c')
class(x)

[1] "character"

as.numeric(x)

[1] NA NA NA

as.logical(x)

[1] NA NA NA

as.complex(x)

[1] NA NA NA

R objects can have attributes, which are metadata attached to the object. Examples of attributes are names, dimnames, dimensions (for matrices and arrays), class (e.g. integer, numeric), length, and other user-defined attributes or metadata.

Not all R objects have attributes; in that case the attributes() function returns NULL.

x <- c(1,2,3,5)
attributes(x)

NULL

y <- 1
attributes(y)

NULL
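For contrast, here is a sketch of objects that do carry attributes: a named vector, a matrix, and a vector with a user-defined attribute. The attribute name "unit" is an arbitrary choice for illustration.

```r
v <- c(a = 1, b = 2, c = 3)   # a named vector
attributes(v)                 # $names: "a" "b" "c"

m <- matrix(1:6, nrow = 2)
attributes(m)                 # $dim: 2 3

attr(v, "unit") <- "cm"       # attach a user-defined attribute
attr(v, "unit")               # "cm"
```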

5.2 Matrix

matrix() — creates a matrix;
- nrow and ncol set the dimensions of the matrix;
- byrow = T fills the matrix row by row, while byrow = F (the default) fills it column by column.

For example:

Create a matrix:

m <- matrix(1:9, nrow=3, ncol = 3, byrow = T)
m

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

matrix(1:6, nrow = 2)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

matrix(1:6, nrow = 2,byrow = T)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

# The elements of a matrix need not be numeric; here the six letters are recycled to fill 12 cells.
m1 <- matrix(LETTERS[1:6],nrow = 4,ncol = 3)
m1

     [,1] [,2] [,3]
[1,] "A"  "E"  "C" 
[2,] "B"  "F"  "D" 
[3,] "C"  "A"  "E" 
[4,] "D"  "B"  "F"

m2 <- matrix(c("复","旦","大","学"),nrow = 4,ncol = 4)
m2

     [,1] [,2] [,3] [,4]
[1,] "复" "复" "复" "复"
[2,] "旦" "旦" "旦" "旦"
[3,] "大" "大" "大" "大"
[4,] "学" "学" "学" "学"

# We can name rows and cols by 'rownames' and 'colnames'.
n = matrix(1:6,byrow = T,nrow = 2)
rownames(n) = c('row1','row2')
n

     [,1] [,2] [,3]
row1    1    2    3
row2    4    5    6

colnames(n) = c('col1','col2','col3')
n

     col1 col2 col3
row1    1    2    3
row2    4    5    6

cbind() joins two or more vectors or matrices column-wise to form a new matrix;
rbind() joins two or more vectors or matrices row-wise to form a new matrix.

cbind(1:3,1:3)

     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3

rbind(1:3,1:3)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3

rbind(n,7:9)

     col1 col2 col3
row1    1    2    3
row2    4    5    6
        7    8    9

cbind(n,c(10,11))

     col1 col2 col3   
row1    1    2    3 10
row2    4    5    6 11

Matrix mathematical operations.

A <- matrix(c(10,8,5,12), nrow = 2, byrow = TRUE)
A

     [,1] [,2]
[1,]   10    8
[2,]    5   12

B <- matrix(c(5,3,15,16), nrow = 2, byrow = TRUE)
B

     [,1] [,2]
[1,]    5    3
[2,]   15   16

# Dimension
dim(A)

[1] 2 2

# Addition
A+B

     [,1] [,2]
[1,]   15   11
[2,]   20   28

# Subtraction
A-B

     [,1] [,2]
[1,]    5    5
[2,]  -10   -4

# Element-wise product (not the matrix product)
A*B

     [,1] [,2]
[1,]   50   24
[2,]   75  192

# Inverse matrix
solve(A)

        [,1]   [,2]
[1,]  0.1500 -0.100
[2,] -0.0625  0.125

# Diagonal element
diag(A)

[1] 10 12

x <- matrix(1:6, nrow = 2)
x

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

y <- matrix(10:15, nrow = 3)
y

     [,1] [,2]
[1,]   10   13
[2,]   11   14
[3,]   12   15

# Transpose
t(x)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

# Matrix multiplication
x %*% y

     [,1] [,2]
[1,]  103  130
[2,]  136  172

z <- y %*% x
z

     [,1] [,2] [,3]
[1,]   36   82  128
[2,]   39   89  139
[3,]   42   96  150

# Determinant
det(z)

[1] 0

# Eigenvalues and eigenvectors
eigen(z)

eigen() decomposition
$values
[1] 2.748690e+02 1.309715e-01 4.458832e-15

$vectors
           [,1]       [,2]       [,3]
[1,] -0.5308239 -0.8834752  0.4082483
[2,] -0.5761623 -0.2408149 -0.8164966
[3,] -0.6215006  0.4018454  0.4082483
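The results above can be checked numerically. The sketch below rebuilds the same matrices and verifies three things: A times its inverse is the identity, z has rank 2 (which is why its determinant is 0), and the first eigenpair satisfies z v = λ v. Comparisons use all.equal() because the results are floating-point numbers.

```r
A <- matrix(c(10, 8, 5, 12), nrow = 2, byrow = TRUE)
# A times its inverse should be the identity matrix, up to rounding error
inverse_ok <- isTRUE(all.equal(A %*% solve(A), diag(2)))

x <- matrix(1:6, nrow = 2)
y <- matrix(10:15, nrow = 3)
z <- y %*% x
# z is the product of a 3x2 and a 2x3 matrix, so its rank is at most 2
z_rank <- qr(z)$rank

# Each eigenpair satisfies z %*% v = lambda * v
e <- eigen(z)
eigen_ok <- isTRUE(all.equal(as.vector(z %*% e$vectors[, 1]),
                             e$values[1] * e$vectors[, 1]))
c(inverse_ok, eigen_ok)
z_rank
```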

The apply() function applies a function over the rows or columns of a matrix.
- MARGIN = 1 applies the function to each row;
- MARGIN = 2 applies the function to each column.

X <- matrix(1:30, nrow = 5)
X

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6   11   16   21   26
[2,]    2    7   12   17   22   27
[3,]    3    8   13   18   23   28
[4,]    4    9   14   19   24   29
[5,]    5   10   15   20   25   30

# Calculate the sum of each row
apply(X,1,sum)

[1]  81  87  93  99 105

# Calculate the length of each col
apply(X,2,length)

[1] 5 5 5 5 5 5

# Calculate the mean of each col
apply(X,2,mean)

[1]  3  8 13 18 23 28

# Calculate the sd of each row
apply(X,1,sd)

[1] 9.354143 9.354143 9.354143 9.354143 9.354143

# User-defined function: standard error of each row
apply(X,1, function(s) sd(s)/sqrt(length(s)))

[1] 3.818813 3.818813 3.818813 3.818813 3.818813
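For sums and means, base R also provides the dedicated functions rowSums(), colSums(), rowMeans() and colMeans(). A short sketch, rebuilding the same X, confirms that they agree with the corresponding apply() calls:

```r
X <- matrix(1:30, nrow = 5)

row_sums_match  <- isTRUE(all.equal(rowSums(X),  apply(X, 1, sum)))
col_means_match <- isTRUE(all.equal(colMeans(X), apply(X, 2, mean)))
c(row_sums_match, col_means_match)   # TRUE TRUE
```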

5.3 Data frame

data.frame() creates a data frame, which stores data in tabular form.

For example:

Create a data frame.

BMI <- data.frame(gender = c('Male','Male','Female'),height = c(152,171.5,165),weight = c(81,93,78),Age = c(42,38,26))
print(BMI)

  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26

str(BMI)

'data.frame':   3 obs. of  4 variables:
 $ gender: chr  "Male" "Male" "Female"
 $ height: num  152 172 165
 $ weight: num  81 93 78
 $ Age   : num  42 38 26

# We can also create a data frame from existing vectors:
name <- c('john','peter','patrick','julie','bob')
age <- c(28,30,31,38,35)
children <- c(FALSE,TRUE,TRUE,FALSE,TRUE)
df <- data.frame(Name = name, Age = age, Children = children)
df

     Name Age Children
1    john  28    FALSE
2   peter  30     TRUE
3 patrick  31     TRUE
4   julie  38    FALSE
5     bob  35     TRUE

Extract specified elements.

df[3,2]

[1] 31

df[3,]

     Name Age Children
3 patrick  31     TRUE

df[2]

df$Age

[1] 28 30 31 38 35

df[3,'Age']

[1] 31

df['Age']

df[c(3,5),c('Age','Children')]

  Age Children
3  31     TRUE
5  35     TRUE

Add new elements.

# We can define a new col and add it to original data frame.
height <- c(163,177,163,162,157)
df$Height <- height
df

     Name Age Children Height
1    john  28    FALSE    163
2   peter  30     TRUE    177
3 patrick  31     TRUE    163
4   julie  38    FALSE    162
5     bob  35     TRUE    157

# We can also use cbind() to add a new column.
weight <- c(75,65,54,34,78)
cbind(df,weight)

     Name Age Children Height weight
1    john  28    FALSE    163     75
2   peter  30     TRUE    177     65
3 patrick  31     TRUE    163     54
4   julie  38    FALSE    162     34
5     bob  35     TRUE    157     78

# Adding a new row with rbind() is similar.
tom = data.frame(Name='Tom',Age = 36, Children = FALSE, Height = 182)
rbind(df,tom)

     Name Age Children Height
1    john  28    FALSE    163
2   peter  30     TRUE    177
3 patrick  31     TRUE    163
4   julie  38    FALSE    162
5     bob  35     TRUE    157
6     Tom  36    FALSE    182

Sort by one specific column.

# We can sort the data frame by a specified column.
sort(df$Age) # Sort the Age values alone in ascending order.

[1] 28 30 31 35 38

ranks = order(df$Age) # Row indices that put Age in ascending order.
ranks

[1] 1 2 3 5 4

df[ranks,] # The data frame sorted by Age in ascending order.

     Name Age Children Height
1    john  28    FALSE    163
2   peter  30     TRUE    177
3 patrick  31     TRUE    163
5     bob  35     TRUE    157
4   julie  38    FALSE    162

# Descending order
df[order(df$Age,decreasing = TRUE),]

     Name Age Children Height
4   julie  38    FALSE    162
5     bob  35     TRUE    157
3 patrick  31     TRUE    163
2   peter  30     TRUE    177
1    john  28    FALSE    163
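order() also accepts several keys, which are used in turn to break ties. The sketch below rebuilds the first three columns of df and sorts by Children first, then by Age in descending order within each group; negating a numeric column is one simple way to reverse its order.

```r
df <- data.frame(Name = c('john','peter','patrick','julie','bob'),
                 Age = c(28,30,31,38,35),
                 Children = c(FALSE,TRUE,TRUE,FALSE,TRUE))

# Sort by Children (FALSE before TRUE), then by Age from large to small
sorted <- df[order(df$Children, -df$Age), ]
sorted
```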

5.4 List

list() — creates a list.
- a list can contain objects of different types;
- unlike a vector, a list does not coerce its elements to a common type.

For example:

Create a list

# We create a list containing a numeric vector, a character vector and a logical vector.
list1 <- list(seq(10,30,10), c('a','b','c'), c(TRUE,FALSE))
list1

[[1]]
[1] 10 20 30

[[2]]
[1] "a" "b" "c"

[[3]]
[1]  TRUE FALSE

names(list1) <- c("number","letter","logical")
list1

$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE

# We can extract elements in a list
list1[[1]][2] # Extract the second value of the first element

[1] 20

list1[[2]] # Extract the second element

[1] "a" "b" "c"

# Another way to create a list with name
list2 <- list(number = c(10,20,30), letter= c("a","b","c"), logical=c(TRUE, FALSE))
list2

$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE

# View the structure of list
str(list2)

List of 3
 $ number : num [1:3] 10 20 30
 $ letter : chr [1:3] "a" "b" "c"
 $ logical: logi [1:2] TRUE FALSE

# An element of a list can itself be a list
list3 <- list(number = c(10,20,30), letter= c("a","b","c"), logical=c(TRUE, FALSE),list=list2)
list3

$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE

$list
$list$number
[1] 10 20 30

$list$letter
[1] "a" "b" "c"

$list$logical
[1]  TRUE FALSE

Extract elements from a list

# We can extract an element from a list by position using [[ ]]
list3[[2]]

[1] "a" "b" "c"

list3[[4]]

$number
[1] 10 20 30

$letter
[1] "a" "b" "c"

$logical
[1]  TRUE FALSE

# We can also extract elements by name, with a logical vector, or with $.
list3[['number']]

[1] 10 20 30

list3[c(FALSE,TRUE,FALSE,TRUE)]#TRUE for selecting, FALSE for not selecting

$letter
[1] "a" "b" "c"

$list
$list$number
[1] 10 20 30

$list$letter
[1] "a" "b" "c"

$list$logical
[1]  TRUE FALSE

list3$number

[1] 10 20 30
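The difference between single and double brackets matters in practice: single brackets return a smaller list, while double brackets (and $) return the element itself. A minimal sketch with a small list under the assumed name lst:

```r
lst <- list(number = c(10, 20, 30), letter = c("a", "b", "c"))

sub  <- lst["number"]     # single brackets: a list of length 1
elem <- lst[["number"]]   # double brackets: the numeric vector itself

class(sub)    # "list"
class(elem)   # "numeric"
mean(elem)    # 20; mean(sub) would return NA with a warning
```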

Comparison between vector and list

list1 <- list('music tracks',100,5)
list1

[[1]]
[1] "music tracks"

[[2]]
[1] 100

[[3]]
[1] 5

class(list1)

[1] "list"

is.list(list1)

[1] TRUE

vec <- c('music tracks',100,5)
vec

[1] "music tracks" "100"          "5"

class(vec)

[1] "character"

is.list(vec)

[1] FALSE

6 Flow Control

6.1 Conditional statement

An if statement consists of a Boolean expression followed by one or more statements.

The format is as follows:

if(boolean_expression) {
  # statements executed if the Boolean expression is TRUE
}

For example: if x is an integer we print ‘X is an Integer’, and if x is a character string we print ‘X is a character’. Here x is an integer, so only the first message appears.

x = 30L
typeof(x)

[1] "integer"

if (is.integer(x)){
  print('X is an Integer')
}

[1] "X is an Integer"

if (is.character(x)){
  print('X is a character')
}

An else statement is executed when the condition in the if statement evaluates to FALSE.

The format is as follows:

if(boolean_expression) {
  # statements executed if the Boolean expression is TRUE
}else{
  # statements executed if the Boolean expression is FALSE
}

For example, we divide a score into several levels:

score = 60
if (score >= 80) {
  print('Good Score!')
} else if (score >= 60 && score < 80) {
  print('Decent Score!')
} else if (score < 60 && score >= 33) {
  print('Average Score!')
} else {
  print('Poor!')
}

[1] "Decent Score!"
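An if/else chain handles one score at a time. To grade a whole vector of scores, one option is cut(), which assigns each value to an interval. The sketch below uses the same thresholds and labels as the chain above; the scores themselves are made up.

```r
scores <- c(95, 60, 45, 20)

# With right = FALSE the intervals are [-Inf,33), [33,60), [60,80), [80,Inf)
grades <- cut(scores,
              breaks = c(-Inf, 33, 60, 80, Inf),
              labels = c("Poor!", "Average Score!", "Decent Score!", "Good Score!"),
              right = FALSE)
as.character(grades)
```

The score of 60 falls in the interval [60,80) and is graded "Decent Score!", the same result as the if/else chain.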

6.2 Loop Statement

while loop: A while loop repeats a block of statements as long as the test expression evaluates to TRUE.

The format is as follows:

while (test_expression) {
  statement
}

For example:

Print ‘Hello World’ five times.

v = c('Hello World')
count = 2
while (count < 7) {
  print(v)
  count = count + 1
}

[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"

Calculate the sum from 1 to 100.

i=1
sum = 0
while (i<101) {
  sum = sum+i
  i = i+1
}
print(sum)

[1] 5050

for loop: A for loop is used to iterate over a list of elements or a range of numbers.

The format is as follows:

for (value in vector){
  statements
}

For example:

Print fruits.

fruit = c('Apple','Orange','Passion fruit','Banana')
for (i in fruit){
  print(i)
}

[1] "Apple"
[1] "Orange"
[1] "Passion fruit"
[1] "Banana"

Calculate sum from 1 to 100.

sum = 0 
for (i in 1:100) {
  sum = sum + i
}
print(sum)

[1] 5050
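Loops like these are useful for learning, but many of them can be replaced by a vectorised call. The sketch below also shows a common pitfall with 1:n when n is 0, and seq_len() as a safer way to build the loop range.

```r
# Vectorised alternatives to the loops above
sum(1:100)            # 5050
100 * (100 + 1) / 2   # 5050, from the formula n(n+1)/2

# 1:n counts backwards when n is 0, so a loop over it would run twice;
# seq_len(n) yields an empty sequence and the loop body is skipped
n <- 0
length(1:n)           # 2
length(seq_len(n))    # 0
```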

7 Functions

7.1 Custom Functions

A function is a set of statements that performs a specific task. R has a large number of built-in functions, and users can create their own functions.

The format is as follows:

function_name <- function(arg_1, arg_2, ...){
  # function body
}

For example:

Calculate the squares of 1 to 4.

squares <- function(a){
  for (i in 1:a){
    b = i^2
    print(b)
  }
}
squares(4)

[1] 1
[1] 4
[1] 9
[1] 16

Calculate sum from 1 to 100.

Sum <- function(n){
  sum <- 0
  for (i in 1:n) {
    sum = sum + i
  }
  print(sum)
}
Sum(100)

[1] 5050
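The functions above print their results. If the result is needed for further calculations, the function should return it instead. The following is a sketch of such a variant; sum_to and its default argument are hypothetical names chosen for illustration.

```r
# Hypothetical variant of Sum() that returns its result instead of printing it
sum_to <- function(n = 100) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i
  }
  total   # the value of the last expression is returned
}

result <- sum_to(100)
result * 2   # the returned value can be used in further calculations
```

Naming the accumulator total rather than sum also avoids masking the built-in sum() function inside the function body.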

7.2 Built-in R functions

R supports a lot of built-in functions to work with data structures. For example:

seq() – create sequences.

seq(1,10,by=2)

[1] 1 3 5 7 9

append() – combine objects.

v <- c(11,4,5,7,3,10,2)
v2 <- c(1,2,3,4,5)
append(v, v2)

 [1] 11  4  5  7  3 10  2  1  2  3  4  5

sort() – sort sequences.

v <- c(11,4,5,7,3,10,2)
sort(v)

[1]  2  3  4  5  7 10 11

sort(v,decreasing = T)

[1] 11 10  7  5  4  3  2

order() – return index of sorted vectors

ranks <- order(v)
ranks

[1] 7 5 2 3 4 6 1

v[ranks]

[1]  2  3  4  5  7 10 11

rank() – return ranking of elements

v

[1] 11  4  5  7  3 10  2

rank(v)

[1] 7 3 4 5 2 6 1

rev() – reverse elements in R objects

v2 <- c(1,2,3,4,5)
rev(v2)

[1] 5 4 3 2 1

R provides various mathematical functions to perform mathematical calculations. For example:

abs() – absolute value of a number.

abs(-5.99)

[1] 5.99

abs(-0.002)

[1] 0.002

sqrt() – square root of a number.

sqrt(121)

[1] 11

sum() – sum of a sequence.

sum(1:100)

[1] 5050

floor() – round down.

floor(5.99)

[1] 5

ceiling() – round up.

ceiling(5.99)

[1] 6

round() – round to the nearest integer (or to a given number of decimal places)

round(5.5)

[1] 6

round(5.1)

[1] 5
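One detail of round() can be surprising: a value exactly halfway between two integers is rounded to the even neighbour, following the IEC 60559 standard, rather than always upwards. A short sketch:

```r
r1 <- round(0.5)          # 0: halfway cases go to the even neighbour
r2 <- round(1.5)          # 2
r3 <- round(2.5)          # 2
r4 <- round(3.14159, 2)   # 3.14: the second argument sets the decimal places
c(r1, r2, r3, r4)
```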

sign() – sign function

sign(-0.2)

[1] -1

sign(2)

[1] 1

max() – maximum value

vec1 <- rnorm(10,0,1)
vec2 <- runif(10,0,1)
max(vec1)

[1] 2.082767

min() – minimum value

min(vec2)

[1] 0.03556776

pmax() – maximum value in each pair

pmax(vec1,vec2)

 [1] 0.4684160 0.3739669 0.8203838 0.8803938 2.0827673 0.7684691 1.6771109
 [8] 0.2103584 0.5292005 0.6926519

pmin() – minimum value in each pair

pmin(vec1,vec2)

 [1]  0.4452633597  0.3703759535  0.6707883861  0.5288044549  0.0355677570
 [6] -0.6783278153  0.6712036373 -0.8646459158  0.3169878945  0.0001665582

which.max() – index of maximum

which.max(vec1)

[1] 5

which.min() – index of minimum

which.min(vec1)

[1] 8

which() – find index

which(vec1>0)

[1]  1  2  3  4  5  7  9 10

sample() – random sampling

sample(1:10, 3, replace = TRUE)

[1] 8 2 1

sample(LETTERS[1:6], 3, replace = FALSE)

[1] "A" "F" "B"
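Because rnorm(), runif() and sample() draw random numbers, the outputs shown above will differ from run to run. Calling set.seed() first makes the results reproducible. The seed value 123 below is an arbitrary choice for this sketch.

```r
set.seed(123)              # fix the state of the random number generator
first <- rnorm(3)

set.seed(123)              # resetting the seed replays the same random numbers
second <- rnorm(3)

identical(first, second)   # TRUE
```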

8 Data Manipulation

8.1 Data Manipulation with `dplyr`

dplyr – a package used to transform and summarize tabular data with rows and columns.
- install.packages("dplyr")
- library(dplyr)

Some frequently used functions in dplyr are:

select()
filter()
arrange()
summarise()

mutate()
transmute()
group_by()

We will use the student data set introduced above as an example to illustrate the convenience of dplyr. The structure of this data set is as follows:

library(dplyr)
student <- read.csv("./students.csv") # read data set
head(student) # Just show first 6 rows.

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1

select() – Selects column variables based on their names.

# We select column 'Height' in student.
s1 <- select(student, Height)
head(s1)

# We select column 'Weight' in student.
s2 <- select(student, Weight)
head(s2)

filter() – Filters rows based on their values.

# We extract all male students.
f1 <- filter(student,Gender=="M")
head(f1)

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1

# We extract female students with height > 155.
f2 <- filter(student,Gender=="F", Height>155)
head(f2)

    Name Age Height Gender Weight
1 JACLYN  12    162      F   65.8
2 LESLIE  14    159      F   64.5
3 MARTHA  16    159      F   50.8
4    AMY  15    157      F   50.8

arrange() – Changes the ordering of the rows.

# We will arrange the students in ascending order of Height.
a1 <- arrange(student,Height)
head(a1)

     Name Age Height Gender Weight
1  ROBERT  12    125      M   35.9
2  LILLIE  12    127      F   29.1
3    JANE  12    135      F   33.6
4   SUSAN  13    137      F   30.4
5 MICHAEL  13    142      M   43.1
6   DAVID  13    145      M   35.9

# We will arrange the students in descending order of Weight.
a2 <- arrange(student,desc(Weight))
head(a2)

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2   JACLYN  12    162      F   65.8
3   LESLIE  14    159      F   64.5
4     KIRK  17    167      M   60.8
5  PHILLIP  16    167      M   58.1
6   ROBERT  15    164      M   58.1

summarise() – Reduces multiple values down to a single summary.

# We want to obtain the average of Height
summarise(student,avg_Height=mean(Height,na.rm=T))

  avg_Height
1     153.25

# We want to obtain the sum of Height
summarise(student,tot_Height=sum(Height,na.rm=T))

  tot_Height
1       6130

# We want to obtain the standard deviation of Weight
summarise(student,stdev_Weight=sd(Weight,na.rm=T))

  stdev_Weight
1     10.07415

# We want to obtain the average and sum of Weight simultaneously
summarise(student,avg_Weight=mean(Weight,na.rm=T),tot_Weight=sum(Weight,na.rm=T))

  avg_Weight tot_Weight
1    47.6625     1906.5

mutate() – Creates columns that are functions of existing variables.

# We will add a new column named 'BMI' to the original data set,
# where BMI = Weight / (0.01*Height)^2; the factor 0.01 rescales Height (presumably from cm to m).
student1 <- mutate(student, BMI = Weight / (0.01*Height)^2)
head(student1)

      Name Age Height Gender Weight      BMI
1 LAWRENCE  17    172      M   78.1 26.39941
2  JEFFERY  14    169      M   51.3 17.96156
3   EDWARD  14    167      M   50.8 18.21507
4  PHILLIP  16    167      M   58.1 20.83259
5     KIRK  17    167      M   60.8 21.80071
6   ROBERT  15    164      M   58.1 21.60173

transmute() – Like mutate(), but keeps only the new columns.

# Same example as above.
student2 <- transmute(student, BMI = Weight / (0.01*Height)^2)
head(student2)

group_by() – Groups the data set by one or more columns. It is usually combined with other functions such as summarise().

# We group by Gender in student and calculate the sum and mean of Height for each group.
by_type <- group_by(student, Gender)
summarise(by_type,Height_sum=sum(Height),Height_mean=mean(Height))

# A tibble: 2 × 3
  Gender Height_sum Height_mean
  <chr>       <int>       <dbl>
1 F            2684        149.
2 M            3446        157.

As we can see, the student data set is divided into 2 groups by Gender.
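For comparison, a grouped summary can also be computed in base R with aggregate(). This is an alternative sketch rather than part of dplyr; to stay runnable without students.csv it uses four hand-typed rows under the assumed name student_small.

```r
# A few rows of the student data, typed in by hand for this sketch
student_small <- data.frame(
  Gender = c("M", "M", "F", "F"),
  Height = c(172, 169, 162, 159)
)

# Mean Height per Gender, a base R counterpart of group_by() + summarise()
height_by_gender <- aggregate(Height ~ Gender, data = student_small, FUN = mean)
height_by_gender
```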

8.2 Usage of pipe operator `%>%`

The pipe operator %>%, which dplyr makes available from the magrittr package, passes the output of the previous function as the first argument of the next function, so a chain of operations can be read from left to right. Let’s see some examples.

Example 1: We group student by Gender and calculate the sum and mean of Height for each group.

# Without pipe %>%.
by_type <- group_by(student, Gender)
summarise(by_type,Height_sum=sum(Height),Height_mean=mean(Height))

# A tibble: 2 × 3
  Gender Height_sum Height_mean
  <chr>       <int>       <dbl>
1 F            2684        149.
2 M            3446        157.

# With pipe %>%.
student %>% group_by(Gender) %>% summarise(Height_sum=sum(Height),Height_mean=mean(Height))

# A tibble: 2 × 3
  Gender Height_sum Height_mean
  <chr>       <int>       <dbl>
1 F            2684        149.
2 M            3446        157.

Example 2: We select all male students, draw a random sample of 5 of them, and arrange the sample in descending order of Height. Because the sample is random, the two runs below return different rows.

# Without pipe %>%.
f <- filter(student,Gender == "M")
s <- sample_n(f,size = 5)
a <- arrange(s,desc(Height))
a

     Name Age Height Gender Weight
1  EDWARD  14    167      M   50.8
2 PHILLIP  16    167      M   58.1
3   HENRY  14    159      M   54.0
4  ALFRED  14    157      M   44.9
5     JOE  13    154      M   47.7

# With pipe %>%.
a <- student %>% filter(Gender == "M") %>% sample_n(size = 5) %>% arrange(desc(Height))
a

    Name Age Height Gender Weight
1  CHRIS  14    157      M   44.9
2  LEWIS  14    157      M   41.8
3   MARK  15    152      M   47.2
4    TIM  12    147      M   38.1
5 ROBERT  12    125      M   35.9
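
The two samples above differ because the draw is random. As a minimal sketch (an addition, not part of the original example), fixing the random seed with set.seed() makes a draw repeatable. The vector rows and the seed value 2024 are hypothetical placeholders; the sketch uses base R sample(), and calling set.seed() before sample_n() is expected to have the same effect.

```r
# Hypothetical stand-in for the row indices of the 22 male students.
rows <- 1:22

set.seed(2024)                       # fix the random number generator state
first_draw <- sample(rows, size = 5)

set.seed(2024)                       # same seed, so the same sample is drawn
second_draw <- sample(rows, size = 5)

identical(first_draw, second_draw)   # TRUE: the two draws coincide
```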

8.3 Data Manipulation with `tidyr`

tidyr – A package that helps us create tidy data. Tidy data is easy to visualize and model.
- install.packages("tidyr")
- library(tidyr)

There are some frequently used functions in tidyr, such as:

gather()
separate()
unite()
spread()

We first create a data frame:

library(tidyr)
data <- data.frame(
  ID = c(1:10),
  Face_1 = c(411,723,325,456,579,612,709,513,527,379),
  Face_2 = c(123,300,400,500,600,654,789,906,413,567),
  Face_3 = c(1457,1000,569,896,956,2345,780,599,1023,678)
)
print(data)

   ID Face_1 Face_2 Face_3
1   1    411    123   1457
2   2    723    300   1000
3   3    325    400    569
4   4    456    500    896
5   5    579    600    956
6   6    612    654   2345
7   7    709    789    780
8   8    513    906    599
9   9    527    413   1023
10 10    379    567    678

gather() – Reshape data from wide format to long format.

# We stack the columns 'Face_1' to 'Face_3' into two columns:
# 'Face' holds the original column name and 'ResponseTime' holds the value.
long <- data %>% gather(Face, ResponseTime, Face_1:Face_3)
long

   ID   Face ResponseTime
1   1 Face_1          411
2   2 Face_1          723
3   3 Face_1          325
4   4 Face_1          456
5   5 Face_1          579
6   6 Face_1          612
7   7 Face_1          709
8   8 Face_1          513
9   9 Face_1          527
10 10 Face_1          379
11  1 Face_2          123
12  2 Face_2          300
13  3 Face_2          400
14  4 Face_2          500
15  5 Face_2          600
16  6 Face_2          654
17  7 Face_2          789
18  8 Face_2          906
19  9 Face_2          413
20 10 Face_2          567
21  1 Face_3         1457
22  2 Face_3         1000
23  3 Face_3          569
24  4 Face_3          896
25  5 Face_3          956
26  6 Face_3         2345
27  7 Face_3          780
28  8 Face_3          599
29  9 Face_3         1023
30 10 Face_3          678

separate() – Split a single column into multiple columns.

# We divide the column 'Face' of 'long' data set into two columns,
# and name them 'Target' and 'Number'.
long_separate <- long %>% separate(Face, c("Target", "Number"), sep = "_")
long_separate

   ID Target Number ResponseTime
1   1   Face      1          411
2   2   Face      1          723
3   3   Face      1          325
4   4   Face      1          456
5   5   Face      1          579
6   6   Face      1          612
7   7   Face      1          709
8   8   Face      1          513
9   9   Face      1          527
10 10   Face      1          379
11  1   Face      2          123
12  2   Face      2          300
13  3   Face      2          400
14  4   Face      2          500
15  5   Face      2          600
16  6   Face      2          654
17  7   Face      2          789
18  8   Face      2          906
19  9   Face      2          413
20 10   Face      2          567
21  1   Face      3         1457
22  2   Face      3         1000
23  3   Face      3          569
24  4   Face      3          896
25  5   Face      3          956
26  6   Face      3         2345
27  7   Face      3          780
28  8   Face      3          599
29  9   Face      3         1023
30 10   Face      3          678

unite() – Combine multiple columns into a single column.

# We combine the columns 'Target' and 'Number' of 'long_separate' data set 
# into a single column, we name it 'Face'.
long_unite <- long_separate %>% unite(Face, Target, Number, sep = "_")
long_unite

   ID   Face ResponseTime
1   1 Face_1          411
2   2 Face_1          723
3   3 Face_1          325
4   4 Face_1          456
5   5 Face_1          579
6   6 Face_1          612
7   7 Face_1          709
8   8 Face_1          513
9   9 Face_1          527
10 10 Face_1          379
11  1 Face_2          123
12  2 Face_2          300
13  3 Face_2          400
14  4 Face_2          500
15  5 Face_2          600
16  6 Face_2          654
17  7 Face_2          789
18  8 Face_2          906
19  9 Face_2          413
20 10 Face_2          567
21  1 Face_3         1457
22  2 Face_3         1000
23  3 Face_3          569
24  4 Face_3          896
25  5 Face_3          956
26  6 Face_3         2345
27  7 Face_3          780
28  8 Face_3          599
29  9 Face_3         1023
30 10 Face_3          678

spread() – Takes two columns (key and value) and spreads them into multiple columns.

# We split the column 'Face' of 'long_unite' data set back into three 
# columns 'Face_1', 'Face_2' and 'Face_3'.
back_to_data <- long_unite %>% spread(Face, ResponseTime)
back_to_data

   ID Face_1 Face_2 Face_3
1   1    411    123   1457
2   2    723    300   1000
3   3    325    400    569
4   4    456    500    896
5   5    579    600    956
6   6    612    654   2345
7   7    709    789    780
8   8    513    906    599
9   9    527    413   1023
10 10    379    567    678
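
The tidyr documentation marks gather() and spread() as superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same wide-to-long-to-wide round trip with the newer functions, assuming tidyr version 1.0.0 or later is installed. The three-row data frame small is a shortened copy of the first rows of data, used only for illustration.

```r
library(tidyr)

# First three rows of the example data frame, retyped for a self-contained run.
small <- data.frame(
  ID = 1:3,
  Face_1 = c(411, 723, 325),
  Face_2 = c(123, 300, 400),
  Face_3 = c(1457, 1000, 569)
)

# Wide to long: counterpart of gather(Face, ResponseTime, Face_1:Face_3).
long <- pivot_longer(small, cols = Face_1:Face_3,
                     names_to = "Face", values_to = "ResponseTime")

# Long to wide: counterpart of spread(Face, ResponseTime).
wide <- pivot_wider(long, names_from = Face, values_from = ResponseTime)

long
wide
```

Note that pivot_longer() orders the rows by ID first, whereas gather() stacks one column after another, so the row order of long differs from the gather() output above.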

9 Data Visualization

9.1 Data Visualization in R

R supports a variety of functions and data visualization packages to build interactive visuals for exploratory data analysis.

plot() – A generic function for plotting R objects.
barplot() – Plots data using rectangular bars.
hist() – Creates histograms.
boxplot() – Represents data in the form of quartiles.
ggplot() – The main function of the ggplot2 package, which enables users to create sophisticated visualizations with little code using the Grammar of Graphics.
plot_ly() – The main function of the plotly package, which creates interactive web-based graphs via the open source JavaScript library plotly.js.

Next, we mainly introduce the usage of plot(); barplot(), hist() and boxplot() are introduced in the next section. The other two, ggplot() and plot_ly(), are left for students to explore.

We again use the student data set to illustrate the plot() function.

student <- read.csv("./students.csv")

# We first draw a scatter-plot for Height vs Weight.
plot(student$Height,
     student$Weight,
     type = "p", # type of plot: points 
     pch = 16, # shape of scattered points
     xlab = "Height",
     ylab = "Weight",
     xlim = c(120,175),
     ylim = c(20,80),
     main = "Height vs Weight",
     col = "red")

# We next draw a Line-plot for Height.
plot(student$Height,
     type = "l", # type of plot: line 
     lty = 1, # type of line
     lwd = 3, # width of line
     xlab = "Student",
     ylab = "Height",
     main = "Height of Student",
     ylim = c(120,175),
     col = "blue")

# We draw three series for Height, Weight and Age on one plot.
# Standardization: subtract the mean and divide by the standard deviation.
height <- (student$Height - mean(student$Height))/sd(student$Height)
weight <- (student$Weight - mean(student$Weight))/sd(student$Weight)
age <- (student$Age - mean(student$Age))/sd(student$Age)
plot(height,
     type = "l", # type of plot: line 
     lty = 2, # type of line: dashed
     lwd = 3, # width of line
     xlab = "Student",
     ylab = "Values",
     main = "Height, Weight and Age of Student",
     col = "blue")
lines(weight,
     type = "l", # type of plot: line 
     lty = 1, # type of line
     lwd = 3, # width of line
     col = "red")
points(age, pch = 16, col = "brown")
legend('topright',
       legend = c('Height','Weight','Age'),
       col = c('blue','red','brown'),
       lty = c(2,1,0), # must match the lty of each series (0 = no line)
       pch = c(NA,NA,16),
       lwd = c(3,3,2),
       ncol = 1)

10 Data Description

10.1 Data Set

10.1.1 Data Table

In statistical analysis, the data are often a description of several individuals (objects), and each individual is described by a number of indicators (characteristics) of interest. For example, suppose we survey the students in a class and denote the \(n\) indicators of the \(i\)-th student by \[ x_{i1},x_{i2},\cdots ,x_{in}. \] The records of \(m\) students then contain a total of \(m \times n\) values, which can be arranged in a table:

Number	variable1	variable2	\(\cdots\)	variable\(n\)
1	\(x_{11}\)	\(x_{12}\)	\(\cdots\)	\(x_{1n}\)
2	\(x_{21}\)	\(x_{22}\)	\(\cdots\)	\(x_{2n}\)
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)
\(m\)	\(x_{m1}\)	\(x_{m2}\)	\(\cdots\)	\(x_{mn}\)

The following table shows the Name, Age, Height, Gender and Weight of a class of 40 students (Table 1.1-2 in the textbook):

student <- read.csv("./students.csv")
head(student)

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8
6   ROBERT  15    164      M   58.1

Name	Age	Height	Gender	Weight
LAWRENCE	17	172	M	78.1
JEFFERY	14	169	M	51.3
EDWARD	14	167	M	50.8
PHILLIP	16	167	M	58.1
KIRK	17	167	M	60.8
ROBERT	15	164	M	58.1
JACLYN	12	162	F	65.8
DANNY	15	162	M	48.1
CLAY	15	162	M	47.7
HENRY	14	159	M	54.0
LESLIE	14	159	F	64.5
JOHN	13	159	M	44.5
WILLIAM	15	159	M	50.4
MARTHA	16	159	F	50.8
LEWIS	14	157	M	41.8
AMY	15	157	F	50.8
ALFRED	14	157	M	44.9
CHRIS	14	157	M	44.9
FREDRICK	14	154	M	42.2
CAROL	14	154	F	38.1
JOE	13	154	M	47.7
MARY	15	152	F	41.8
LINDA	17	152	F	52.7
MARK	15	152	M	47.2
PATTY	14	152	F	38.6
ELIZABET	14	152	F	41.3
JUDY	14	149	F	36.8
LOUISE	12	149	F	55.8
ALICE	13	149	F	48.6
JAMES	12	149	M	58.1
MARIAN	16	147	F	52.2
TIM	12	147	M	38.1
BARBARA	13	147	F	50.8
DAVID	13	145	M	35.9
KATIE	12	145	F	43.1
MICHAEL	13	142	M	43.1
SUSAN	13	137	F	30.4
JANE	12	135	F	33.6
LILLIE	12	127	F	29.1
ROBERT	12	125	M	35.9

10.1.2 Data File

The data recording each attribute of each observation are stored on a computer in a data file with a specific format. Such a data file goes by different names in different contexts.

In mathematics, it is a matrix;
In databases, it is a table;
In statistics, it is a record of observed values;
In R, it is a data frame.

Each row of this matrix is called a record or observation; it records the values of the various characteristics of one observed object.

Each column of the matrix is called a field or variable; it records the values of one characteristic for all observed objects.

Generally, data in the same column are required to have the same type. For example, the Gender column can contain only the characters M or F, while the Height column can contain only numbers.

10.1.3 Table Merging

In statistical analysis, it is often necessary to collect the individual characteristics of the research objects into one table. In R, there are three commonly used functions for merging data frames.

rbind()-To merge data frames with same variables vertically.

data1 <- student[1:5,]
data1

      Name Age Height Gender Weight
1 LAWRENCE  17    172      M   78.1
2  JEFFERY  14    169      M   51.3
3   EDWARD  14    167      M   50.8
4  PHILLIP  16    167      M   58.1
5     KIRK  17    167      M   60.8

data2 <- student[9:14,]
data2

      Name Age Height Gender Weight
9     CLAY  15    162      M   47.7
10   HENRY  14    159      M   54.0
11  LESLIE  14    159      F   64.5
12    JOHN  13    159      M   44.5
13 WILLIAM  15    159      M   50.4
14  MARTHA  16    159      F   50.8

data3 <- rbind(data1,data2)
data3

       Name Age Height Gender Weight
1  LAWRENCE  17    172      M   78.1
2   JEFFERY  14    169      M   51.3
3    EDWARD  14    167      M   50.8
4   PHILLIP  16    167      M   58.1
5      KIRK  17    167      M   60.8
9      CLAY  15    162      M   47.7
10    HENRY  14    159      M   54.0
11   LESLIE  14    159      F   64.5
12     JOHN  13    159      M   44.5
13  WILLIAM  15    159      M   50.4
14   MARTHA  16    159      F   50.8

cbind()-To merge data frames with same number of rows horizontally.

Grades <- c(88,78,90,65,73)
Level <- c("A","B","A","C","B")
data4 <- data.frame(Grades,Level)
data4

  Grades Level
1     88     A
2     78     B
3     90     A
4     65     C
5     73     B

data5 <- cbind(data1,data4)
data5

      Name Age Height Gender Weight Grades Level
1 LAWRENCE  17    172      M   78.1     88     A
2  JEFFERY  14    169      M   51.3     78     B
3   EDWARD  14    167      M   50.8     90     A
4  PHILLIP  16    167      M   58.1     65     C
5     KIRK  17    167      M   60.8     73     B

merge()-To merge data frames based on common variables.

Name <- c("LAWRENCE","JEFFERY", "EDWARD", "PHILLIP", "KIRK")
Grades <- c(88,78,90,65,73)
Level <- c("A","B","A","C","B")
data6 <- data.frame(Name,Grades,Level)
data6

      Name Grades Level
1 LAWRENCE     88     A
2  JEFFERY     78     B
3   EDWARD     90     A
4  PHILLIP     65     C
5     KIRK     73     B

data7 <- merge(data1, data6, by = "Name")
data7

      Name Age Height Gender Weight Grades Level
1   EDWARD  14    167      M   50.8     90     A
2  JEFFERY  14    169      M   51.3     78     B
3     KIRK  17    167      M   60.8     73     B
4 LAWRENCE  17    172      M   78.1     88     A
5  PHILLIP  16    167      M   58.1     65     C
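
In data7 every name appears in both tables, and the rows come back sorted by Name because merge() sorts on the key by default. The sketch below illustrates what happens when the keys only partly overlap, using the all, all.x arguments of merge(). The data frames left and right are hypothetical toy data, not taken from students.csv.

```r
# Hypothetical toy data frames with partly overlapping names.
left  <- data.frame(Name = c("ANNA", "BEN", "CARL"), Age = c(14, 15, 13))
right <- data.frame(Name = c("BEN", "CARL", "DORA"), Grades = c(78, 90, 65))

inner     <- merge(left, right, by = "Name")                # names in both tables
keep_left <- merge(left, right, by = "Name", all.x = TRUE)  # every row of left
full      <- merge(left, right, by = "Name", all = TRUE)    # names from either table

inner      # BEN and CARL only
keep_left  # ANNA is kept, with NA in Grades
full       # ANNA and DORA are kept, with NA where no match exists
```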

10.1.4 Special Value

During data collection, special values sometimes occur. R has three commonly encountered special values.

Inf — Infinity;
NaN — Not a Number;
NA — Not Available (Missing value).

Some calculations involving these special values:

2/0;2/Inf;exp(-Inf);Inf-Inf;Inf/Inf;0/0;

[1] Inf

[1] 0

[1] 0

[1] NaN

[1] NaN

[1] NaN

is.na(), is.nan() – To identify NA and NaN in a data set.

x <- c(5,6,8,NA,23,NA,9,2)
is.na(x) # Identify missing values

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

which(is.na(x)) # Identify the index of missing value

[1] 4 6

x[! is.na(x)] # Remove missing values

[1]  5  6  8 23  9  2

sum(is.na(x)) # Count missing values

[1] 2

y <- c(6,6,NaN,0,23,3,9,NaN)
is.nan(y) # Identify NaN

[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

which(is.nan(y)) # Identify the index of NaN

[1] 3 8

y[! is.nan(y)] # Remove NaN

[1]  6  6  0 23  3  9

sum(is.nan(y)) # Count NaN

[1] 2
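
The two functions are not symmetric: is.na() also returns TRUE for NaN, while is.nan() returns FALSE for NA. A short sketch on a hypothetical vector z (not from the original text) makes the difference visible, together with is.finite(), which additionally excludes Inf.

```r
z <- c(1, NA, NaN, Inf)   # hypothetical vector mixing all special values

is.na(z)      # FALSE  TRUE  TRUE FALSE : NA and NaN both count as missing
is.nan(z)     # FALSE FALSE  TRUE FALSE : only NaN
is.finite(z)  #  TRUE FALSE FALSE FALSE : NA, NaN and Inf are all excluded
```

This is why functions that drop values flagged by is.na() also drop NaN.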

na.omit() and the na.rm argument – To remove missing values (NA and NaN) before a calculation.

mean(x, na.rm = TRUE)

[1] 8.833333

mean(y, na.rm = TRUE)

[1] 7.833333

mean(na.omit(x))

[1] 8.833333

mean(na.omit(y))

[1] 7.833333

10.1.5 Type of Variables

Variables can be divided by meaning and measurement into quantitative and qualitative types: \[ \text{Quantitative} = \begin{cases} \text{Continuous} \\ \text{Discrete} \end{cases} \qquad \text{Qualitative} = \begin{cases} \text{Ordinal} \\ \text{Nominal} \end{cases} \]

Quantitative: Height, weight, age, etc. The value of such a variable is a quantity, and it is meaningful to perform arithmetic operations on its values.
- Continuous: The possible values of such a variable can fill an entire interval, such as height and weight. Such variables are also called interval or real variables.
- Discrete: Such a variable can take only a finite or countable number of values, such as age in whole years.
Qualitative: Gender, variety, province, etc. Different values of such a variable have different meanings, and arithmetic operations on its values are impossible or meaningless.
- Nominal: There is no natural order among the values of such a variable, such as gender or province.
- Ordinal: The values have a natural order, such as size (extra large, large, medium, small).

Discrete, Nominal and Ordinal are collectively referred to as categorical.
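
In R, qualitative variables are usually stored as factors. The following is a minimal sketch, with hypothetical example values that are not part of the student data, of how a nominal and an ordinal variable might be represented with factor().

```r
# Hypothetical example values.
province <- factor(c("Hunan", "Anhui", "Hunan"))          # nominal: no order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large", "extra large"),
               ordered = TRUE)                            # ordinal: ordered levels

levels(province)   # levels are sorted alphabetically, no order is implied
size < "large"     # order comparisons are allowed for ordered factors
table(size)        # counts per level, including levels with zero observations
```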

10.2 Frequency

10.2.1 Frequency Statistics

A table contains observations of multiple variables; we start with the analysis of a single variable. Let the results of \(n\) observations of a variable be \[x_1, x_2, \cdots, x_n.\] We first need to know which values are taken and in what proportions, that is, the distribution of these observations.

Gender	Frequency	Percent	Cumulative Frequency	Cumulative Percent
M	22	55%	22	55%
F	18	45%	40	100%

Age	Frequency	Percent	Cumulative Frequency	Cumulative Percent
12	8	20%	8	20%
13	7	17.5%	15	37.5%
14	12	30%	27	67.5%
15	7	17.5%	34	85%
16	3	7.5%	37	92.5%
17	3	7.5%	40	100%
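
A frequency table like the one for Age can be produced with table() and cumsum(). The sketch below is one possible way to do it; the vector age is the Age column of the 40 students retyped from the data table so that the block runs on its own, and the column names of age_table are chosen here for illustration.

```r
# Age column of the 40 students, copied from the data table.
age <- c(17,14,14,16,17,15,12,15,15,14,14,13,15,16,14,15,14,14,14,14,
         13,15,17,15,14,14,14,12,13,12,16,12,13,13,12,13,13,12,12,12)

freq <- table(age)                     # counts of each distinct value
counts <- as.vector(freq)

age_table <- data.frame(
  Age          = as.numeric(names(freq)),
  Frequency    = counts,
  Percent      = 100 * counts / length(age),
  CumFrequency = cumsum(counts),
  CumPercent   = 100 * cumsum(counts) / length(age)
)
age_table
```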

Besides this, we can compute frequencies for combinations of values of several variables.
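
For example, a two-way frequency table of Gender by Age can be sketched with table() on two vectors. The vectors gender and age are the corresponding columns of the 40 students retyped from the data table, so the block is self-contained.

```r
# Gender and Age columns of the 40 students, copied from the data table.
gender <- c("M","M","M","M","M","M","F","M","M","M","F","M","M","F","M","F","M","M","M","F",
            "M","F","F","M","F","F","F","F","F","M","F","M","F","M","F","M","F","F","F","M")
age <- c(17,14,14,16,17,15,12,15,15,14,14,13,15,16,14,15,14,14,14,14,
         13,15,17,15,14,14,14,12,13,12,16,12,13,13,12,13,13,12,12,12)

cross <- table(gender, age)   # counts for every (Gender, Age) combination
addmargins(cross)             # adds row and column totals
```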

10.2.2 Bar Plot

In addition to recording frequencies in a table, we can use a bar plot, in which columns of the same width are placed side by side and the height of each column represents the frequency.

barplot()-Bar plot.
pie()-Pie plot.

# Build a frequency table of Age
stat <- table(student$Age)
Age <- as.numeric(names(stat))
Freq <- as.numeric(stat)
age <- data.frame(Age, Freq)

# Vertical bar plot of age
col <- rainbow(6)
barplot(Freq~Age,
        data = age,
        xlab = "Age",
        ylab = "Frequency",
        col = col,
        border = "black",
        main = "Vertical bar plot of Age distribution")

# Horizontal bar plot of age
col <- rainbow(6)
barplot(Freq,
        names.arg = Age,
        xlab = "Frequency", # with horiz = TRUE the bars extend along the x-axis
        ylab = "Age",
        col = col,
        border = "black",
        horiz = TRUE,
        main = "Horizontal bar plot of Age distribution")

# pie plot
Age <- as.character(Age) # Turn number to character
age <- data.frame(Age, Freq)
col <- rainbow(6)
pie(age$Freq,
    labels = age$Freq,
    radius = 0.9,
    main = "Pie plot of Age distribution",
    col = col)
legend("topright", age$Age, cex = 0.8, fill = col)

10.2.3 Histogram

If the weights are grouped into intervals and the frequency of each interval is counted, we obtain the distribution of the variable over different ranges.

Weight	Frequency	Percent	Cumulative Frequency	Cumulative Percent
24~32	2	5%	2	5%
32~40	7	17.5%	9	22.5%
40~48	12	30%	21	52.5%
48~56	12	30%	33	82.5%
56~64	4	10%	37	92.5%
64~72	2	5%	39	97.5%
72~80	1	2.5%	40	100%

For an interval variable, the most commonly used plot of the grouped frequencies is the histogram:

The width of each column in the histogram is the width of the corresponding group;
The area of each column is the proportion of observations falling into that group, so the height of the column is this proportion divided by the width.
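
The relation between height, proportion and width can be checked numerically. The sketch below calls hist() with plot = FALSE to obtain the counts and densities without drawing, using the same breaks as the frequency table above. The vector weight is the Weight column retyped from the data table.

```r
# Weight column of the 40 students, copied from the data table.
weight <- c(78.1,51.3,50.8,58.1,60.8,58.1,65.8,48.1,47.7,54.0,
            64.5,44.5,50.4,50.8,41.8,50.8,44.9,44.9,42.2,38.1,
            47.7,41.8,52.7,47.2,38.6,41.3,36.8,55.8,48.6,58.1,
            52.2,38.1,50.8,35.9,43.1,43.1,30.4,33.6,29.1,35.9)

h <- hist(weight, breaks = seq(24, 80, by = 8), plot = FALSE)

proportion     <- h$counts / length(weight)   # share of observations per group
width          <- diff(h$breaks)              # group widths (all equal to 8 here)
height_by_hand <- proportion / width          # column height = proportion / width

h$counts                                # 2 7 12 12 4 2 1, as in the table
all.equal(height_by_hand, h$density)    # heights agree with hist()
sum(h$density * width)                  # total area of the columns is 1
```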

# Histogram of weight with bin width 8 (breaks 24, 32, ..., 80)
break1 <- c(24,32,40,48,56,64,72,80)
col <- rainbow(length(break1))
hist(student$Weight,
     breaks = break1,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

# Histogram of weight with bin width 5
break2 <- seq(20,80,5)
col <- rainbow(length(break2))
hist(student$Weight,
     breaks = break2,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

# Histogram of weight with bin width 10
break3 <- seq(20,80,10)
col <- rainbow(length(break3))
hist(student$Weight,
     breaks = break3,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

# Histogram of weight with bin width 2.5
break4 <- seq(29,79,2.5)
col <- rainbow(length(break4))
hist(student$Weight,
     breaks = break4,
     freq = FALSE,
     col = col,
     border = "black",
     xlab = "Weight",
     ylab = "Density",
     main = "Histogram of weight")

10.3 Moments Type

10.3.1 Summary of common statistics

Based on the method of generation:
- Based on moments of the observations: mean, variance, etc.;
- Based on order statistics of the observations: median, range, quantile, etc.
Based on the characteristic described:
- Describing the center of the distribution: mean, median, etc.;
- Describing the degree of dispersion: variance, range, etc.;
- Other statistics describing the distribution and its shape.

10.3.2 Mean

Given a set of observations \(x_1,x_2, \cdots, x_n\), where \(n\) is the sample size, the mean is \[ \bar x = \frac{1}{n}\sum_{i=1}^{n}x_i, \] which describes the central position of this set of observations.

10.3.3 Variance and Standard Deviation

The variance has two forms, \[ s^{*2} = s^{*2}(x) = s^{*2}_n(x)= \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2, \] \[ s^{2} = s^{2}(x) = s^{2}_n(x)= \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2, \] which describe the dispersion of the observations. \(s^{2}_n(x)\) is the biased variance and \(s^{*2}_n(x)\) is the unbiased variance.

Moreover:

Standard deviation: \(s^{*}_n(x) = \sqrt{s^{*2}_n(x)} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2}\);
Standard error of the mean: \(\frac{s^{*}}{\sqrt{n}}\);
Coefficient of variation: \(cv = \frac{s^{*}}{\bar x} \cdot 100\%\).

If \(y_i = a x_i +b\), \(1\leq i \leq n\), then \[ \bar y = a \bar x + b, \quad s^{*2}(y) = a^2 s^{*2}(x), \quad s^{*}(y) = |a|s^{*}(x). \] For \(x = (x_1,x_2, \cdots, x_n)\), \[ y_i = \frac{x_i-\bar x}{s^{*}(x)} \] is called the standardization of \(x\), also called the standard score. After standardization, the mean is 0 and the unbiased variance is 1.
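
These identities can be verified numerically. The sketch below uses a hypothetical toy vector x and arbitrary constants a and b (none of them from the original text); var() and sd() in R compute the unbiased versions \(s^{*2}\) and \(s^{*}\).

```r
x <- c(2, 4, 4, 5, 7, 9)   # hypothetical toy data
a <- -3                    # arbitrary slope (negative, to show the |a|)
b <- 10                    # arbitrary shift
y <- a * x + b

c(mean(y), a * mean(x) + b)   # the mean is scaled and shifted
c(var(y),  a^2 * var(x))      # the variance is scaled by a^2, the shift drops out
c(sd(y),   abs(a) * sd(x))    # the standard deviation is scaled by |a|

z <- (x - mean(x)) / sd(x)    # standardization (standard score)
c(mean(z), var(z))            # mean 0 (up to rounding) and variance 1
```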

10.3.4 Coefficient of skewness and Kurtosis

Another two statistics related to moments are:

Coefficient of skewness, or skewness: \[ g_1 = \frac{1}{ns^3}\sum_{i=1}^n(x_i-\bar x)^3, \] or, with a small-sample correction, \[ g_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^n \left( \frac{x_i-\bar x}{s^{*}} \right)^3. \]
Coefficient of kurtosis, or kurtosis: \[ g_2 = \frac{1}{ns^4}\sum_{i=1}^n(x_i-\bar x)^4-3, \] or, with a small-sample correction, \[ g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^n \left( \frac{x_i-\bar x}{s^{*}} \right)^4-3\frac{(n-1)^2}{(n-2)(n-3)}. \] Here \(s\) is the biased and \(s^{*}\) the unbiased standard deviation.

Notice that:

The skewness describes the symmetry of the data about its center, the mean;
A skewness close to 0 indicates that the distribution is symmetrical about its mean;
A positive (negative) skewness coefficient indicates that the data have a longer tail on the right (left).

The kurtosis describes the tails of the data distribution relative to the shape of the normal distribution;
A kurtosis close to 0 indicates that the shape of the distribution is close to normal;
A positive (negative) kurtosis indicates that the data distribution has thicker (thinner) tails on both sides.

10.3.5 Calculation with `R`

We use the Weight data to calculate the above statistics with R:

weight <- student$Weight
# mean
mean(weight)

[1] 47.6625

# variance
var(weight)

[1] 101.4886

# standard deviation
sd(weight)

[1] 10.07415

# coefficient of variation
sd(weight)/mean(weight)

[1] 0.2113643

library(moments) # skewness and kurtosis
library(plotrix) # standard error of mean
# standard error of mean
std.error(weight)

[1] 1.592863

# skewness
skewness(weight)

[1] 0.5939948

# kurtosis (moments::kurtosis() does not subtract 3; use kurtosis(weight) - 3 to obtain g_2)
kurtosis(weight)

[1] 3.697771
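
The skewness and kurtosis formulas of Section 10.3.4 can also be coded directly in base R. The sketch below is one possible implementation; the variable names (s_biased, g1_adj, and so on) are chosen here for illustration, and weight is the Weight column retyped from the data table.

```r
# Weight column of the 40 students, copied from the data table.
weight <- c(78.1,51.3,50.8,58.1,60.8,58.1,65.8,48.1,47.7,54.0,
            64.5,44.5,50.4,50.8,41.8,50.8,44.9,44.9,42.2,38.1,
            47.7,41.8,52.7,47.2,38.6,41.3,36.8,55.8,48.6,58.1,
            52.2,38.1,50.8,35.9,43.1,43.1,30.4,33.6,29.1,35.9)

n <- length(weight)
d <- weight - mean(weight)        # deviations from the mean
s_biased   <- sqrt(sum(d^2) / n)  # s, divides by n
s_unbiased <- sd(weight)          # s*, divides by n - 1

# Simple moment versions
g1 <- sum(d^3) / (n * s_biased^3)
g2 <- sum(d^4) / (n * s_biased^4) - 3

# Small-sample corrected versions
g1_adj <- n / ((n - 1) * (n - 2)) * sum((d / s_unbiased)^3)
g2_adj <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum((d / s_unbiased)^4) -
  3 * (n - 1)^2 / ((n - 2) * (n - 3))

c(g1 = g1, g1_adj = g1_adj, g2 = g2, g2_adj = g2_adj)
```

Under these formulas, g1 and g2 + 3 are expected to agree with the skewness() and kurtosis() values printed above, since the moments package uses the simple moment versions.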

10.4 Order Statistic Type

10.4.1 Order Statistic

Let \(x_1,x_2, \cdots, x_n\) be a set of observations. Arrange them from smallest to largest, \[ x_{1,n} \leq x_{2,n} \leq \cdots \leq x_{n,n}, \] that is, \[ \begin{cases} x_{(1)} = x_{1,n} = \min\{ x_1, x_2, \cdots, x_n\} \\ x_{(2)} = x_{2,n} = \mathop{\min}\limits_{1\leq i < j \leq n}\max\{ x_i, x_j\}=\mathop{\max}\limits_{1\leq i_1 < \cdots < i_{n-1} \leq n}\min\{ x_{i_1}, \cdots, x_{i_{n-1}}\}\\ \cdots \cdots \\ x_{(n)} = x_{n,n} = \max\{ x_1, x_2, \cdots, x_n\} \end{cases} \] Then \((x_{(1)}, x_{(2)}, \cdots, x_{(n)})\) are called the order statistics of these observations.

10.4.2 Median

For a set of observations \(x_1,x_2, \cdots, x_n\), the median is defined as \[ m= \begin{cases} x_{\frac{n+1}{2},n}, ~n~odd, \\ \frac{1}{2}\left(x_{\frac{n}{2},n}+x_{\frac{n}{2}+1,n}\right), ~n~even, \end{cases} \] which also describes the center of the distribution: roughly half of the observations are greater than the median and roughly half are less. It is easy to calculate and is not affected by extreme values (robustness).

10.4.3 Range

For observations \(x_1,x_2, \cdots, x_n\), range is defined as \[ r = x_{n,n}-x_{1,n} = \mathop{\max}\limits_{1\leq i \leq n} x_i - \mathop{\min}\limits_{1\leq i \leq n} x_i, \] which is often used to describe the dispersion.

10.4.4 Empirical Distribution

For observations \(x_1,x_2, \cdots, x_n\), we call the discrete distribution \[ \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \end{pmatrix} \] the empirical distribution. Its empirical distribution function is \[ \widehat F_n(x)= \begin{cases} 0,~ x < x_{1,n}\\ \frac{k}{n}, ~x_{k,n} \leq x < x_{k+1,n},~ 1\leq k \leq n-1\\ 1,~x\ge x_{n,n} \end{cases} =\frac{1}{n}\sum_{i=1}^n I_{\{ x \ge x_i \}}, \] where \(I_{\{ x \ge x_i \}} = I_{[x_i, \infty)}(x)\) is the indicator function of the event \(\{x \ge x_i\}\), that is, \[ I_{\{ x \ge x_i \}} = I_{[x_i, \infty)}(x)= \begin{cases} 1,~x\ge x_i \\ 0,~ otherwise. \end{cases} \] Notice that:

The mean and the (biased) variance are the mean and variance of the empirical distribution, respectively;
All moment-type statistics are based on moments of the empirical distribution.

The following plot shows the empirical distribution of the Weight data.

The empirical distribution completely describes the distribution of the observations and is also an important tool in statistical inference and large-sample theory.
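
In R, the empirical distribution function is available as ecdf() in the stats package. The following minimal sketch, on a hypothetical toy vector x rather than the Weight data, evaluates \(\widehat F_n\) at a few points and checks the two remarks about the mean and the biased variance.

```r
x <- c(3, 1, 4, 1, 5)        # hypothetical toy observations
n <- length(x)

Fn <- ecdf(x)                # step function \hat F_n
Fn(c(0, 1, 3.5, 5))          # proportion of observations <= each point
mean(x <= 3.5)               # the same value from the indicator formula

# Mean and variance of the empirical distribution (each point has mass 1/n)
emp_mean <- sum(x * (1 / n))
emp_var  <- sum((x - emp_mean)^2 * (1 / n))
c(emp_mean, emp_var)

# plot(Fn) would draw the step function
```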

10.4.5 Quantile

For \(p \in (0,1)\), the \(p\)-quantile of a distribution \(F\) is the value \(\zeta_p\) satisfying \(F(\zeta_p)=p\). For the empirical distribution, a solution of \(\widehat F_n(x) = p\) may not exist, or may not be unique if it exists, since \(\widehat F_n\) is a step function and is neither continuous nor strictly monotone.

For observations \(x_1,x_2, \cdots, x_n\), the sample \(p\)-quantile is therefore chosen as a value that:

Makes the number of observations less than the \(p\)-quantile approximately \(np\);
Makes the number of observations larger than the \(p\)-quantile approximately \(n(1-p)\);
Is close to \(x_{[np],n}\) and to the \(p\)-quantile of the empirical distribution \(\widehat F_n\).

In practice, there are multiple ways to calculate the \(p\)-quantile, but their values are close when the sample size is large. The default formula in R is \[ z_p = (1-g)x_{j,n} + gx_{j+1,n} \] where \(j = [(n-1)p]+1\), \(g = (n-1)p-j+1\) and \([\cdot]\) denotes the integer part. (More methods can be found in textbook 1.1.17.)
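
The formula can be written out and compared with quantile(). The helper quantile_by_hand() below is a hypothetical name introduced for this sketch and is intended for \(p \in (0,1)\) only; weight is the Weight column retyped from the data table.

```r
# Weight column of the 40 students, copied from the data table.
weight <- c(78.1,51.3,50.8,58.1,60.8,58.1,65.8,48.1,47.7,54.0,
            64.5,44.5,50.4,50.8,41.8,50.8,44.9,44.9,42.2,38.1,
            47.7,41.8,52.7,47.2,38.6,41.3,36.8,55.8,48.6,58.1,
            52.2,38.1,50.8,35.9,43.1,43.1,30.4,33.6,29.1,35.9)

# z_p = (1 - g) x_{j,n} + g x_{j+1,n}, for 0 < p < 1
quantile_by_hand <- function(x, p) {
  xs <- sort(x)                  # order statistics x_{1,n} <= ... <= x_{n,n}
  n  <- length(xs)
  j  <- floor((n - 1) * p) + 1   # integer part plus one
  g  <- (n - 1) * p - j + 1      # fractional part of (n - 1) p
  (1 - g) * xs[j] + g * xs[j + 1]
}

quantile_by_hand(weight, 0.2)          # 0.2 * 38.1 + 0.8 * 38.6 = 38.5
quantile(weight, 0.2, names = FALSE)   # built-in result for comparison
```

For p = 0.2 and n = 40 we get (n-1)p = 7.8, so j = 8 and g = 0.8, and the quantile interpolates between the 8th and 9th order statistics.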

10.4.6 Quartile and Interquartile Range

Quartiles and percentiles are the most commonly used among the various quantiles.

The \(\frac{i}{4}\)-quantile is called the \(i\)-th quartile, denoted as \(q_i\);
- The 0.25-quantile \(q_1\) is called the lower quartile;
- The 0.75-quantile \(q_3\) is called the upper quartile;
The \(\frac{i}{100}\)-quantile is called the \(i\)-th percentile.

Moreover, IQR(interquartile range) defined as \[ IQR = q_3-q_1 \] is also used to describe the dispersion of the distribution.

10.4.7 Box plot

Box plots are often used to represent quantiles and thereby describe distribution information.

A box plot consists of a rectangular box with whiskers on both sides;
The upper (lower) side of the box is the upper (lower) quartile, so the height of the box is the interquartile range;
The horizontal line inside the box corresponds to the median;
Each whisker extends from the edge of the box to the farthest data point that lies within 1.5 times the IQR of that edge;
The top (bottom) line therefore corresponds to the max (min) value when no observation lies beyond this range;
Extreme data lie outside this range and are drawn as individual points.
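
The numbers behind a box plot can be inspected without drawing it, using boxplot.stats() from base R. The sketch below applies it to the Weight column (retyped from the data table). Note that boxplot.stats() uses hinges as box edges, which may differ slightly from the quartiles returned by quantile().

```r
# Weight column of the 40 students, copied from the data table.
weight <- c(78.1,51.3,50.8,58.1,60.8,58.1,65.8,48.1,47.7,54.0,
            64.5,44.5,50.4,50.8,41.8,50.8,44.9,44.9,42.2,38.1,
            47.7,41.8,52.7,47.2,38.6,41.3,36.8,55.8,48.6,58.1,
            52.2,38.1,50.8,35.9,43.1,43.1,30.4,33.6,29.1,35.9)

b <- boxplot.stats(weight)   # whisker coefficient is 1.5 by default

b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # observations beyond the whiskers (drawn as separate points)
```

Here the heaviest student (78.1) lies more than 1.5 times the box height above the upper hinge, so the upper whisker stops at 65.8 and 78.1 is reported in b$out.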

We again use the student data set to show some box plots:

# Weight and Height
Weight <- student$Weight
Height <- student$Height
# Standardization
weight <- (Weight-mean(Weight))/sd(Weight)
height <- (Height-mean(Height))/sd(Height)
col <- rainbow(2)
boxplot(weight, 
        height,
        names = c("Weight","Height"),
        col = col,
        main = "Weight and Height for students")

# Weight for different gender
col <- rainbow(2)
boxplot(Weight~Gender,
        data = student,
        col = col,
        main = "Weight with different gender")

# Weight and Height for different gender
gender <- student$Gender
boxplot(weight~gender,
        col = "red",
        boxwex = 0.3, # set width of box
        at = 1:2+0.2, # set position of box on x-axis
        ylab = "Values",
        main = "Weight and Height for different gender")
boxplot(height~gender,
        col = "blue",
        boxwex = 0.3,
        at = 1:2 - 0.2,
        add = TRUE)
legend("topright", legend = c("Height","Weight"),col = c("blue","red"),pch = c(15,15))

10.4.8 Calculation with `R`

We use the Weight data to calculate the above statistics with R:

weight <- student$Weight
# min
min(weight)

[1] 29.1

# max
max(weight)

[1] 78.1

# median
median(weight)

[1] 47.7

# range
max(weight) - min(weight)

[1] 49

# quantiles
prob1 <- seq(0,1,by= 0.2)
quantile(weight, prob1)

   0%   20%   40%   60%   80%  100% 
29.10 38.50 44.74 50.56 54.36 78.10

# quartiles
prob2 <- c(0.25, 0.5, 0.75, 1)
quantile(weight, prob2)

   25%    50%    75%   100% 
41.675 47.700 52.325 78.100

# interquartile range
IQR(weight)

[1] 10.65
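
The quantile values above come from R's default interpolation rule, documented as type 7 in `?quantile`. The following is a minimal sketch of that rule on made-up data. The helper name `quantile_type7` is hypothetical and is introduced here only for illustration.

```r
# Hypothetical helper: sample quantile by linear interpolation between
# order statistics (the rule R documents as type 7, its default)
quantile_type7 <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1          # fractional position in the sorted sample
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# Made-up data for illustration
x <- c(12, 7, 3, 9, 15, 4)

p <- c(0.25, 0.5, 0.75)
by_hand  <- quantile_type7(x, p)
built_in <- unname(quantile(x, p))

print(by_hand)
print(built_in)
print(all.equal(by_hand, built_in))
```

The same rule explains the first quartile of the Weight data. With n = 40 and p = 0.25 the position is h = 10.75, so the quartile lies three quarters of the way from the 10th ordered weight (41.3) to the 11th (41.8), which gives 41.3 + 0.75 × 0.5 = 41.675.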

10.5 Other Statistics

10.5.1 Mode

Suppose the observations \(x_1,x_2,...,x_n\) take \(k\) distinct values, with frequencies given by: \[ \begin{pmatrix} x_1 & x_2 & \cdots & x_k \\ n_1 & n_2 & \cdots & n_k \end{pmatrix} \] Then we call \(x_{j_0}\) the mode if \(n_{j_0} = \max\{n_1, n_2, \cdots, n_k\}\); that is, the mode is the value that occurs most often among the observed values.
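
The definition allows more than one value to attain the maximum frequency. The following is a minimal sketch on made-up data. The helper name `sample_modes` is hypothetical; it returns every value whose frequency is maximal rather than only the first one.

```r
# Hypothetical helper: return every value whose frequency is maximal
sample_modes <- function(x) {
  freq <- table(x)                       # frequency of each distinct value
  names(freq)[freq == max(freq)]         # values attaining the maximum
}

# Made-up observations for illustration
x <- c(3, 5, 5, 2, 3, 5, 2)
y <- c("a", "b", "a", "b", "c")

print(table(x))
print(sample_modes(x))   # a single mode
print(sample_modes(y))   # two values tie for the highest frequency
```

Because `table()` stores the distinct values as names, the result is a character vector. For numeric data it can be converted back with `as.numeric()`.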

10.5.2 Coefficient of Correlation

In statistical analysis, we often need to analyze the relationship between different variables. For continuous variables, the Pearson correlation coefficient is the most commonly used measure of the linear correlation between two variables.

For a set of observations \((x_1,y_1),(x_2,y_2),...,(x_n,y_n)\) of the random variables \(X,Y\) with sample size \(n\), the Pearson correlation coefficient \(r\) is defined as: \[ r = r(x,y) = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2 \sum_{i=1}^n(y_i-\bar y)^2}}. \]

The value of the correlation coefficient lies in \([-1,1]\);
If the correlation coefficient is positive (negative), \(x\) and \(y\) are said to be positively (negatively) correlated;
The larger the absolute value of the correlation coefficient, the stronger the linear relationship between \(x\) and \(y\);
The correlation coefficient is unchanged by linear transformations whose slopes have the same sign, that is, if \[ \tilde x_i = a x_i +b, \quad \tilde y_i = c y_i +d, \quad 1 \leq i \leq n, \] with \(ac > 0\), then \[ r(\tilde x, \tilde y) = r(x,y). \] If \(ac < 0\), the sign of \(r\) is reversed.

In particular, the sample covariance is defined as: \[ c(x,y) = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y). \]
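
The definitions above can be checked against the built-in functions `cor()` and `cov()`. The following is a minimal sketch on made-up paired data. The names `r_by_hand` and `c_by_hand` are introduced here only for illustration.

```r
# Made-up paired observations for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson correlation coefficient computed from its definition
dx <- x - mean(x)
dy <- y - mean(y)
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Sample covariance with the 1/(n - 1) factor
n <- length(x)
c_by_hand <- sum(dx * dy) / (n - 1)

print(r_by_hand)
print(cor(x, y))
print(c_by_hand)
print(cov(x, y))

# r equals the covariance divided by both sample standard deviations
print(all.equal(r_by_hand, c_by_hand / (sd(x) * sd(y))))

# Linear transformations with a * c > 0 leave r unchanged ...
print(all.equal(cor(2 * x + 1, 3 * y - 4), r_by_hand))
# ... while a * c < 0 flips its sign
print(all.equal(cor(-2 * x + 1, 3 * y - 4), -r_by_hand))
```

For this data the sum of cross products is 8 and both sums of squares are 10, so \(r = 0.8\) and \(c(x,y) = 8/4 = 2\).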

10.5.3 Calculation with `R`

We again use the student data to calculate the above statistics with R:

weight <- student$Weight
height <- student$Height # raw heights, not the standardized values used in the boxplots
age <- student$Age
# mode
freq <- table(age)
freq

age
12 13 14 15 16 17 
 8  7 12  7  3  3

as.numeric(names(freq)[which.max(freq)])

[1] 14

# correlation coefficient
cor(weight, height, method = "pearson")

[1] 0.7084378

# covariance
cov(weight, height)

[1] 74.7609
