R is widely used open-source statistical software, part of the GNU project, and an excellent tool for statistical computing and plotting.
RStudio is an IDE for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, debugging, and workspace management.
1.2 Example Demonstration
Let’s start with a preliminary statistical analysis of the iris data set.
The library() function loads the datasets package, which contains the built-in iris data set.
The head() function shows the first six rows of the iris data.
library(datasets) # Load built-in data sets
head(iris) # Show the first six rows of the iris data
As we can see, the iris data set includes five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. We can also use the View() function to see the whole iris data set.
The summary() function summarizes the basic statistics of each variable:
summary(iris) # Summary statistics for iris data
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
We can also draw a scatterplot matrix of the iris data set with the plot() function.
plot(iris) # Scatterplot matrix for iris data
2 Variables in R
2.1 Variable in R
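A variable is a named storage location that your programs can manipulate.
A variable name can combine letters, digits, periods, and underscores. It must start with a letter, or with a period that is not followed by a digit, and it cannot be a reserved word such as TRUE.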
Valid variables    Invalid variables
total              tot@l
sum_data           5um
.count             _count
.count.total       TRUE
Var                .0ar
Here is an example with a valid variable name.
# character string
hello_string <- 'hello'
hello_string
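The naming rules can be checked programmatically. The following is a minimal sketch, not part of the original text: it relies on the base R function make.names(), which rewrites a string into a syntactically valid name, so a string that comes back unchanged was already valid. The names candidates and is_valid are introduced here only for illustration.

```r
# Candidate names taken from the valid and invalid names listed above.
candidates <- c("total", "sum_data", ".count", ".count.total", "Var",
                "tot@l", "5um", "_count", "TRUE", ".0ar")

# make.names() repairs invalid names (for example "5um" becomes "X5um"),
# so comparing its result with the input tells us which names were valid.
is_valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is_valid)
```

The first five names are reported as valid and the last five as invalid, which agrees with the lists above.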
We can use the class() function to obtain the class of an object, such as numeric, character, complex, or integer.
Similarly, we can use the typeof() function to return the internal type of an object.
Data type    Examples
Logical      TRUE, FALSE
Numeric      10.5, 7, 845
Integer      3L, 40L, 4L
Complex      3+2i
Character    'a', 'hello', '13.5'
Raw          'Hello' is stored as 48 65 6c 6c 6f
For example:
x <- 100
typeof(x)
[1] "double"
class(x)
[1] "numeric"
y <- 100L
typeof(y)
[1] "integer"
class(y)
[1] "integer"
a <- TRUE
typeof(a)
[1] "logical"
class(a)
[1] "logical"
stringhello <- 'hello world'
typeof(stringhello)
[1] "character"
class(stringhello)
[1] "character"
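The raw type is listed above but not demonstrated. As a small sketch using the base R functions charToRaw() and rawToChar() (the variable name r is only illustrative), we can see the bytes behind a character string:

```r
# Convert a string to its underlying bytes (shown in hexadecimal).
r <- charToRaw("Hello")
r              # 48 65 6c 6c 6f
typeof(r)      # "raw"

# Convert the bytes back to a character string.
rawToChar(r)   # "Hello"
```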
3.2 Arithmetic operators
R supports the basic arithmetic operators and functions, such as:
+ – addition
- – subtraction
* – multiplication
/ – division
%% – remainder (modulo)
%/% – integer quotient
abs() – absolute value
^ – exponentiation
exp() – exponential function
sqrt() – square root
log() – natural logarithm
factorial() – factorial
sin(), cos(), tan() – trigonometric functions
choose() – binomial coefficient (number of combinations)
For example:
Addition, subtraction, multiplication, division, and exponentiation
2+3*2;2^3;exp(2);2**3;(56-14)/6-4*7*10/(5^2-5);
[1] 8
[1] 8
[1] 7.389056
[1] 8
[1] -7
Remainder and integer quotient
7%%2;7%/%2;9.5%% (-2.7);9.5%/% (-2.7)
[1] 1
[1] 3
[1] -1.3
[1] -4
Square root, absolute value, trigonometric functions, and logarithm
sqrt(2);sqrt(8);abs(2-4);cos(4*pi);log(0)
[1] 1.414214
[1] 2.828427
[1] 2
[1] 1
[1] -Inf
Factorial and binomial coefficient
factorial(6);choose(5,2)
[1] 720
[1] 10
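The results of %% and %/% with a negative divisor can look surprising. A short sketch (the vectors x, y, q, and r are illustrative) shows the relationship that ties the two operators together: the quotient is rounded down, and quotient times divisor plus remainder gives back the original number.

```r
x <- c(7, 9.5, -7)
y <- c(2, -2.7, 2)

q <- x %/% y   # integer quotient, rounded toward minus infinity
r <- x %% y    # remainder, with the same sign as the divisor

q              # 3 -4 -4
r              # 1.0 -1.3 1.0

# The defining identity: x equals q * y + r (up to floating-point error).
all.equal(q * y + r, x)
```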
3.3 Logical operators
Logical operators mainly include:
> – greater than
< – less than
>= – greater than or equal to
<= – less than or equal to
== – equal to
!= – not equal to
& is the AND operation, which returns TRUE if both conditions are TRUE.
| is the OR operation, which returns TRUE if at least one of the conditions is TRUE.
! is the NOT operation, which takes each element of a vector and gives the opposite logical value.
For example:
# numeric
x <- 100; y <- 200
y > x # greater than
[1] TRUE
y < x # less than
[1] FALSE
y >= x # greater than/equal
[1] TRUE
y <= x # less than/equal
[1] FALSE
x == y # equal to
[1] FALSE
x != y # not equal
[1] TRUE
# vector
x <- 1:5; y <- 2:6
y > x # greater than
[1] TRUE TRUE TRUE TRUE TRUE
y < x # less than
[1] FALSE FALSE FALSE FALSE FALSE
y >= x # greater than/equal
[1] TRUE TRUE TRUE TRUE TRUE
y <= x # less than/equal
[1] FALSE FALSE FALSE FALSE FALSE
x == y # equal to
[1] FALSE FALSE FALSE FALSE FALSE
x != y # not equal
[1] TRUE TRUE TRUE TRUE TRUE
10>20&10<20
[1] FALSE
20>20&10<20
[1] FALSE
20>=20&10<20
[1] TRUE
20>=20&10<20&20<30
[1] TRUE
10>20|10<20
[1] TRUE
20>=20|10<20
[1] TRUE
10>20|10<2
[1] FALSE
10>20|10<2|10>15
[1] FALSE
!10==10
[1] FALSE
!(10==3)
[1] TRUE
These logical operations make data processing very convenient. For instance, suppose we want to extract the following three subsets of the student data set (Table 1.1-2 in the textbook):
Male students with height greater than or equal to 169;
Female students with age less than 14;
Students with age greater than or equal to 16 or height greater than 168.
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
3 EDWARD 14 167 M 50.8
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
6 ROBERT 15 164 M 58.1
# Male students with height greater than or equal to 169
x <- student[student$Gender == "M" & student$Height >= 169, ]
x
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
# Female students with age less than 14
y <- student[student$Gender == "F" & student$Age < 14, ]
y
Name Age Height Gender Weight
7 JACLYN 12 162 F 65.8
28 LOUISE 12 149 F 55.8
29 ALICE 13 149 F 48.6
33 BARBARA 13 147 F 50.8
35 KATIE 12 145 F 43.1
37 SUSAN 13 137 F 30.4
38 JANE 12 135 F 33.6
39 LILLIE 12 127 F 29.1
# Students with age greater than or equal to 16 or height greater than 168
z <- student[student$Age >= 16 | student$Height > 168, ]
z
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
14 MARTHA 16 159 F 50.8
23 LINDA 17 152 F 52.7
31 MARIAN 16 147 F 52.2
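The row selections above work because a comparison on a column yields a logical vector with one TRUE or FALSE per row. The sketch below makes that intermediate step visible. It uses a small hypothetical data frame, student_small, built from five rows of the student table so that it runs without the data file; subset() is a base R alternative to bracket indexing.

```r
# A five-row excerpt of the student data (hypothetical helper object).
student_small <- data.frame(
  Name   = c("LAWRENCE", "JEFFERY", "JACLYN", "MARTHA", "LILLIE"),
  Age    = c(17, 14, 12, 16, 12),
  Height = c(172, 169, 162, 159, 127),
  Gender = c("M", "M", "F", "F", "F")
)

# The condition is evaluated row by row and gives a logical vector.
is_tall_male <- student_small$Gender == "M" & student_small$Height >= 169
is_tall_male                       # TRUE TRUE FALSE FALSE FALSE

# Rows where the vector is TRUE are kept.
student_small[is_tall_male, ]

# subset() evaluates the condition inside the data frame, so the
# column names can be written without the student_small$ prefix.
subset(student_small, Age >= 16 | Height > 168)
```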
Name      Age  Height  Gender  Weight
LAWRENCE  17   172     M       78.1
JEFFERY   14   169     M       51.3
EDWARD    14   167     M       50.8
PHILLIP   16   167     M       58.1
KIRK      17   167     M       60.8
ROBERT    15   164     M       58.1
JACLYN    12   162     F       65.8
DANNY     15   162     M       48.1
CLAY      15   162     M       47.7
HENRY     14   159     M       54.0
LESLIE    14   159     F       64.5
JOHN      13   159     M       44.5
WILLIAM   15   159     M       50.4
MARTHA    16   159     F       50.8
LEWIS     14   157     M       41.8
AMY       15   157     F       50.8
ALFRED    14   157     M       44.9
CHRIS     14   157     M       44.9
FREDRICK  14   154     M       42.2
CAROL     14   154     F       38.1
JOE       13   154     M       47.7
MARY      15   152     F       41.8
LINDA     17   152     F       52.7
MARK      15   152     M       47.2
PATTY     14   152     F       38.6
ELIZABET  14   152     F       41.3
JUDY      14   149     F       36.8
LOUISE    12   149     F       55.8
ALICE     13   149     F       48.6
JAMES     12   149     M       58.1
MARIAN    16   147     F       52.2
TIM       12   147     M       38.1
BARBARA   13   147     F       50.8
DAVID     13   145     M       35.9
KATIE     12   145     F       43.1
MICHAEL   13   142     M       43.1
SUSAN     13   137     F       30.4
JANE      12   135     F       33.6
LILLIE    12   127     F       29.1
ROBERT    12   125     M       35.9
4 Print Formatting
4.1 Print Formatting
R uses the print() function to display the variables.
For example:
x <- 10
print(x)
[1] 10
y <- 'Hello world'
print(y)
[1] "Hello world"
R uses the paste() and paste0() functions to combine strings and variables for printing. paste() joins its arguments with a separator (a space by default), while paste0() joins them with no separator.
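A few short calls illustrate the difference. This is a sketch with made-up values; the variables name and n are only illustrative.

```r
name <- "world"
n <- 3

paste("Hello", name)                      # "Hello world"
paste0("file_", n, ".csv")                # "file_3.csv"
paste("a", "b", sep = "-")                # "a-b"

# Both functions are vectorized over their arguments.
paste0("V", 1:3)                          # "V1" "V2" "V3"

# collapse joins the elements of one vector into a single string.
paste(c("x", "y", "z"), collapse = ", ")  # "x, y, z"
```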
class() – identifies the class of a vector.
For example:
# Integer vector
num <- 1:10
num
[1] 1 2 3 4 5 6 7 8 9 10
class(num)
[1] "integer"
# Numeric vector: it contains a non-integer value, 10.5
num <- c(1:10, 10.5)
num
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 10.5
class(num)
[1] "numeric"
# Character vector
ltrs <- letters[1:10]
ltrs
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
class(ltrs)
[1] "character"
# Factor vector
fac <- as.factor(ltrs)
fac
[1] a b c d e f g h i j
Levels: a b c d e f g h i j
class(fac)
[1] "factor"
If we combine two vectors of different types, the result is coerced to a single common type. For example:
# Create a vector of numbers
numbers <- c(1,2,3,4,5,6)
class(numbers)
[1] "numeric"
# Create a vector of letters
ltrs <- c('a','b','c','d')
class(ltrs)
[1] "character"
# Concatenate the two vectors above
mixed_vec <- c(numbers, ltrs)
# mixed_vec has coerced the numbers to character
print(mixed_vec)
[1] "1" "2" "3" "4" "5" "6" "a" "b" "c" "d"
class(mixed_vec)
[1] "character"
As we can see, the numeric vector has become a character vector.
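Coercion inside c() follows a fixed order: logical is converted to integer, integer to double (numeric), and anything to character when a string is present. The following sketch shows each step with small literal vectors.

```r
class(c(TRUE, 2L))        # "integer":   TRUE becomes 1L
class(c(TRUE, 2L, 3.5))   # "numeric":   everything becomes double
class(c(2L, 3.5, "a"))    # "character": everything becomes a string

c(TRUE, 2L, 3.5)          # 1.0 2.0 3.5
c(TRUE, "a")              # "TRUE" "a"
```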
Basic operations on vectors are applied element by element.
For example:
x <- c(1,2,3,4)
y <- c(5,6,2,1)
# addition
x+y
[1] 6 8 5 5
# subtraction
y-x
[1] 4 4 -1 -3
# multiplication
x*y
[1] 5 12 6 4
# division
y/x
[1] 5.0000000 3.0000000 0.6666667 0.2500000
# exponentiation
y^x
[1] 5 36 8 1
# sum
sum(x)
[1] 10
# cumulative sums
cumsum(x)
[1] 1 3 6 10
# mean
mean(x)
[1] 2.5
# variance
var(x)
[1] 1.666667
# standard deviation
sd(x)
[1] 1.290994
Objects can be explicitly coerced from one class to another using the as.*() functions, such as as.numeric(), as.logical(), and as.character().
x <- 0:6
class(x)
[1] "integer"
as.numeric(x)
[1] 0 1 2 3 4 5 6
as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
If R cannot figure out how to coerce an object, NAs are produced.
x <- c('a','b','c')
class(x)
[1] "character"
as.numeric(x)
[1] NA NA NA
as.logical(x)
[1] NA NA NA
as.complex(x)
[1] NA NA NA
R objects can have attributes, which are metadata for the object. Examples of attributes include names, dimnames, dimensions (for matrices and arrays), class (e.g. integer, numeric), length, and other user-defined attributes.
Not all R objects have attributes, in which case the attributes() function returns NULL.
x <- c(1,2,3,5)
attributes(x)
NULL
y <- 1
attributes(y)
NULL
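For contrast, here is a sketch of objects whose attributes are not NULL. The objects v and m and the attribute name "unit" are hypothetical, chosen only to illustrate the idea.

```r
# A named vector carries a 'names' attribute.
v <- c(a = 1, b = 2, c = 3)
attributes(v)

# A matrix carries a 'dim' attribute (number of rows and columns).
m <- matrix(1:6, nrow = 2)
attributes(m)

# attr() sets or reads a single attribute, including user-defined ones.
attr(v, "unit") <- "cm"
attributes(v)
```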
5.2 Matrix
matrix() – creates a matrix;
nrow and ncol set the dimensions of the matrix;
byrow = TRUE fills the matrix by row, while byrow = FALSE (the default) fills it by column.
For example:
Create a matrix:
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
matrix(1:6, nrow =2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
matrix(1:6, nrow =2,byrow = T)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
# The elements of a matrix are not necessarily numeric.
# The 6 letters are recycled to fill the 4 x 3 = 12 cells.
m1 <- matrix(LETTERS[1:6], nrow = 4, ncol = 3)
m1
[,1] [,2] [,3]
[1,] "A" "E" "C"
[2,] "B" "F" "D"
[3,] "C" "A" "E"
[4,] "D" "B" "F"
gneder height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
str(BMI)
'data.frame': 3 obs. of 4 variables:
$ gneder: chr "Male" "Male" "Female"
$ height: num 152 172 165
$ weight: num 81 93 78
$ Age : num 42 38 26
# We can also create a data frame as follows:
name <- c('john','peter','patrick','julie','bob')
age <- c(28,30,31,38,35)
children <- c(FALSE,TRUE,TRUE,FALSE,TRUE)
df <- data.frame(Name = name, Age = age, Children = children)
df
Name Age Children
1 john 28 FALSE
2 peter 30 TRUE
3 patrick 31 TRUE
4 julie 38 FALSE
5 bob 35 TRUE
Extract specified elements.
df[3,2]
[1] 31
df[3,]
Name Age Children
3 patrick 31 TRUE
df[2]
Age
1 28
2 30
3 31
4 38
5 35
df$Age
[1] 28 30 31 38 35
df[3,'Age']
[1] 31
df['Age']
Age
1 28
2 30
3 31
4 38
5 35
df[c(3,5),c('Age','Children')]
Age Children
3 31 TRUE
5 35 TRUE
Add new elements.
# We can define a new column and add it to the original data frame.
height <- c(163,177,163,162,157)
df$Height <- height
df
Name Age Children Height
1 john 28 FALSE 163
2 peter 30 TRUE 177
3 patrick 31 TRUE 163
4 julie 38 FALSE 162
5 bob 35 TRUE 157
# We can also use cbind() to add a new column.
weight <- c(75,65,54,34,78)
cbind(df, weight)
Name Age Children Height weight
1 john 28 FALSE 163 75
2 peter 30 TRUE 177 65
3 patrick 31 TRUE 163 54
4 julie 38 FALSE 162 34
5 bob 35 TRUE 157 78
# Adding a new row is similar.
tom <- data.frame(Name = 'Tom', Age = 36, Children = FALSE, Height = 182)
rbind(df, tom)
Name Age Children Height
1 john 28 FALSE 163
2 peter 30 TRUE 177
3 patrick 31 TRUE 163
4 julie 38 FALSE 162
5 bob 35 TRUE 157
6 Tom 36 FALSE 182
Sort by a specific column.
# sort() returns only the sorted values of one column.
sort(df$Age) # Sort Age in ascending order.
[1] 28 30 31 35 38
ranks <- order(df$Age) # Row indices that put Age in ascending order.
ranks
[1] 1 2 3 5 4
df[ranks,] # Return the data frame sorted by Age in ascending order.
Name Age Children Height
1 john 28 FALSE 163
2 peter 30 TRUE 177
3 patrick 31 TRUE 163
5 bob 35 TRUE 157
4 julie 38 FALSE 162
Name Age Children Height
4 julie 38 FALSE 162
5 bob 35 TRUE 157
3 patrick 31 TRUE 163
2 peter 30 TRUE 177
1 john 28 FALSE 163
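The last output above lists the rows in descending order of Age. The following is a minimal sketch of one way to obtain that ordering, using the decreasing argument of order(). The data frame is rebuilt here so the block runs on its own, and desc_ranks is a name introduced only for illustration.

```r
df <- data.frame(
  Name     = c("john", "peter", "patrick", "julie", "bob"),
  Age      = c(28, 30, 31, 38, 35),
  Children = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  Height   = c(163, 177, 163, 162, 157)
)

# Row indices that put Age in descending order.
desc_ranks <- order(df$Age, decreasing = TRUE)
desc_ranks        # 4 5 3 2 1

# Use the indices to reorder the rows of the data frame.
df[desc_ranks, ]
```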
5.4 List
list() – creates a list.
A list can contain objects of different types;
A list is different from a vector, whose elements must all have the same type.
For example:
Create a list
# We create a list containing a numeric vector, a character vector, and a logical vector.
# Inside list(), elements are named with '=' rather than '<-'.
list1 <- list(x = seq(10,30,10), y = c('a','b','c'), z = c(TRUE,FALSE))
list1
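Once a list exists, its elements can be extracted in several ways. The sketch below assumes a list with named elements x, y, and z, and is rebuilt here so the block is self-contained. It shows the difference between [[ ]] or $, which return the element itself, and [ ], which returns a shorter list.

```r
list1 <- list(x = seq(10, 30, 10), y = c("a", "b", "c"), z = c(TRUE, FALSE))

names(list1)           # "x" "y" "z"

list1$x                # the numeric vector 10 20 30
list1[["y"]][2]        # second element of y: "b"

class(list1[["z"]])    # "logical": the element itself
class(list1["z"])      # "list": a list of length one

sapply(list1, length)  # length of every element
```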
dplyr – a package used to transform and summarize tabular data with rows and columns.
install.packages("dplyr")
library(dplyr)
There are some frequently-used built-in functions in dplyr, such as:
select()
filter()
arrange()
summarise()
mutate()
transmute()
group_by()
We will use the student data set mentioned above as an example to illustrate the convenience of dplyr. The structure of this data set is as follows:
library(dplyr)
student <- read.csv("./students.csv") # Read the data set
head(student) # Show only the first 6 rows
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
3 EDWARD 14 167 M 50.8
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
6 ROBERT 15 164 M 58.1
select() – Selects column variables based on their names.
# We select the column 'Height' in student.
s1 <- select(student, Height)
head(s1)
Height
1 172
2 169
3 167
4 167
5 167
6 164
# We select the column 'Weight' in student.
s2 <- select(student, Weight)
head(s2)
Weight
1 78.1
2 51.3
3 50.8
4 58.1
5 60.8
6 58.1
filter() – Filters rows based on their values.
# We extract all male students.
f1 <- filter(student, Gender == "M")
head(f1)
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
3 EDWARD 14 167 M 50.8
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
6 ROBERT 15 164 M 58.1
# We extract female students with height > 155.
f2 <- filter(student, Gender == "F", Height > 155)
head(f2)
Name Age Height Gender Weight
1 JACLYN 12 162 F 65.8
2 LESLIE 14 159 F 64.5
3 MARTHA 16 159 F 50.8
4 AMY 15 157 F 50.8
arrange() – Changes the ordering of the rows.
# We arrange the students in ascending order of Height.
a1 <- arrange(student, Height)
head(a1)
Name Age Height Gender Weight
1 ROBERT 12 125 M 35.9
2 LILLIE 12 127 F 29.1
3 JANE 12 135 F 33.6
4 SUSAN 13 137 F 30.4
5 MICHAEL 13 142 M 43.1
6 DAVID 13 145 M 35.9
# We arrange the students in descending order of Weight.
a2 <- arrange(student, desc(Weight))
head(a2)
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JACLYN 12 162 F 65.8
3 LESLIE 14 159 F 64.5
4 KIRK 17 167 M 60.8
5 PHILLIP 16 167 M 58.1
6 ROBERT 15 164 M 58.1
summarise() – Reduces multiple values down to a single summary.
# We want to obtain the average of Height
summarise(student, avg_Height = mean(Height, na.rm = TRUE))
avg_Height
1 153.25
# We want to obtain the sum of Height
summarise(student, tot_Height = sum(Height, na.rm = TRUE))
tot_Height
1 6130
# We want to obtain the standard deviation of Weight
summarise(student, stdev_Weight = sd(Weight, na.rm = TRUE))
stdev_Weight
1 10.07415
# We want to obtain the average and the sum of Weight simultaneously
summarise(student, avg_Weight = mean(Weight, na.rm = TRUE), tot_Weight = sum(Weight, na.rm = TRUE))
avg_Weight tot_Weight
1 47.6625 1906.5
mutate() – Creates columns that are functions of existing variables.
# We add a new column named 'BMI' to the original data set,
# where BMI = Weight / Height^2 (0.01*Height converts Height from cm to m).
student1 <- mutate(student, BMI = Weight / (0.01*Height)^2)
head(student1)
Name Age Height Gender Weight BMI
1 LAWRENCE 17 172 M 78.1 26.39941
2 JEFFERY 14 169 M 51.3 17.96156
3 EDWARD 14 167 M 50.8 18.21507
4 PHILLIP 16 167 M 58.1 20.83259
5 KIRK 17 167 M 60.8 21.80071
6 ROBERT 15 164 M 58.1 21.60173
transmute() – Creates new columns and keeps only those new columns.
# Same example as above.
student2 <- transmute(student, BMI = Weight / (0.01*Height)^2)
head(student2)
group_by() – Groups the data set by one or more columns. It is usually combined with other functions.
# We group student by Gender and calculate the sum and mean of Height for each group.
by_type <- group_by(student, Gender)
summarise(by_type, Height_sum = sum(Height), Height_mean = mean(Height))
# A tibble: 2 × 3
Gender Height_sum Height_mean
<chr> <int> <dbl>
1 F 2684 149.
2 M 3446 157.
As we can see, the student data set is divided into 2 groups by Gender.
8.2 Usage of pipe operator %>%
The pipe operator %>% (from the magrittr package, re-exported by dplyr) passes the output of the previous function as the first argument of the next function, which makes chained operations more convenient. Let’s see some examples.
Example 1: We group student by Gender and calculate the sum and mean of Height for each group.
# Without pipe %>%.
by_type <- group_by(student, Gender)
summarise(by_type, Height_sum = sum(Height), Height_mean = mean(Height))
# A tibble: 2 × 3
Gender Height_sum Height_mean
<chr> <int> <dbl>
1 F 2684 149.
2 M 3446 157.
# With pipe %>%.
student %>%
  group_by(Gender) %>%
  summarise(Height_sum = sum(Height), Height_mean = mean(Height))
# A tibble: 2 × 3
Gender Height_sum Height_mean
<chr> <int> <dbl>
1 F 2684 149.
2 M 3446 157.
Example 2: We select all male students, draw a random sample of 5 of them, and arrange the sample in descending order of Height.
# Without pipe %>%.
f <- filter(student, Gender == "M")
s <- sample_n(f, size = 5)
a <- arrange(s, desc(Height))
a
Name Age Height Gender Weight
1 EDWARD 14 167 M 50.8
2 PHILLIP 16 167 M 58.1
3 HENRY 14 159 M 54.0
4 ALFRED 14 157 M 44.9
5 JOE 13 154 M 47.7
gather() – Reshape data from wide format to long format.
# We gather the second column 'Face_1' through the fourth column 'Face_3'
# into two columns named 'Face' (key) and 'ResponseTime' (value).
long <- data %>% gather(Face, ResponseTime, Face_1:Face_3)
long
separate() – Split a single column into multiple columns.
# We split the column 'Face' of the 'long' data set into two columns
# and name them 'Target' and 'Number'.
long_separate <- long %>% separate(Face, c("Target", "Number"), sep = "_")
long_separate
ID Target Number ResponseTime
1 1 Face 1 411
2 2 Face 1 723
3 3 Face 1 325
4 4 Face 1 456
5 5 Face 1 579
6 6 Face 1 612
7 7 Face 1 709
8 8 Face 1 513
9 9 Face 1 527
10 10 Face 1 379
11 1 Face 2 123
12 2 Face 2 300
13 3 Face 2 400
14 4 Face 2 500
15 5 Face 2 600
16 6 Face 2 654
17 7 Face 2 789
18 8 Face 2 906
19 9 Face 2 413
20 10 Face 2 567
21 1 Face 3 1457
22 2 Face 3 1000
23 3 Face 3 569
24 4 Face 3 896
25 5 Face 3 956
26 6 Face 3 2345
27 7 Face 3 780
28 8 Face 3 599
29 9 Face 3 1023
30 10 Face 3 678
unite() – Combine multiple columns into a single column.
# We combine the columns 'Target' and 'Number' of the 'long_separate' data set
# into a single column named 'Face'.
long_unite <- long_separate %>% unite(Face, Target, Number, sep = "_")
long_unite
spread() – Takes two columns (key and value) and spreads them into multiple columns.
# We spread the column 'Face' of the 'long_unite' data set back into three
# columns 'Face_1', 'Face_2' and 'Face_3'.
back_to_data <- long_unite %>% spread(Face, ResponseTime)
back_to_data
R supports a variety of functions and packages for data visualization in exploratory data analysis.
plot() – A generic function for plotting R objects.
barplot() – Plots data using rectangular bars.
hist() – Creates histograms.
boxplot() – Represents data in the form of quartiles.
ggplot() – From the ggplot2 package, which enables users to create sophisticated visualizations with little code using the Grammar of Graphics.
plotly – A package that creates interactive web-based graphs via the open-source JavaScript library plotly.js.
Next, we mainly introduce the usage of plot(); barplot(), hist(), and boxplot() are introduced in the next section. The other two, ggplot2 and plotly, are left for students to explore.
We again use the student data set to illustrate the plot() function.
# We first draw a scatter plot of Weight against Height.
plot(student$Height, student$Weight,
     type = "p",   # type of plot: points
     pch = 16,     # shape of the points
     xlab = "Height", ylab = "Weight",
     xlim = c(120,175), ylim = c(20,80),
     main = "Height vs Weight",
     col = "red")
# We next draw a line plot of Height.
plot(student$Height,
     type = "l",   # type of plot: line
     lty = 1,      # line type: solid
     lwd = 3,      # line width
     xlab = "Student", ylab = "Height",
     main = "Height of Student",
     ylim = c(120,175),
     col = "blue")
# We draw Height and Weight as lines and Age as points.
# Standardization (Age is only centered, not divided by its standard deviation)
height <- (student$Height - mean(student$Height)) / sd(student$Height)
weight <- (student$Weight - mean(student$Weight)) / sd(student$Weight)
age <- student$Age - mean(student$Age)
plot(height,
     type = "l",   # type of plot: line
     lty = 2,      # line type: dashed
     lwd = 3,      # line width
     xlab = "Student", ylab = "Values",
     main = "Height, Weight and Age of Student",
     col = "blue")
lines(weight,
      type = "l",  # type of plot: line
      lty = 1,     # line type: solid
      lwd = 3,     # line width
      col = "red")
points(age, pch = 16, col = "brown")
# The legend uses lty = 2 for Height so that it matches the plotted line.
legend('topright',
       legend = c('Height','Weight','Age'),
       col = c('blue','red','brown'),
       lty = c(2,1,0), pch = c(NA,NA,16),
       lwd = c(3,3,2), ncol = 1)
10 Data Description
10.1 Data Set
10.1.1 Data Table
In statistical analysis, the data are often descriptions of several individuals (objects), and each individual is described by a number of indicators (characteristics) of interest. For example, suppose we survey the students in a class and denote the \(n\) indicators of the \(i\)-th student as \[
x_{i1},x_{i2},\cdots ,x_{in}.
\] The records of \(m\) students then contain a total of \(m \times n\) values, which can be recorded in a table:
\[
\begin{array}{c|cccc}
\text{Number} & \text{variable } 1 & \text{variable } 2 & \cdots & \text{variable } n \\
\hline
1 & x_{11} & x_{12} & \cdots & x_{1n} \\
2 & x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
m & x_{m1} & x_{m2} & \cdots & x_{mn}
\end{array}
\]
The following table lists the Name, Age, Height, Gender, and Weight of a class of 40 students (Table 1.1-2 in the textbook); its first six rows are:
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
3 EDWARD 14 167 M 50.8
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
6 ROBERT 15 164 M 58.1
10.1.2 Data File
The data recording each attribute of each observation are stored on a computer in a data file with a specific format. Such data have different names in different contexts:
In mathematics, it is a matrix;
In databases, it is a table;
In statistics, it is a record of observed values;
In R, it is a data frame.
Each row of this matrix is called a record or observation; it records the values of the various characteristics of one observed object.
Each column of the matrix is called a field or variable; it records the values of one characteristic for all observed objects.
Generally, data in the same column are required to have the same type. For example, a column indicating Gender can only contain the characters M or F, while the Height column can only contain numbers.
10.1.3 Table Merging
In statistical analysis, it is often necessary to collect the individual characteristics of the research objects into one table. In R, there are three commonly used functions for merging data frames.
rbind() – merges data frames with the same variables vertically.
data1 <- student[1:5,]
data1
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
3 EDWARD 14 167 M 50.8
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
data2 <- student[9:14,]
data2
Name Age Height Gender Weight
9 CLAY 15 162 M 47.7
10 HENRY 14 159 M 54.0
11 LESLIE 14 159 F 64.5
12 JOHN 13 159 M 44.5
13 WILLIAM 15 159 M 50.4
14 MARTHA 16 159 F 50.8
data3 <- rbind(data1, data2)
data3
Name Age Height Gender Weight
1 LAWRENCE 17 172 M 78.1
2 JEFFERY 14 169 M 51.3
3 EDWARD 14 167 M 50.8
4 PHILLIP 16 167 M 58.1
5 KIRK 17 167 M 60.8
9 CLAY 15 162 M 47.7
10 HENRY 14 159 M 54.0
11 LESLIE 14 159 F 64.5
12 JOHN 13 159 M 44.5
13 WILLIAM 15 159 M 50.4
14 MARTHA 16 159 F 50.8
cbind() – merges data frames with the same number of rows horizontally.
Name Age Height Gender Weight Grades Level
1 LAWRENCE 17 172 M 78.1 88 A
2 JEFFERY 14 169 M 51.3 78 B
3 EDWARD 14 167 M 50.8 90 A
4 PHILLIP 16 167 M 58.1 65 C
5 KIRK 17 167 M 60.8 73 B
merge() – merges data frames based on common variables.
Name <- c("LAWRENCE","JEFFERY", "EDWARD", "PHILLIP", "KIRK")
Grades <- c(88,78,90,65,73)
Level <- c("A","B","A","C","B")
data6 <- data.frame(Name, Grades, Level)
data6
Name Grades Level
1 LAWRENCE 88 A
2 JEFFERY 78 B
3 EDWARD 90 A
4 PHILLIP 65 C
5 KIRK 73 B
data7 <- merge(data1, data6, by = "Name")
data7
Name Age Height Gender Weight Grades Level
1 EDWARD 14 167 M 50.8 90 A
2 JEFFERY 14 169 M 51.3 78 B
3 KIRK 17 167 M 60.8 73 B
4 LAWRENCE 17 172 M 78.1 88 A
5 PHILLIP 16 167 M 58.1 65 C
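In the example above every name appears in both data frames. When the keys only partly overlap, the all, all.x, and all.y arguments of merge() control which rows are kept. The following is a sketch with two small hypothetical data frames, left and right:

```r
left  <- data.frame(Name = c("LAWRENCE", "JEFFERY", "EDWARD"),
                    Age  = c(17, 14, 14))
right <- data.frame(Name   = c("JEFFERY", "EDWARD", "KIRK"),
                    Grades = c(78, 90, 73))

# Default: keep only the names present in both data frames.
inner <- merge(left, right, by = "Name")
inner

# all = TRUE: keep every name; missing values are filled with NA.
outer <- merge(left, right, by = "Name", all = TRUE)
outer

# all.x = TRUE: keep every row of the first data frame.
left_only <- merge(left, right, by = "Name", all.x = TRUE)
left_only
```

Note that merge() sorts the result by the key column by default, which is why the rows of data7 above appear in alphabetical order of Name.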
10.1.4 Special Value
During data collection, special values sometimes occur for various reasons. In R, there are three special values:
Inf – infinity;
NaN – Not a Number;
NA – Not Available (missing value).
Some calculations involving these special values:
2/0;2/Inf;exp(-Inf);Inf-Inf;Inf/Inf;0/0;
[1] Inf
[1] 0
[1] 0
[1] NaN
[1] NaN
[1] NaN
is.na(), is.nan() – identify NA and NaN in a data set.
x <- c(5,6,8,NA,23,NA,9,2)
is.na(x) # Identify missing values
[1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
which(is.na(x)) # Identify the index of missing value
[1] 4 6
x[!is.na(x)] # Remove missing values
[1] 5 6 8 23 9 2
sum(is.na(x)) # Count missing values
[1] 2
y <- c(6,6,NaN,0,23,3,9,NaN)
is.nan(y) # Identify NaN
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
which(is.nan(y)) # Identify the index of NaN
[1] 3 8
y[!is.nan(y)] # Remove NaN
[1] 6 6 0 23 3 9
sum(is.nan(y)) # Count NaN
[1] 2
na.omit() and the na.rm argument – remove missing values (NA and NaN) before a calculation.
mean(x, na.rm =TRUE)
[1] 8.833333
mean(y, na.rm =TRUE)
[1] 7.833333
mean(na.omit(x))
[1] 8.833333
mean(na.omit(y))
[1] 7.833333
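The reason na.rm = TRUE and na.omit() also drop NaN is that is.na() treats NaN as missing, whereas is.nan() is TRUE only for NaN. A short sketch with an illustrative vector v containing all three special values:

```r
v <- c(1, NA, NaN, Inf)

is.na(v)       # FALSE  TRUE  TRUE FALSE  (NA and NaN)
is.nan(v)      # FALSE FALSE  TRUE FALSE  (NaN only)
is.finite(v)   #  TRUE FALSE FALSE FALSE  (excludes NA, NaN and Inf)

# Keep only ordinary finite numbers before computing a statistic.
mean(v[is.finite(v)])   # 1
```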
10.1.5 Type of Variables
According to their meaning and level of measurement, variables can be divided into quantitative and qualitative: \[
\text{Quantitative}=
\begin{cases}
\text{Continuous} \\
\text{Discrete}
\end{cases}
\qquad
\text{Qualitative} =
\begin{cases}
\text{Ordinal} \\
\text{Nominal}
\end{cases}
\]
Quantitative: height, weight, age, etc. The value of such a variable is a quantity, and it is meaningful to perform arithmetic operations on its values.
Continuous: The possible values of such variables can fill an entire interval, such as height and weight; these are also called interval or real variables.
Discrete: This type of variable can only take a finite or countable number of values, such as age.
Qualitative: gender, variety, province, etc. Different values of such variables have different meanings, and arithmetic operations on them are impossible or meaningless.
Nominal: There is no natural order among the values of such variables, such as gender or province.
Ordinal: The values have a natural order, such as size (extra large, large, medium, small).
Discrete, nominal, and ordinal variables are collectively referred to as categorical.
10.2 Frequency
10.2.1 Frequency Statistics
A data table contains observations of multiple variables; we start with the analysis of a single variable. Let the results of \(n\) observations of a variable be \[x_1, x_2, \cdots, x_n.\] We first need to know which values are taken and the proportions of the different values, that is, the distribution of these observations.
In addition, we can compute frequency statistics for combinations of values of multiple variables.
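As a sketch of how such frequency statistics can be computed in base R, the block below applies table() and prop.table() to the Age column of the 40 students. The ages are typed in directly from the student table so the block runs without the data file; student_age and age_freq are names introduced here for illustration, not objects defined in the original text.

```r
# Ages of the 40 students, in the order of the student table.
student_age <- c(17, 14, 14, 16, 17, 15, 12, 15, 15, 14,
                 14, 13, 15, 16, 14, 15, 14, 14, 14, 14,
                 13, 15, 17, 15, 14, 14, 14, 12, 13, 12,
                 16, 12, 13, 13, 12, 13, 13, 12, 12, 12)

freq <- table(Age = student_age)   # frequency of each distinct age
freq

prop <- prop.table(freq)           # proportion of each age
round(100 * prop, 1)               # as percentages

cumsum(freq)                       # cumulative frequency

# A data frame with columns 'Age' and 'Freq', convenient for plotting.
age_freq <- as.data.frame(freq)
age_freq
```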
10.2.2 Bar Plot
In addition to recording frequencies in a table, we can also use bar plots, in which columns of equal width are placed side by side and the height of each column represents the frequency.
# Vertical bar plot of age
col <- rainbow(6)
barplot(Freq ~ Age, data = age,
        xlab = "Age", ylab = "Frequency",
        col = col, border = "black",
        main = "Vertical bar plot of Age distribution")
# Horizontal bar plot of age
# With horiz = TRUE the bars run along the x-axis, so the axis labels are swapped.
col <- rainbow(6)
barplot(Freq, names.arg = Age,
        xlab = "Frequency", ylab = "Age",
        col = col, border = "black",
        horiz = TRUE,
        main = "Horizontal bar plot of Age distribution")
# Pie plot
Age <- as.character(Age) # Convert numbers to characters
age <- data.frame(Age, Freq)
col <- rainbow(6)
pie(age$Freq, labels = age$Freq, radius = 0.9,
    main = "Pie plot of Age distribution", col = col)
legend("topright", age$Age, cex = 0.8, fill = col)
10.2.3 Histogram
If the weights are appropriately grouped by value and the frequencies are counted, the distribution of the variable's values across different ranges can be obtained.
Weight  Frequency  Percent  Cumulative Frequency  Cumulative Percent
24~32   2          5%       2                     5%
32~40   7          17.5%    9                     22.5%
40~48   12         30%      21                    52.5%
48~56   12         30%      33                    82.5%
56~64   4          10%      37                    92.5%
64~72   2          5%       39                    97.5%
72~80   1          2.5%     40                    100%
The most commonly used plot for showing the frequencies of an interval variable after grouping is the histogram:
The width of each column in the histogram is the width of the corresponding group;
The area of each column is the proportion of observations falling into that group, so the height of the column is that proportion divided by the group width.
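The relation between area, proportion, and height can be worked through with the frequency table of Weight given above. In this sketch the vector names are illustrative; the resulting density values are the column heights that hist() draws when freq = FALSE.

```r
breaks <- c(24, 32, 40, 48, 56, 64, 72, 80)   # group boundaries
counts <- c(2, 7, 12, 12, 4, 2, 1)            # frequencies from the table

widths     <- diff(breaks)          # width of each group (8 for all groups)
proportion <- counts / sum(counts)  # area of each column
density    <- proportion / widths   # height of each column

data.frame(lower = head(breaks, -1), upper = tail(breaks, -1),
           counts, proportion, density)

# The areas of all columns add up to 1.
sum(density * widths)
```

For the first group, for instance, the proportion is 2/40 = 0.05 and the height is 0.05/8 = 0.00625.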
# Histogram of weight with groups of width 8
break1 <- c(24,32,40,48,56,64,72,80)
col <- rainbow(length(break1))
hist(student$Weight, breaks = break1, freq = FALSE,
     col = col, border = "black",
     xlab = "Weight", ylab = "Density",
     main = "Histogram of weight")
# Histogram of weight with groups of width 5
break2 <- seq(20,80,5)
col <- rainbow(length(break2))
hist(student$Weight, breaks = break2, freq = FALSE,
     col = col, border = "black",
     xlab = "Weight", ylab = "Density",
     main = "Histogram of weight")
# Histogram of weight with groups of width 10
break3 <- seq(20,80,10)
col <- rainbow(length(break3))
hist(student$Weight, breaks = break3, freq = FALSE,
     col = col, border = "black",
     xlab = "Weight", ylab = "Density",
     main = "Histogram of weight")
# Histogram of weight with groups of width 2.5
break4 <- seq(29,79,2.5)
col <- rainbow(length(break4))
hist(student$Weight, breaks = break4, freq = FALSE,
     col = col, border = "black",
     xlab = "Weight", ylab = "Density",
     main = "Histogram of weight")
10.3 Moments Type
10.3.1 Summary of common statistics
Common statistics can be classified by how they are constructed:
Based on moments of the observations: mean, variance, etc.;
Based on order statistics of the observations: median, range, quantiles, etc.
They can also be classified by the characteristic they describe:
Describing the center of the distribution: mean, median, etc.;
Describing the degree of dispersion: variance, range, etc.;
Other statistics describing the distribution and its shape.
10.3.2 Mean
Given a set of observations \(x_1,x_2, \cdots, x_n\), where \(n\) is the sample size, the mean is \[
\bar x = \frac{1}{n}\sum_{i=1}^{n}x_i,
\] which describes the central position of this set of observations.
10.3.3 Variance and Standard Deviation
The variance has two forms, \[
s^{*2} = s^{*2}(x) = s^{*2}_n(x)= \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2,
\]\[
s^{2} = s^{2}(x) = s^{2}_n(x)= \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2,
\] both of which describe the dispersion of the observations. \(s^{2}_n(x)\) is the biased variance and \(s^{*2}_n(x)\) is the unbiased variance.
Moreover:
Standard deviation: \(s^{*}_n(x) = \sqrt{s^{*2}_n(x)} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2}\);
Standard error of the mean: \(\frac{s^{*}}{\sqrt{n}}\);
Coefficient of variation: \(cv = \frac{s^{*}}{\bar x} \cdot 100\%\).
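The difference between the two divisors is easy to check numerically. The following is a minimal sketch on a made-up vector `x` (not the student data); `var_unbiased` and `var_biased` are illustrative names. Note that R's `var()` and `sd()` use the divisor \(n-1\).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical observations
n <- length(x)

var_unbiased <- sum((x - mean(x))^2) / (n - 1)  # divisor n - 1
var_biased   <- sum((x - mean(x))^2) / n        # divisor n

all.equal(var_unbiased, var(x))       # var() uses the divisor n - 1
all.equal(sqrt(var_unbiased), sd(x))  # sd() is its square root
var_biased                            # 4
sd(x) / sqrt(n)                       # standard error of the mean
sd(x) / mean(x)                       # coefficient of variation (as a ratio)
```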
If \(y_i = a x_i +b\), \(1\leq i \leq n\), then \[
\bar y = a \bar x + b, \quad s^{*2}(y) = a^2 s^{*2}(x), \quad s^{*}(y) = |a|s^{*}(x).
\] For \(x = (x_1,x_2, \cdots, x_n)\), the transformation \[
y_i = \frac{x_i-\bar x}{s^{*}(x)}
\] is called the standardization of \(x\), and \(y_i\) is also called the standard score. After standardization, the mean is 0 and the variance is 1.
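These properties of a linear transformation, and the effect of standardization, can be verified with a short sketch. The vector `x` and the coefficients `a` and `b` are made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical observations
a <- -3; b <- 10                 # hypothetical coefficients
y <- a * x + b

all.equal(mean(y), a * mean(x) + b)   # the mean is scaled and shifted
all.equal(var(y), a^2 * var(x))       # the variance is scaled by a^2
all.equal(sd(y), abs(a) * sd(x))      # the standard deviation by |a|

z <- (x - mean(x)) / sd(x)            # standardization (standard scores)
c(mean = mean(z), var = var(z))       # approximately 0 and 1
all.equal(z, as.vector(scale(x)))     # scale() performs the same computation
```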
10.3.4 Coefficients of Skewness and Kurtosis
Two further statistics based on moments are:
Coefficient of skewness, or skewness: \[
g_1 = \frac{1}{ns^3}\sum_{i=1}^n(x_i-\bar x)^3,
\] or, in its bias-adjusted form, \[
g_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^n \left( \frac{x_i-\bar x}{s^{*}} \right)^3.
\]
Coefficient of kurtosis, or kurtosis: \[
g_2 = \frac{1}{ns^4}\sum_{i=1}^n(x_i-\bar x)^4-3,
\] or, in its bias-adjusted form, \[
g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^n \left( \frac{x_i-\bar x}{s^{*}} \right)^4-3\frac{(n-1)^2}{(n-2)(n-3)}.
\]
Notice that:
Skewness describes the symmetry of the data about its center, the mean;
A skewness close to 0 indicates that the distribution is roughly symmetric about its mean;
A positive (negative) skewness coefficient indicates that the data have a longer tail on the right (left).
Kurtosis describes the tails of the data distribution relative to those of the normal distribution;
A kurtosis close to 0 indicates that the tails resemble those of the normal distribution;
A positive (negative) kurtosis indicates that the distribution has thicker (thinner) tails on both sides than the normal distribution.
10.3.5 Calculation with R
We use the Weight data to calculate the above statistics with R:
weight <- student$Weight
# mean
mean(weight)
[1] 47.6625
# variance (unbiased, divisor n - 1)
var(weight)
[1] 101.4886
# standard deviation
sd(weight)
[1] 10.07415
# coefficient of variation
sd(weight) / mean(weight)
[1] 0.2113643
library(moments) # skewness and kurtosis
library(plotrix) # standard error of mean
# standard error of mean
std.error(weight)
[1] 1.592863
# skewness
skewness(weight)
[1] 0.5939948
# kurtosis (moments::kurtosis() does not subtract 3, so the excess kurtosis g_2 is about 0.70)
kurtosis(weight)
[1] 3.697771
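The package results can be cross-checked with base R alone. The sketch below assumes that `skewness()` and `kurtosis()` use the moment (divisor \(n\)) definitions, which is consistent with the output shown, and that `std.error()` divides `sd()` by \(\sqrt{n}\). The weights are copied from the student table; names such as `m2`, `skew` and `kurt_raw` are illustrative.

```r
# Weights of the 40 students, copied from the student table
weight <- c(78.1, 51.3, 50.8, 58.1, 60.8, 58.1, 65.8, 48.1, 47.7, 54.0,
            64.5, 44.5, 50.4, 50.8, 41.8, 50.8, 44.9, 44.9, 42.2, 38.1,
            47.7, 41.8, 52.7, 47.2, 38.6, 41.3, 36.8, 55.8, 48.6, 58.1,
            52.2, 38.1, 50.8, 35.9, 43.1, 43.1, 30.4, 33.6, 29.1, 35.9)

n  <- length(weight)
d  <- weight - mean(weight)   # deviations from the mean
m2 <- mean(d^2)               # second central moment (biased variance)
m3 <- mean(d^3)               # third central moment
m4 <- mean(d^4)               # fourth central moment

se       <- sd(weight) / sqrt(n)  # standard error of the mean
skew     <- m3 / m2^(3 / 2)       # moment coefficient of skewness
kurt_raw <- m4 / m2^2             # kurtosis without subtracting 3
kurt_exc <- kurt_raw - 3          # excess kurtosis, comparable with g_2

round(c(se = se, skewness = skew, kurtosis = kurt_raw, excess = kurt_exc), 4)
```

The excess value is the one to compare with 0 when judging tail thickness against the normal distribution.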
10.4 Order Statistic Type
10.4.1 Order Statistic
Let \(x_1,x_2, \cdots, x_n\) be a set of observations. Arranging them in increasing order gives \[
x_{1,n} \leq x_{2,n} \leq \cdots \leq x_{n,n},
\] that is, \[
\begin{cases}
x_{(1)} = x_{1,n} = \min\{ x_1, x_2, \cdots, x_n\} \\
x_{(2)} = x_{2,n} = \mathop{\min}\limits_{1\leq i < j \leq n}\max\{ x_i, x_j\}=\mathop{\max}\limits_{1\leq i_1 < \cdots < i_{n-1} \leq n}\min\{ x_{i_1}, \cdots, x_{i_{n-1}}\}\\
\cdots \cdots \\
x_{(n)} = x_{n,n} = \max\{ x_1, x_2, \cdots, x_n\}
\end{cases}
\] \((x_{(1)}, x_{(2)}, \cdots, x_{(n)})\) are called the order statistics of these observations.
10.4.2 Median
For a set of observations \(x_1,x_2, \cdots, x_n\), the median is defined as \[
m=
\begin{cases}
x_{\frac{n+1}{2},n}, ~n~\text{odd}, \\
\frac{1}{2}\left(x_{\frac{n}{2},n}+x_{\frac{n}{2}+1,n}\right), ~n~\text{even},
\end{cases}
\] which also describes the center of the distribution: roughly half of the observations are greater than the median and roughly half are less. It is easy to calculate and is not affected by extreme values (robustness).
10.4.3 Range
For observations \(x_1,x_2, \cdots, x_n\), the range is defined as \[
r = x_{n,n}-x_{1,n} = \mathop{\max}\limits_{1\leq i \leq n} x_i - \mathop{\min}\limits_{1\leq i \leq n} x_i,
\] which is often used to describe the dispersion of the observations.
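The definitions of the order statistics, the median and the range translate directly into R. This is a minimal sketch on a made-up vector `x`; the names `xs`, `m` and `r` are illustrative, and in practice `median()` and `range()` give the same results.

```r
x <- c(7, 2, 9, 4, 5, 1)      # hypothetical observations
n <- length(x)
xs <- sort(x)                 # order statistics x_(1) <= ... <= x_(n)

# Median from its definition (n odd or even)
m <- if (n %% 2 == 1) xs[(n + 1) / 2] else (xs[n / 2] + xs[n / 2 + 1]) / 2
# Range from its definition
r <- xs[n] - xs[1]

xs                            # 1 2 4 5 7 9
c(median = m, range = r)      # 4.5 and 8
c(median(x), diff(range(x)))  # the built-in functions agree
```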
10.4.4 Empirical Distribution
For observations \(x_1,x_2, \cdots, x_n\), the discrete distribution \[
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
\frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n}
\end{pmatrix}
\] is called the empirical distribution. Its distribution function, the empirical distribution function, is \[
\widehat F_n(x)=
\begin{cases}
0,~ x < x_{1,n}\\
\frac{k}{n}, ~x_{k,n} \leq x < x_{k+1,n},~ 1\leq k \leq n-1\\
1,~x\ge x_{n,n}
\end{cases}
=\frac{1}{n}\sum_{i=1}^n I_{\{ x \ge x_i \}},
\] where \(I_{\{ x \ge x_i \}} = I_{[x_i, \infty)}(x)\) is the indicator function of the event \(\{x \ge x_i\}\), that is, \[
I_{\{ x \ge x_i \}} = I_{[x_i, \infty)}(x)=
\begin{cases}
1,~x\ge x_i \\
0,~ \text{otherwise}.
\end{cases}
\] Notice that:
The mean and the (biased) variance are the mean and variance of the empirical distribution, respectively;
All moment-type statistics are based on moments of the empirical distribution.
The following plot shows the empirical distribution of the Weight data.
The empirical distribution completely describes the distribution of the observations and is also an important tool in statistical inference and large-sample theory.
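In R, the empirical distribution function is available through `ecdf()`. The sketch below uses a made-up vector `x` to show that `ecdf()` agrees with the indicator formula and that the moments of the empirical distribution are the mean and the biased variance; `emp_mean` and `emp_var` are illustrative names.

```r
x <- c(7, 2, 9, 4, 5, 1)          # hypothetical observations
Fn <- ecdf(x)                     # empirical distribution function

Fn(4)                             # 0.5: three of six observations are <= 4
mean(x <= 4)                      # same value from the indicator formula
Fn(c(0, 1, 9, 10))                # 0, 1/6, 1, 1
# plot(Fn) draws the corresponding step function

# Moments of the empirical distribution (each x_i has probability 1/n)
p <- rep(1 / length(x), length(x))
emp_mean <- sum(p * x)                  # equals mean(x)
emp_var  <- sum(p * (x - emp_mean)^2)   # equals the biased variance
```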
10.4.5 Quantile
For \(p \in (0,1)\), the \(p\)-quantile of a distribution \(F\) is the value \(\zeta_p\) satisfying \(F(\zeta_p)=p\). For the empirical distribution, a solution of \(\widehat F_n(x) = p\) may not exist, or may not be unique when it exists, because \(\widehat F_n\) is a step function: it is neither continuous nor strictly increasing.
For observations \(x_1,x_2, \cdots, x_n\), the sample \(p\)-quantile is therefore chosen as a value that:
Makes the number of observations less than the \(p\)-quantile approximately \(np\);
Makes the number of observations larger than the \(p\)-quantile approximately \(n(1-p)\);
Is close to \(x_{[np],n}\) and to the \(p\)-quantile of the empirical distribution \(\widehat F_n\).
In practice, there are several ways to calculate the \(p\)-quantile, but their values are close when the sample size is large. In R, the default formula is \[
z_p = (1-g)x_{j,n} + gx_{j+1,n}
\] where \(j = [(n-1)p]+1\), \(g = (n-1)p-j+1\), and \([\cdot]\) denotes the integer part. (More methods can be found in textbook 1.1.17.)
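The interpolation formula can be written out by hand and compared with `quantile()`, whose default is `type = 7`. The helper name `sample_quantile` and the vector `x` are assumptions made for this sketch.

```r
# Sample p-quantile following z_p = (1 - g) x_(j) + g x_(j+1)
sample_quantile <- function(x, p) {   # hypothetical helper name
  xs <- sort(x)
  n  <- length(xs)
  h  <- (n - 1) * p
  j  <- floor(h) + 1                  # j = [(n - 1) p] + 1
  g  <- h - j + 1                     # g = (n - 1) p - j + 1
  (1 - g) * xs[j] + g * xs[min(j + 1, n)]
}

x <- c(7, 2, 9, 4, 5, 1)              # hypothetical observations
sample_quantile(x, 0.25)              # 2.5
quantile(x, 0.25, names = FALSE)      # same value from R's default (type = 7)
```

For this `x`, \((n-1)p = 1.25\), so \(j = 2\), \(g = 0.25\), and the result is \(0.75 \cdot 2 + 0.25 \cdot 4 = 2.5\).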
10.4.6 Quartile and Interquartile Range
Quartiles and percentiles are the most commonly used among the various quantiles.
The \(\frac{i}{4}\)-quantile is called the \(i\)-th quartile, denoted \(q_i\);
The 0.25-quantile \(q_1\) is called the lower quartile;
The 0.75-quantile \(q_3\) is called the upper quartile;
The \(\frac{i}{100}\)-quantile is called the \(i\)-th percentile.
Moreover, the interquartile range (IQR), defined as \[
IQR = q_3-q_1,
\] is also used to describe the dispersion of the distribution.
10.4.7 Box plot
Box plots are often used to display quantiles and thereby describe a distribution.
A box plot consists of a rectangular box with whiskers on both sides;
The upper (lower) side of the box is the upper (lower) quartile, so the height of the box is the interquartile range;
The horizontal line inside the box corresponds to the median;
The whiskers extend from each edge of the box to the farthest data point lying within 1.5 times the IQR of that edge;
The top (bottom) line at the end of each whisker therefore corresponds to the maximum (minimum) value when no observation lies beyond this range;
Extreme data lying outside this range are drawn as individual points.
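The numbers behind a box plot can be inspected with `boxplot.stats()`, which `boxplot()` uses internally. The sketch below is based on a made-up vector with one extreme value. Note that `boxplot.stats()` takes the box edges from Tukey's hinges, which are close to, but not always identical with, the quartiles returned by `quantile()`.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical data with one extreme value
bs <- boxplot.stats(x, coef = 1.5)

bs$stats   # 1.0 3.0 5.5 8.0 9.0: lower whisker, box edges with median, upper whisker
bs$out     # 30: drawn as an individual point beyond the upper whisker

# Fences at 1.5 times the box height beyond the box edges
box   <- bs$stats[c(2, 4)]
fence <- box + c(-1.5, 1.5) * diff(box)
fence      # -4.5 15.5
```

The upper whisker stops at 9, the largest observation inside the fence, rather than at the fence value 15.5 itself.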
We also use the student dataset to show some boxplots:
# Weight and Height
Weight <- student$Weight
Height <- student$Height
# Standardization
weight <- (Weight - mean(Weight)) / sd(Weight)
height <- (Height - mean(Height)) / sd(Height)
col <- rainbow(2)
boxplot(weight, height, names = c("Weight", "Height"), col = col,
        main = "Weight and Height for students")

# Weight for different genders
col <- rainbow(2)
boxplot(Weight ~ Gender, data = student, col = col, main = "Weight with different gender")

# Weight and Height for different genders
gender <- student$Gender
boxplot(weight ~ gender, col = "red",
        boxwex = 0.3,    # set width of box
        at = 1:2 + 0.2,  # set position of box on x-axis
        ylab = "Values", main = "Weight and Height for different gender")
boxplot(height ~ gender, col = "blue", boxwex = 0.3, at = 1:2 - 0.2, add = TRUE)
legend("topright", legend = c("Height", "Weight"), col = c("blue", "red"), pch = c(15, 15))
10.4.8 Calculation with R
We also use Weight data to calculate above statistics with R:
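A minimal sketch of such a calculation with base R functions follows. The choice of functions is an assumption and may differ from the calls used originally; the weights are copied from the student table.

```r
# Weights of the 40 students, copied from the student table
weight <- c(78.1, 51.3, 50.8, 58.1, 60.8, 58.1, 65.8, 48.1, 47.7, 54.0,
            64.5, 44.5, 50.4, 50.8, 41.8, 50.8, 44.9, 44.9, 42.2, 38.1,
            47.7, 41.8, 52.7, 47.2, 38.6, 41.3, 36.8, 55.8, 48.6, 58.1,
            52.2, 38.1, 50.8, 35.9, 43.1, 43.1, 30.4, 33.6, 29.1, 35.9)

median(weight)                    # median: 47.7
diff(range(weight))               # range = max - min: 49
q <- quantile(weight, c(0.25, 0.75), names = FALSE)
q                                 # lower and upper quartiles: 41.675 52.325
IQR(weight)                       # interquartile range q3 - q1: 10.65
quantile(weight, c(0.05, 0.95))   # 5th and 95th percentiles
```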
For a set of observations \(x_1,x_2,...,x_n\), let the distinct observed values and their frequencies be \[
\begin{pmatrix}
x_1 & x_2 & \cdots & x_k \\
n_1 & n_2 & \cdots & n_k
\end{pmatrix}
\] Then \(x_{j_0}\) is called the mode if \(n_{j_0} = \max\{n_1, n_2, \cdots, n_k\}\); it is the value that occurs most often among the observations.
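Base R has no function that returns the statistical mode directly (`mode()` reports the storage mode of an object), so it is usually read off a frequency table. The following is a small sketch on made-up data; `tab` and `mode_value` are illustrative names.

```r
x <- c(3, 1, 2, 3, 2, 3, 5)        # hypothetical observations
tab <- table(x)                    # frequencies n_1, ..., n_k of the distinct values
tab
mode_value <- as.numeric(names(tab)[tab == max(tab)])
mode_value                         # 3, which occurs three times
```

If several values share the maximum frequency, `mode_value` contains all of them.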
10.5.2 Coefficient of Correlation
In statistical analysis, we often need to analyze the relationship between different variables. For continuous variables, the Pearson correlation coefficient is the most commonly used measure of the linear correlation between variables.
For a set of observations \((x_1,y_1),(x_2,y_2),...,(x_n,y_n)\) of random variables \(X,Y\) with sample size \(n\), the Pearson correlation coefficient \(r\) is \[
r = r(x,y) = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2 \sum_{i=1}^n(y_i-\bar y)^2}}.
\]
The value of the correlation coefficient lies in \([-1,1]\);
If the correlation coefficient is positive (negative), \(x\) and \(y\) are said to be positively (negatively) correlated;
The larger the absolute value of the correlation coefficient, the stronger the linear relationship between \(x\) and \(y\);
The correlation coefficient is unchanged by linear transformations whose slopes have the same sign, that is, if \[
\tilde x_i = a x_i +b, \quad \tilde y_i = c y_i +d, \quad ac > 0, \quad 1 \leq i \leq n,
\] then \[
r(\tilde x, \tilde y) = r(x,y).
\] If \(ac < 0\), the sign of \(r\) is reversed.
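The formula for \(r\) and its behaviour under linear transformations can be checked with a short sketch. The paired vectors `x` and `y` are made up for illustration; `cor()` computes the Pearson coefficient by default. The value of \(r\) is preserved when the two slopes have the same sign and changes sign when they have opposite signs.

```r
x <- c(1, 2, 3, 4, 5)              # hypothetical paired observations
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation coefficient from its definition
r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
all.equal(r, cor(x, y))                    # agrees with cor()

all.equal(cor(2 * x + 1, 3 * y - 4), r)    # slopes with the same sign: r unchanged
all.equal(cor(-2 * x + 1, 3 * y - 4), -r)  # slopes with opposite signs: sign flips
```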