Universität Leipzig AI & ML in Finance: Lesson Notes 2 – 8
Lesson 2: Introduction
Data Sample:
Training set: Train model using this set
Validation set: Validate and choose model using this set
Test set: Make an estimate of the accuracy of our model using this set
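The three-way split above can be sketched in R. This is a minimal illustration on a toy data frame, not from the course script; the 60/20/20 proportions and all object names (df, trainSet, validSet, testSet) are assumptions.

```r
# Minimal sketch of a random 60/20/20 train/validation/test split
set.seed(42)                              # reproducible shuffling
df  <- data.frame(x = 1:100, y = rnorm(100))   # toy data
n   <- nrow(df)
idx <- sample(n)                          # random permutation of row indices
nTrain <- round(0.6 * n)
nValid <- round(0.2 * n)
trainSet <- df[idx[1:nTrain], ]                         # fit the model here
validSet <- df[idx[(nTrain + 1):(nTrain + nValid)], ]   # choose the model here
testSet  <- df[idx[(nTrain + nValid + 1):n], ]          # estimate accuracy here
```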
Data Plot:
Model:
Linear: Underfitting
Polynomial of order 2 (Quadratic)
Polynomial of order 5: Best fit on the training data, but may indicate overfitting.
RMSE: Root Mean Square Error
Choose the polynomial model of order two because it has the lowest RMSE on both sets, especially the validation set, and the smallest gap between training and validation RMSE.
Accuracy is measured by the RMSE on the test set.
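The model comparison above can be reproduced on simulated data. This is an illustrative sketch, not the lecture's own code: the data-generating process, the rmse() helper, and the object names are all assumptions. The true relationship is quadratic, so the order-2 polynomial should generalize best.

```r
# Sketch: fit polynomials of increasing order and compare train/validation RMSE
set.seed(1)
x <- runif(200, -2, 2)
y <- 1 + 2 * x - x^2 + rnorm(200, sd = 0.5)   # true model is quadratic
trainSet <- data.frame(x = x[1:100],   y = y[1:100])
validSet <- data.frame(x = x[101:200], y = y[101:200])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

for (p in c(1, 2, 5)) {
  fit <- lm(y ~ poly(x, p), data = trainSet)
  cat("order", p,
      "train RMSE:", rmse(trainSet$y, predict(fit, trainSet)),
      "valid RMSE:", rmse(validSet$y, predict(fit, validSet)), "\n")
}
```

The linear fit underfits (high RMSE on both sets); the order-5 fit typically shows a lower training RMSE but a larger train/validation gap, which is the overfitting symptom the notes describe.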
The left one (balance) would be a good indicator of the probability of default and can be used by ML models/algorithms to automate decisions.
Lesson 3: Data Source
Lesson 4: Data Preprocessing
Data preprocessing is a crucial step in data analysis that involves cleaning, transforming, and organizing raw data in order to make it suitable for further analysis.
Data understanding is the first phase in the data mining process. It involves exploring and getting familiar with the data in order to identify its quality, completeness, and structure, as well as any patterns or relationships that may exist within it.
Data preparation is the process of selecting, cleaning, constructing, integrating and formatting data in order to make it usable for analysis.
R is a programming language and environment for statistical computing and graphics. It is widely used among data analysts and statisticians for data analysis, visualization, and modeling.
Lesson 5: German Credit
Remove a directory that is not empty: rm -rf
The length() function in R is used to get or set the length of vectors, lists, or other objects. The syntax is length(data), where data is a required parameter. The function returns a non-negative integer. For example, length(c(1, 2, 3, 4, 5)) returns 5.
Data Source:
Packages:
tibble:
In the programming language R, a tibble is a data object that is similar to a data frame, but with some minor differences in syntax and printing format.
dplyr
dplyr is a package in R language used to manipulate and summarize data frames or tibbles, including filtering, selecting, arranging, mutating, and summarizing data.
Code
#importing the data
germanCredit = read.table("german.data")
#dimension of the data: 1000 rows (observations of loans) and 21 columns (variables)
dim(germanCredit)
#reducing the number of features of the data to columns 1-5 and column 21
germanCredit<-germanCredit[,c(1:5, 21)]
#exploring the structure of data
str(germanCredit)
#use the documentation to assign column names
colnames(germanCredit)<-c("chkAcctStat", "duration", "credHist", "purpose", "amount", "rating")
#convert data.frame to tibble
library(tibble)
library(dplyr)
germanCredit<-as_tibble(germanCredit)
#printing the tibble
print(germanCredit)
#getting a summary of the info of the sample, this is to check whether there is something wrong with the data (is there an outlier or an error in the data?)
summary(germanCredit)
#transform integer variables into numeric (floating-point) variables
# %>% is the pipe operator from the dplyr library
germanCredit<- germanCredit %>%
mutate_if(is.integer, as.numeric)
#transform rating into a factor using the factor() function
# the "rating" in the () is the original variable
germanCredit<- germanCredit %>%
mutate(rating=factor(rating, labels =c("Good","Bad")))
#shows the percentage of the good/bad rated credit application
# germanCredit$rating is the rating column in the germanCredit data sample, then divided by the nrow (number of rows in germanCredit)
table(germanCredit$rating)/nrow(germanCredit)
#we see 0.7 (70% Good ratings) and 0.3 (30% Bad ratings)
#calculate WOE and some further ratios
tmpStats<-germanCredit %>%
select(chkAcctStat, rating) %>%
group_by(chkAcctStat) %>%
summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate=mean(rating=="Good"), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats)
#calculate each unique value of loan duration
tmpStats1<-germanCredit %>%
select(duration, rating) %>%
group_by(duration) %>%
summarize(goodRate=mean(rating=="Good"))
print(tmpStats1)
#form larger groups on a yearly basis
tmpStats2<-germanCredit %>%
select(duration, rating) %>%
mutate(interval=cut(duration, breaks=12*(0:6))) %>%
group_by(interval) %>%
summarize(goodRate=mean(rating=="Good"))
print(tmpStats2)
#calculate some statistics for the five groups
tmpStats3<-germanCredit %>%
select(credHist, rating) %>%
group_by(credHist) %>%
summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate =mean(rating=="Good"), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats3)
R -s -f main.r
[1] 1000 21
'data.frame': 1000 obs. of 6 variables:
$ V1 : chr "A11" "A12" "A14" "A11" ...
$ V2 : int 6 48 12 42 24 36 24 36 12 30 ...
$ V3 : chr "A34" "A32" "A34" "A32" ...
$ V4 : chr "A43" "A43" "A46" "A42" ...
$ V5 : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ V21: int 1 2 1 1 2 1 1 1 1 2 ...
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# A tibble: 1,000 × 6
chkAcctStat duration credHist purpose amount rating
<chr> <int> <chr> <chr> <int> <int>
1 A11 6 A34 A43 1169 1
2 A12 48 A32 A43 5951 2
3 A14 12 A34 A46 2096 1
4 A11 42 A32 A42 7882 1
5 A11 24 A33 A40 4870 2
6 A14 36 A32 A46 9055 1
7 A14 24 A32 A42 2835 1
8 A12 36 A32 A41 6948 1
9 A14 12 A32 A43 3059 1
10 A12 30 A34 A40 5234 2
# ℹ 990 more rows
chkAcctStat duration credHist purpose
Length:1000 Min. : 4.0 Length:1000 Length:1000
Class :character 1st Qu.:12.0 Class :character Class :character
Mode :character Median :18.0 Mode :character Mode :character
Mean :20.9
3rd Qu.:24.0
Max. :72.0
amount rating
Min. : 250 Min. :1.0
1st Qu.: 1366 1st Qu.:1.0
Median : 2320 Median :1.0
Mean : 3271 Mean :1.3
3rd Qu.: 3972 3rd Qu.:2.0
Max. :18424 Max. :2.0
Good Bad
0.7 0.3
# A tibble: 4 × 4
chkAcctStat pctOfTotalObs goodRate WOE
<chr> <dbl> <dbl> <dbl>
1 A11 0.274 0.507 0.0292
2 A12 0.269 0.610 0.446
3 A13 0.063 0.778 1.25
4 A14 0.394 0.883 2.02
# A tibble: 33 × 2
duration goodRate
<dbl> <dbl>
1 4 1
2 5 1
3 6 0.88
4 7 1
5 8 0.857
6 9 0.714
7 10 0.893
8 11 1
9 12 0.726
10 13 1
# ℹ 23 more rows
# A tibble: 6 × 2
interval goodRate
<fct> <dbl>
1 (0,12] 0.788
2 (12,24] 0.703
3 (24,36] 0.601
4 (36,48] 0.479
5 (48,60] 0.533
6 (60,72] 0
# A tibble: 5 × 4
credHist pctOfTotalObs goodRate WOE
<chr> <dbl> <dbl> <dbl>
1 A30 0.04 0.375 -0.511
2 A31 0.049 0.429 -0.288
3 A32 0.53 0.681 0.759
4 A33 0.088 0.682 0.762
5 A34 0.293 0.829 1.58
Lesson 6: WOE
weight of evidence (WOE):
The weight of evidence (WOE) is a statistical measure used in data analysis and predictive modeling to understand the relationship between a binary target variable and one or more input variables. It calculates the strength of the relationship between the variables and can be used to determine which variables are most predictive of the target variable.
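In the course scripts, the WOE of a category is the log-odds of "Good" versus "Bad" within that category; a positive WOE means the category is associated with more good than bad outcomes. A minimal standalone sketch (toy data; note that some texts instead normalize by the overall good/bad totals before taking the log):

```r
# WOE as used in these notes: per category, log(#Good / #Bad)
rating   <- c("Good", "Good", "Bad", "Good", "Bad", "Good")
category <- c("A11",  "A11",  "A11", "A12",  "A12", "A12")

woe <- function(r) log(sum(r == "Good") / sum(r == "Bad"))
tapply(rating, category, woe)   # both categories: log(2/1) ≈ 0.693
```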
#importing the data
germanCredit = read.table("german.data")
#dimension of the data: 1000 rows (observations of loans) and 21 columns (variables)
dim(germanCredit)
#reducing the number of features of the data to columns 1-5 and column 21
germanCredit<-germanCredit[,c(1:5, 21)]
#exploring the structure of data
str(germanCredit)
#use the documentation to assign column names
colnames(germanCredit)<-c("chkAcctStat", "duration", "credHist", "purpose", "amount", "rating")
#convert data.frame to tibble
library(tibble)
library(dplyr)
library(lattice)
library(ggplot2)
germanCredit<-as_tibble(germanCredit)
#printing the tibble
print(germanCredit)
#getting a summary of the info of the sample, this is to check whether there is something wrong with the data (is there an outlier or an error in the data?)
summary(germanCredit)
#transform integer variables into numeric (floating-point) variables
# %>% is the pipe operator from the dplyr library
germanCredit<- germanCredit %>%
mutate_if(is.integer, as.numeric)
#transform rating into a factor using the factor() function
# the "rating" in the () is the original variable
germanCredit<- germanCredit %>%
mutate(rating=factor(rating, labels =c("Good","Bad")))
#shows the percentage of the good/bad rated credit application
# germanCredit$rating is the rating column in the germanCredit data sample, then divided by the nrow (number of rows in germanCredit)
table(germanCredit$rating)/nrow(germanCredit)
#we see 0.7 (70% Good ratings) and 0.3 (30% Bad ratings)
#calculate WOE and some further ratios
tmpStats<-germanCredit %>%
select(chkAcctStat, rating) %>%
group_by(chkAcctStat) %>%
summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate=mean(rating=="Good"), pctOfGoodRate = sum(rating=="Good")/length(rating), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats)
#two graphs of credit rating ~ checking account status
ggplot(tmpStats, aes(x = chkAcctStat, y = pctOfGoodRate)) +
geom_bar(stat = "identity", fill = "grey") +
xlab("Checking Account Status") +
ylab("Percentage of Good Ratings") +
ggtitle("credit rating ~ checking account status")
ggplot(tmpStats, aes(x = chkAcctStat, y = WOE)) +
geom_bar(stat = "identity", fill = "grey") +
xlab("Checking Account Status") +
ylab("Weight of Evidence") +
ggtitle("credit rating ~ checking account status")
#graph of loan duration
xyplot(rating ~ duration, data = germanCredit)
#calculate each unique value of loan duration
tmpStats1<-germanCredit %>%
select(duration, rating) %>%
group_by(duration) %>%
summarize(goodRate=mean(rating=="Good"))
print(tmpStats1)
#form larger groups on a yearly basis
tmpStats2<-germanCredit %>%
select(duration, rating) %>%
mutate(interval=cut(duration, breaks=12*(0:6))) %>%
group_by(interval) %>%
summarize(goodRate=mean(rating=="Good"))
print(tmpStats2)
#graph of loan duration ~ rating
ggplot(tmpStats2, aes(x = interval, y = goodRate)) +
geom_bar(stat = "identity", fill = "grey") +
xlab("Loan duration in months") +
ylab("Percentage of Good Ratings") +
ggtitle("Loan duration ~ rating")
#calculate some statistics for the five groups
tmpStats3<-germanCredit %>%
select(credHist, rating, duration) %>%
group_by(credHist) %>%
summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate =mean(rating=="Good"), pctOfGoodRate = sum(rating=="Good")/length(rating), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats3)
ggplot(tmpStats3, aes(x = credHist, y = pctOfGoodRate)) +
geom_bar(stat = "identity", fill = "grey") +
xlab("Credit History") +
ylab("Percentage of Good Ratings") +
ggtitle("rating ~ credit history")
ggplot(tmpStats3, aes(x = credHist, y = WOE)) +
geom_bar(stat = "identity", fill = "grey") +
xlab("Credit History") +
ylab("Weight of Evidence") +
ggtitle("rating ~ credit history")
Graphs
library(lattice)
xyplot(y_variable ~ x_variable, data = my_data)
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
geom_bar(stat = "identity", fill = "blue") +
xlab("x-axis name") +
ylab("y-axis name") +
ggtitle("title")
Lesson 7: Data Generation
Synthetic data refers to artificially generated data that mimics real data in terms of its statistical properties, but does not contain any real-world information. It is often used in situations where real data is difficult to obtain, or to protect the privacy of individuals or sensitive information.
Classification of Synthetic data:
- Fully synthetic data refers to artificially generated data that closely mimics real data in terms of its statistical properties but does not contain any real information. There is strong privacy protection, but the truthfulness of the data is lost.
- Partially synthetic data refers to a dataset that has been partially replaced with synthetic data, while still retaining some of the original, real data. This technique is often used to protect sensitive information, while still allowing researchers to work with a representative dataset.
- Hybrid synthetic data refers to a combination of real data and artificially generated data. This blend of data is used to train machine learning models, improve data privacy, and address data scarcity issues.
Generating Synthetic Data:
- Generating data from a known distribution refers to the process of creating a dataset with values that follow a specific probability distribution, such as the normal distribution or the uniform distribution.
- Fitting a distribution to a known data involves finding the probability distribution that best describes the data. Once a distribution is fitted, it can be used to generate new data points that are consistent with the original data.
- Using deep learning. This method generates synthetic data with deep generative models such as the Variational Autoencoder (VAE) or the Generative Adversarial Network (GAN). In a VAE, an encoder compresses the original data into a more compact representation, and a decoder reconstructs the data from that compressed representation; the model is trained to minimize the difference between the output and the input data. A GAN consists of two separate networks: the generator, which takes random samples and turns them into synthetic data, and the discriminator, which tries to distinguish the synthetic data from the real data set.
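The first two approaches above fit naturally in base R. This is an illustrative sketch, not from the lecture: the parameter values are assumptions, and the "observed" data is itself simulated to stand in for a real data set. The "fit" here uses simple sample moments of a normal distribution; in practice one would use a proper fitting routine (e.g. maximum likelihood).

```r
# (1) Generate synthetic data directly from a known distribution
set.seed(7)
synthetic <- rnorm(1000, mean = 0, sd = 1)    # draws from N(0, 1)

# (2) Fit a distribution to observed data, then sample from the fit
observed <- rnorm(500, mean = 5, sd = 2)      # stands in for real data
muHat    <- mean(observed)                    # moment estimate of the mean
sdHat    <- sd(observed)                      # moment estimate of the sd
newData  <- rnorm(1000, mean = muHat, sd = sdHat)   # synthetic, consistent with fit
```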
Lesson 8: Simple Linear Regression
Statistical Learning refers to the set of tools used for understanding data and modeling its statistical behaviour.
- Supervised Statistical Learning: Estimate/Predict an output based on inputs (use labeled examples we have seen to predict the labels of unlabeled examples)
- Unsupervised Statistical Learning: Group input observations (discover "structure" or underlying patterns in a collection of data)
Simple Linear Regression:
We want to minimize the sum of the squared vertical distances (the grey lines) between the observations and the fitted line; this sum is the residual sum of squares (RSS).
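The least-squares idea can be seen directly with lm(): it picks the intercept and slope that minimize the RSS, so any other line must have an RSS at least as large. A minimal sketch on simulated data (the true coefficients 2 and 0.5 and the alternative slope 0.6 are arbitrary choices for illustration):

```r
# Sketch: simple linear regression minimizes the residual sum of squares (RSS)
set.seed(3)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)

fit <- lm(y ~ x)                       # least-squares fit
rss <- sum(residuals(fit)^2)           # the quantity least squares minimizes

# Any other candidate line yields an RSS at least as large:
rssOther <- sum((y - (2 + 0.6 * x))^2)
```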