Universität Leipzig AI & ML in Finance: Lesson Notes 2 – 8

Lesson 2: Introduction

Data Sample:

Training set: fit (train) the model using this set

Validation set: compare candidate models and select one using this set

Test set: estimate the accuracy of the final model using this set
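
A minimal R sketch of such a three-way split (the data frame, toy data, and 60/20/20 proportions are illustrative assumptions, not from the lesson):

# toy data frame standing in for the full sample
set.seed(42)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# shuffle the row indices, then carve out 60% / 20% / 20%
idx <- sample(nrow(df))
trainIdx <- idx[1:60]
validIdx <- idx[61:80]
testIdx  <- idx[81:100]

trainSet <- df[trainIdx, ]   # fit models here
validSet <- df[validIdx, ]   # compare and choose models here
testSet  <- df[testIdx, ]    # estimate final accuracy here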

Data Plot:

Model:

Linear: Underfitting

Polynomial of order 2 (Quadratic)

Polynomial of order 5: fits the training data most closely, which may indicate overfitting.

RMSE: Root Mean Square Error

We choose the polynomial model of order two because it has the lowest RMSE on both sets (especially the validation set) and only a small gap between training and validation RMSE.

Accuracy is estimated by the RMSE on the test set.
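
A minimal R sketch of this model comparison on simulated data (the quadratic ground truth, the split, and all variable names are illustrative assumptions):

# toy data with a quadratic ground truth
set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + 2 * x - x^2 + rnorm(100, sd = 0.5)
d <- data.frame(x = x, y = y)
train <- 1:60; valid <- 61:100          # illustrative train/validation split

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# compare polynomial fits of order 1 (linear), 2 (quadratic), and 5
for (k in c(1, 2, 5)) {
  fit <- lm(y ~ poly(x, k), data = d[train, ])
  cat("order", k,
      "train RMSE:", rmse(d$y[train], predict(fit, d[train, ])),
      "valid RMSE:", rmse(d$y[valid], predict(fit, d[valid, ])), "\n")
}
# expect order 2 to have the lowest validation RMSE and the smallest gap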

The left plot (balance) would be a good indicator of the probability of default, and an ML model/algorithm could use it to automate decisions.

Lesson 3: Data Source

Lesson 4: Data Preprocessing

Data preprocessing is a crucial step in data analysis that involves cleaning, transforming, and organizing raw data in order to make it suitable for further analysis.

Data understanding is the first phase in the data mining process. It involves exploring and getting familiar with the data in order to identify its quality, completeness, and structure, as well as any patterns or relationships that may exist within it.

Data preparation is the process of selecting, cleaning, constructing, integrating and formatting data in order to make it usable for analysis.

R is a programming language and environment for statistical computing and graphics. It is widely used among data analysts and statisticians for data analysis, visualization, and modeling.

Lesson 5: German Credit

Remove a directory that is not empty: rm -rf <directory>

The length() function in R is used to get or set the length of vectors, lists, or other objects. The syntax is length(data), where data is a required parameter. The function returns a numeric value. For example, length(c(1, 2, 3, 4, 5)) returns 5.

Data Source:

Packages:

tibble:

In the programming language R, a tibble is a data object that is similar to a data frame, but with some minor differences in syntax and printing format.

dplyr

dplyr is an R package used to manipulate and summarize data frames or tibbles, with verbs for filtering, selecting, arranging, mutating, and summarizing data.

Code

#importing the data
germanCredit = read.table("german.data")

#dimension of the data: 1000 rows (observations of loans) and 21 columns (variables)
dim(germanCredit)

#reducing the number of features of the data to columns 1-5 and column 21
germanCredit<-germanCredit[,c(1:5, 21)]

#exploring the structure of data
str(germanCredit)

#use the documentation to assign column names
colnames(germanCredit)<-c("chkAcctStat", "duration", "credHist", "purpose", "amount", "rating")

#convert data.frame to tibble
library(tibble)
library(dplyr)
germanCredit<-as_tibble(germanCredit)

#printing the tibble
print(germanCredit)
#getting a summary of the info of the sample, this is to check whether there is something wrong with the data (is there an outlier or an error in the data?)
summary(germanCredit)

#transform integer variables into numeric (float) variables
# %>% is the pipe operator from the dplyr library
germanCredit<- germanCredit %>% 
                mutate_if(is.integer, as.numeric)

#transform rating into a factor using factor()
# the "rating" inside the parentheses is the original variable
germanCredit<- germanCredit %>%
                mutate(rating=factor(rating, labels =c("Good","Bad")))

#shows the percentage of good/bad rated credit applications
# germanCredit$rating is the rating column in the germanCredit data sample, then divided by the nrow (number of rows in germanCredit)
table(germanCredit$rating)/nrow(germanCredit)
# result: 0.7 (70% good ratings) and 0.3 (30% bad ratings)

#calculate WOE and some further ratios
tmpStats<-germanCredit %>%
            select(chkAcctStat, rating) %>%
            group_by(chkAcctStat) %>%
            summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate=mean(rating=="Good"), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats)

#calculate the good rate for each unique value of loan duration
tmpStats1<-germanCredit %>%
            select(duration, rating) %>%
            group_by(duration) %>%
            summarize(goodRate=mean(rating=="Good"))
print(tmpStats1)

#form larger groups on a yearly basis
tmpStats2<-germanCredit %>%
              select(duration, rating) %>%
              mutate(interval=cut(duration, breaks=12*(0:6))) %>%
              group_by(interval) %>%
              summarize(goodRate=mean(rating=="Good"))
print(tmpStats2)

#calculate some statistics for the five groups
tmpStats3<-germanCredit %>%
            select(credHist, rating) %>%
            group_by(credHist) %>%
            summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate =mean(rating=="Good"), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats3)
Output of R -s -f main.r:
[1] 1000   21
'data.frame':   1000 obs. of  6 variables:
 $ V1 : chr  "A11" "A12" "A14" "A11" ...
 $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
 $ V3 : chr  "A34" "A32" "A34" "A32" ...
 $ V4 : chr  "A43" "A43" "A46" "A42" ...
 $ V5 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ V21: int  1 2 1 1 2 1 1 1 1 2 ...

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

# A tibble: 1,000 × 6
   chkAcctStat duration credHist purpose amount rating
   <chr>          <int> <chr>    <chr>    <int>  <int>
 1 A11                6 A34      A43       1169      1
 2 A12               48 A32      A43       5951      2
 3 A14               12 A34      A46       2096      1
 4 A11               42 A32      A42       7882      1
 5 A11               24 A33      A40       4870      2
 6 A14               36 A32      A46       9055      1
 7 A14               24 A32      A42       2835      1
 8 A12               36 A32      A41       6948      1
 9 A14               12 A32      A43       3059      1
10 A12               30 A34      A40       5234      2
# ℹ 990 more rows
 chkAcctStat           duration      credHist           purpose         
 Length:1000        Min.   : 4.0   Length:1000        Length:1000       
 Class :character   1st Qu.:12.0   Class :character   Class :character  
 Mode  :character   Median :18.0   Mode  :character   Mode  :character  
                    Mean   :20.9                                        
                    3rd Qu.:24.0                                        
                    Max.   :72.0                                        
     amount          rating   
 Min.   :  250   Min.   :1.0  
 1st Qu.: 1366   1st Qu.:1.0  
 Median : 2320   Median :1.0  
 Mean   : 3271   Mean   :1.3  
 3rd Qu.: 3972   3rd Qu.:2.0  
 Max.   :18424   Max.   :2.0  

Good  Bad 
 0.7  0.3 
# A tibble: 4 × 4
  chkAcctStat pctOfTotalObs goodRate    WOE
  <chr>               <dbl>    <dbl>  <dbl>
1 A11                 0.274    0.507 0.0292
2 A12                 0.269    0.610 0.446 
3 A13                 0.063    0.778 1.25  
4 A14                 0.394    0.883 2.02  
# A tibble: 33 × 2
   duration goodRate
      <dbl>    <dbl>
 1        4    1    
 2        5    1    
 3        6    0.88 
 4        7    1    
 5        8    0.857
 6        9    0.714
 7       10    0.893
 8       11    1    
 9       12    0.726
10       13    1    
# ℹ 23 more rows
# A tibble: 6 × 2
  interval goodRate
  <fct>       <dbl>
1 (0,12]      0.788
2 (12,24]     0.703
3 (24,36]     0.601
4 (36,48]     0.479
5 (48,60]     0.533
6 (60,72]     0    
# A tibble: 5 × 4
  credHist pctOfTotalObs goodRate    WOE
  <chr>            <dbl>    <dbl>  <dbl>
1 A30              0.04     0.375 -0.511
2 A31              0.049    0.429 -0.288
3 A32              0.53     0.681  0.759
4 A33              0.088    0.682  0.762
5 A34              0.293    0.829  1.58

Lesson 6: WOE

weight of evidence (WOE):

The weight of evidence (WOE) is a statistical measure, widely used in credit scoring, that quantifies how well the categories (or bins) of an input variable separate a binary target variable (here: Good vs. Bad ratings). Groups with a strongly positive WOE are dominated by good outcomes and groups with a negative WOE by bad outcomes, so WOE helps identify which variables and categories are most predictive of the target.
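
As computed in the script below, the WOE of a group g is the log odds of good versus bad outcomes within that group:

WOE_g = log( #Good_g / #Bad_g )

Some references instead normalize by the overall counts, WOE_g = log( (#Good_g / #Good_total) / (#Bad_g / #Bad_total) ); the two versions differ only by the constant log(#Good_total / #Bad_total), so they rank the groups identically.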

#importing the data
germanCredit = read.table("german.data")

#dimension of the data: 1000 rows (observations of loans) and 21 columns (variables)
dim(germanCredit)

#reducing the number of features of the data to columns 1-5 and column 21
germanCredit<-germanCredit[,c(1:5, 21)]

#exploring the structure of data
str(germanCredit)

#use the documentation to assign column names
colnames(germanCredit)<-c("chkAcctStat", "duration", "credHist", "purpose", "amount", "rating")

#convert data.frame to tibble
library(tibble)
library(dplyr)
library(lattice)
library(ggplot2)
germanCredit<-as_tibble(germanCredit)

#printing the tibble
print(germanCredit)
#getting a summary of the info of the sample, this is to check whether there is something wrong with the data (is there an outlier or an error in the data?)
summary(germanCredit)

#transform integer variables into numeric (float) variables
# %>% is the pipe operator from the dplyr library
germanCredit<- germanCredit %>% 
                mutate_if(is.integer, as.numeric)

#transform rating into a factor using factor()
# the "rating" inside the parentheses is the original variable
germanCredit<- germanCredit %>%
                mutate(rating=factor(rating, labels =c("Good","Bad")))

#shows the percentage of good/bad rated credit applications
# germanCredit$rating is the rating column in the germanCredit data sample, then divided by the nrow (number of rows in germanCredit)
table(germanCredit$rating)/nrow(germanCredit)
# result: 0.7 (70% good ratings) and 0.3 (30% bad ratings)

#calculate WOE and some further ratios
tmpStats<-germanCredit %>%
            select(chkAcctStat, rating) %>%
            group_by(chkAcctStat) %>%
            summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate=mean(rating=="Good"), pctOfGoodRate = sum(rating=="Good")/length(rating), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats)

#two graphs of credit rating ~ checking account status
ggplot(tmpStats, aes(x = chkAcctStat, y = pctOfGoodRate)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Checking Account Status") +
  ylab("Percentage of Good Ratings") +
  ggtitle("credit rating ~ checking account status")

ggplot(tmpStats, aes(x = chkAcctStat, y = WOE)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Checking Account Status") +
  ylab("Weight of Evidence") +
  ggtitle("credit rating ~ checking account status")

#graph of loan duration
xyplot(rating ~ duration, data = germanCredit)

#calculate the good rate for each unique value of loan duration
tmpStats1<-germanCredit %>%
            select(duration, rating) %>%
            group_by(duration) %>%
            summarize(goodRate=mean(rating=="Good"))
print(tmpStats1)

#form larger groups on a yearly basis
tmpStats2<-germanCredit %>%
              select(duration, rating) %>%
              mutate(interval=cut(duration, breaks=12*(0:6))) %>%
              group_by(interval) %>%
              summarize(goodRate=mean(rating=="Good"))
print(tmpStats2)

#graph of loan duration ~ rating
ggplot(tmpStats2, aes(x = interval, y = goodRate)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Loan duration in months") +
  ylab("Percentage of Good Ratings") +
  ggtitle("Loan duration ~ rating")

#calculate some statistics for the five groups
tmpStats3<-germanCredit %>%
            select(credHist, rating, duration) %>%
            group_by(credHist) %>%
            summarize(pctOfTotalObs = length(rating)/nrow(germanCredit), goodRate =mean(rating=="Good"), pctOfGoodRate = sum(rating=="Good")/length(rating), WOE=log(sum(rating=="Good")/sum(rating=="Bad")))
print(tmpStats3)

ggplot(tmpStats3, aes(x = credHist, y = pctOfGoodRate)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Credit History") +
  ylab("Percentage of Good Ratings") +
  ggtitle("rating ~ credit history")

ggplot(tmpStats3, aes(x = credHist, y = WOE)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Credit History") +
  ylab("Weight of Evidence") +
  ggtitle("rating ~ credit history")

Graphs

library(lattice)
xyplot(y_variable ~ x_variable, data = my_data)

library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
  geom_bar(stat = "identity", fill = "blue") +
  xlab("x-axis name") +
  ylab("y-axis name") +
  ggtitle("title")

Lesson 7: Data Generation

Synthetic data refers to artificially generated data that mimics real data in terms of its statistical properties, but does not contain any real-world information. It is often used in situations where real data is difficult to obtain, or to protect the privacy of individuals or sensitive information.

Classification of Synthetic data:

  • Fully synthetic data refers to artificially generated data that closely mimics real data in terms of its statistical properties but does not contain any real information. There is strong privacy protection, but the truthfulness of the data is lost.
  • Partially synthetic data refers to a dataset that has been partially replaced with synthetic data, while still retaining some of the original, real data. This technique is often used to protect sensitive information, while still allowing researchers to work with a representative dataset.
  • Hybrid synthetic data refers to a combination of real data and artificially generated data. This blend of data is used to train machine learning models, improve data privacy, and address data scarcity issues.

Generating Synthetic Data:

  • Generating data from a known distribution refers to the process of creating a dataset with values that follow a specific probability distribution, such as the normal distribution or the uniform distribution (see the R sketch after this list).
  • Fitting a distribution to known data involves finding the probability distribution that best describes the data. Once a distribution is fitted, it can be used to generate new data points that are consistent with the original data.
  • Using deep learning. This method of generating synthetic data uses deep generative models such as the Variational Autoencoder (VAE) or the Generative Adversarial Network (GAN). In the VAE method, an encoder compresses the original data into a more compact representation, and a decoder then reconstructs the original data from the compressed representation; the model is trained to minimize the difference between the output and the input data. The GAN model consists of two separate networks: the generator, which turns random noise into a synthetic data set, and the discriminator, which compares the synthetic data with the real data set.
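
A minimal R sketch of the first two approaches (the lognormal choice, sample sizes, and variable names are illustrative assumptions; fitdistr comes from the MASS package):

# 1) generate synthetic data from a known distribution: 1000 draws from N(0, 1)
set.seed(123)
synthKnown <- rnorm(1000, mean = 0, sd = 1)

# 2) fit a distribution to observed data, then sample from the fitted distribution
library(MASS)
realAmounts <- rlnorm(500, meanlog = 8, sdlog = 1)  # stand-in for real loan amounts
fit <- fitdistr(realAmounts, "lognormal")           # maximum-likelihood fit
synthFitted <- rlnorm(1000, meanlog = fit$estimate["meanlog"],
                      sdlog = fit$estimate["sdlog"])

# the synthetic sample should mimic the statistical properties of the original
summary(realAmounts)
summary(synthFitted)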

Lesson 8: Simple Linear Regression

Statistical learning refers to the set of tools used for understanding data and modeling its statistical behaviour.

  • Supervised statistical learning: estimate/predict an output based on inputs (use labeled examples we have seen to predict the labels of unlabeled examples)
  • Unsupervised statistical learning: group input observations (discover "structure" or underlying patterns in a collection of data)

Simple Linear Regression:

The blue line is the (estimated) regression line, with TV as the predictor and Sales as the output. The red dots are the actual observations, and the grey lines highlight the errors (residuals).

We want to minimize the residual sum of squares (RSS), i.e., the sum of the squared lengths of the grey lines.

The red dot in the middle of the second plot marks the coefficient values that minimize the RSS.
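
A minimal R sketch of this fit on simulated data (the variable names and the data-generating process are assumptions mirroring the figure, not the actual advertising data):

# simulate data resembling the figure: sales as a noisy linear function of TV
set.seed(1)
TV    <- runif(200, 0, 300)
sales <- 7 + 0.05 * TV + rnorm(200, sd = 3)

# least-squares fit of the simple linear regression sales ~ TV
fit <- lm(sales ~ TV)
coef(fit)                 # estimated intercept and slope

# RSS: the quantity the fitted line minimizes
sum(residuals(fit)^2)

# plot resembling the lecture figure: observations plus the regression line
plot(TV, sales, col = "red", pch = 20)
abline(fit, col = "blue")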
