Advanced analytics - logistic regression

Alongside the classic Power BI functionality, you can also harness the power of R (or Python) code. An R visual lets you build different, and perhaps more complex, charts than the standard Power BI visualisations. A second benefit (among others) is the ability to fit regression models, for example for predictive analytics. One such model, which I am going to talk about today, is logistic regression. Specifically, I will cover the coding side of the modelling, not the theory.

Start by creating an R visual on your dashboard and adding the data fields to it, just as you would with a standard Power BI visual. From there on, it is ordinary R coding. The following code fits a simple regression of a dependent variable on one independent variable (in our case, it measures the effect of employee age on the probability of leaving (turnover)):

model <- glm(dummy_model_leave~age, family="binomial", data=dataset)

 

The family="binomial" argument tells R that the model is a logistic regression (the binomial family uses the logit link by default), i.e. the dependent variable takes the values 0 or 1 (binary).
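
To see what the fitted model contains, here is a minimal, self-contained sketch on simulated data. The data frame `toy`, its size, and the coefficients used to generate it are made up for illustration; only the column names `age` and `dummy_model_leave` mirror the ones above. Inside a Power BI R visual you would work with `dataset` instead.

```r
# Simulated stand-in for the Power BI `dataset` (hypothetical values).
set.seed(1)
toy <- data.frame(age = sample(20:60, 200, replace = TRUE))
# Assumed relationship for the simulation: younger employees leave more often.
toy$dummy_model_leave <- rbinom(200, size = 1, prob = plogis(2 - 0.08 * toy$age))

toy_model <- glm(dummy_model_leave ~ age, family = "binomial", data = toy)

# Coefficients are reported on the log-odds scale.
coefs <- coef(toy_model)
# exp() turns the age coefficient into an odds ratio per extra year of age.
odds_ratio_age <- exp(coefs[["age"]])

print(coefs)
print(odds_ratio_age)
```

An odds ratio below 1 would mean the odds of leaving fall with each additional year of age; above 1, that they rise.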

Next, suppose we want to extract the probability of the dependent variable being 1, in our case the probability of leaving for employees who are still in the company. We can create a matrix to store the results, then use a loop together with the predict function to fill the matrix with the probabilities:

 

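n <- nrow(dataset)  # number of rows (named n rather than c, so the base function c() is not masked)
g <- matrix(nrow = n, ncol = 1)  # one-column matrix that will hold the probabilities
colnames(g) <- "Probability"
for (i in seq_len(n)) {
  newdata <- dataset[i, , drop = FALSE]  # the i-th employee as a one-row data frame
  g[i, 1] <- predict(model, newdata, type = "response")  # predicted probability of leaving
}

dataset <- cbind(dataset, g)  # append the Probability column to the data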
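
The loop is not strictly necessary: `predict()` accepts a whole data frame as `newdata` and returns one fitted probability per row. Here is a minimal sketch, on simulated data (the `toy` data frame and the coefficients used to generate it are hypothetical), that checks whether both approaches give the same numbers:

```r
# Simulated stand-in for the Power BI `dataset` (hypothetical values).
set.seed(1)
toy <- data.frame(age = sample(20:60, 50, replace = TRUE))
toy$dummy_model_leave <- rbinom(50, size = 1, prob = plogis(2 - 0.08 * toy$age))
toy_model <- glm(dummy_model_leave ~ age, family = "binomial", data = toy)

# Vectorised: a single call scores every row.
p_vec <- predict(toy_model, newdata = toy, type = "response")

# Row by row, in the same spirit as the loop above.
p_loop <- numeric(nrow(toy))
for (i in seq_len(nrow(toy))) {
  p_loop[i] <- predict(toy_model, newdata = toy[i, , drop = FALSE], type = "response")
}

# Compare the two results (names are dropped before comparing).
same <- isTRUE(all.equal(unname(p_vec), p_loop))
print(same)

# Store the probabilities directly as a new column.
toy$Probability <- p_vec
```

With the real data, the equivalent one-liner would be `dataset$Probability <- predict(model, newdata = dataset, type = "response")`.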

 

The calculated probabilities can be visualised, for instance, as a bar chart. Let's take the 30 employees with the highest probability of leaving and plot them as a bar chart with ggplot2:

 

z <- as.data.frame(dataset)

z <- z[z$dummy_model_leave == 0, ]  # keeps employees who are still in the company (same 0/1 column as in the model)

z <- z[order(z$Probability, decreasing = TRUE), ]  # highest probability first
z <- head(z, 30)  # top 30 employees (or fewer, if there are fewer rows)
z$Probability <- round(z$Probability * 100, digits = 1)  # convert to percent, kept numeric for the bar heights
z$Label <- paste(z$Probability, "%")  # text labels with a percent sign

library(ggplot2)

# Příjmení is the surname column; reorder() sorts the bars by probability
ggplot(z, aes(x = reorder(Příjmení, Probability), y = Probability)) +
  geom_bar(stat = "identity", fill = "#244070") +
  coord_flip() +
  labs(x = "Employees", y = "Probability of leave (%)") +
  geom_text(aes(label = Label), size = 3, colour = "#FFFFFF", position = position_stack(vjust = 0.5))

And there you go, you are finished! Well, not really: logistic regression requires a series of tests (out-of-sample validation, McFadden's pseudo R-squared, ...), but more on that some other time.
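
As a small taste of one of those checks, McFadden's pseudo R-squared compares the log-likelihood of the fitted model with that of an intercept-only model. This is only a minimal sketch on simulated data (the `toy` data frame and its generating coefficients are made up for illustration), not a full validation workflow:

```r
# Simulated stand-in for the Power BI `dataset` (hypothetical values).
set.seed(1)
toy <- data.frame(age = sample(20:60, 200, replace = TRUE))
toy$dummy_model_leave <- rbinom(200, size = 1, prob = plogis(2 - 0.08 * toy$age))

# Model with age, and an intercept-only model as the baseline.
toy_model  <- glm(dummy_model_leave ~ age, family = "binomial", data = toy)
null_model <- glm(dummy_model_leave ~ 1, family = "binomial", data = toy)

# McFadden's pseudo R-squared: 1 - logLik(fitted) / logLik(intercept-only).
mcfadden <- 1 - as.numeric(logLik(toy_model)) / as.numeric(logLik(null_model))
print(mcfadden)
```

Values closer to 0 indicate that age adds little over the baseline; higher values indicate a better fit relative to the intercept-only model.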