Employee attrition data analysis in R

This is the last article in a series of three articles on employee churn published on AIHR Analytics. In this article I will demonstrate how to build, evaluate and deploy your predictive turnover model, using R.

To start, I will briefly introduce my vision on employee churn and summarize the data science process. You may be asking: what is employee churn? In one word, turnover: it is when employees leave the organization.

In another word, terminations, whether voluntary or involuntary. In the widest sense, churn or turnover is concerned with both the calculation of rates of people leaving the organization and the individual terminations themselves. We calculate past rates of turnover in an attempt to predict future turnover rates.

And indeed it is important to do that and to continue to do so. Data warehousing tools are very powerful in this regard, slicing and dicing this data efficiently over different time periods and at different levels of granularity. But it is only half the picture. This is why we are interested in the individual in addition to the aggregate. The term churn comes from the marketing context. In many businesses, such as cell phone companies, it is far harder to generate and attract new customers than it is to keep old ones.

So businesses want to do what they can to keep existing customers. This kind of thinking and mindset applies to Human Resources as well. So we have a marketing principle that applies to the management of human resources, and a data science set of algorithms that can help determine whether there are patterns of churn in our data that could help predict future churn.

This analysis is an example of how HR needs to start thinking outside of its traditional box. This analysis helps to address future HR challenges and issues. On a personal level, I like to think of People Analytics as what happens when the data science process is applied to HR information. For that reason, I would like to revisit what that process is and use it as the framework to guide the rest of the example illustrated in this article. The organization in this example realizes that when good people leave, it costs far more to replace them than to provide them with some incentives to stay.

So it would like to be data-driven in the HR decisions it makes with respect to employee retention. Often the data to analyze the problem starts with what is currently readily available. After some initial prototyping of predictive models, ideas surface for additional data collection to further refine the model. Since this is a first stab at this, the organization uses only what is readily available.

Attrition is a problem that impacts all businesses, irrespective of geography, industry and size of the company.

Employee attrition leads to significant costs for a business, including the cost of business disruption, hiring new staff and training new staff. As such, there is great business interest in understanding the drivers of staff attrition and in minimizing it. While this model can be run routinely to identify employees who are most likely to quit, the key driver of success would be the human element: reaching out to the employee, understanding the employee's current situation and taking action to remedy controllable factors that can prevent attrition.

This data set presents an employee survey from IBM, indicating if there is attrition or not. The data set contains approximately entries. Given the limited size of the data set, the model should only be expected to provide a modest improvement in the identification of attrition versus a random allocation of probability of attrition. While some level of attrition in a company is inevitable, minimizing it and being prepared for the cases that cannot be helped will significantly improve the operations of most businesses.

This could generate new insights for the business on what drives attrition, insights that cannot be generated by merely informational interviews with employees.

IBM has gathered information on employee satisfaction, income, seniority and some demographics. It includes the data of employees. To use a matrix structure, we changed the model to reflect the following data. We plan to run a logistic regression model and CART to determine the probability that a given employee falls into the Attrition condition and is therefore at high risk of leaving the company.

We will then test different parameters and probability thresholds using confusion matrices, the area under the curve (AUC) and the Gini coefficient to determine which of the models is the best predictor, and we will recommend its use in practice. This is how the first 10 rows look, transposed for convenience. We split the data into an estimation sample and two validation samples, using a randomized splitting technique.

The second validation data mimic out-of-sample data, and the performance on this validation set is a better approximation of the performance one should expect in practice from the selected classification method.
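As a minimal sketch of the randomized split described above, the following base R snippet divides a data frame into one estimation sample and two validation samples. The data frame `employees`, its columns, and the 60/20/20 proportions are assumptions made for illustration; the article does not state the proportions it used.

```r
# Hypothetical stand-in for the employee data: 100 rows, invented values.
set.seed(42)
employees <- data.frame(
  id        = 1:100,
  attrition = rbinom(100, size = 1, prob = 0.16)
)

# Shuffle the row indices once, then cut them into three groups.
# The 60/20/20 proportions are an assumption for illustration.
n        <- nrow(employees)
shuffled <- sample(n)
n_est    <- floor(0.6 * n)
n_val1   <- floor(0.2 * n)

estimation  <- employees[shuffled[1:n_est], ]
validation1 <- employees[shuffled[(n_est + 1):(n_est + n_val1)], ]
validation2 <- employees[shuffled[(n_est + n_val1 + 1):n], ]

# Sizes of the three samples; every row lands in exactly one of them.
c(nrow(estimation), nrow(validation1), nrow(validation2))
```

Shuffling the indices once and slicing them guarantees the three samples do not overlap, which is what lets the second validation sample act as out-of-sample data.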

Given our decisions, we use a number of classification methods to develop a model that discriminates between the different classes. In this paper we will consider logistic regression and classification and regression trees (CART). Logistic regression: logistic regression is a method similar to linear regression, except that the dependent variable is discrete. Linear logistic regression estimates the coefficients of a linear model using the selected independent variables while optimizing a classification criterion.
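To make the logistic regression step and the evaluation measures mentioned earlier more concrete, here is a rough sketch on simulated data. The predictors `overtime` and `years`, the simulated coefficients, the 350/150 split and the 0.5 threshold are all assumptions for illustration and are not taken from the article's data set.

```r
set.seed(1)
n <- 500

# Hypothetical predictors, not columns from the article's data set.
overtime <- rbinom(n, 1, 0.3)
years    <- round(runif(n, 0, 20))

# Simulate attrition so that overtime raises and tenure lowers the odds.
p_true    <- plogis(-1.5 + 1.2 * overtime - 0.08 * years)
attrition <- rbinom(n, 1, p_true)
df        <- data.frame(attrition, overtime, years)

train <- df[1:350, ]
test  <- df[351:n, ]

# Fit the logistic regression and inspect the estimated coefficients.
fit <- glm(attrition ~ overtime + years, data = train, family = binomial)
coef(fit)

# Predicted probability of attrition for the held-out rows.
p_hat <- predict(fit, newdata = test, type = "response")

# Confusion matrix at an assumed probability threshold of 0.5.
predicted <- factor(as.integer(p_hat >= 0.5), levels = c(0, 1))
actual    <- factor(test$attrition, levels = c(0, 1))
confusion <- table(actual = actual, predicted = predicted)

# AUC via the rank (Mann-Whitney) formula; Gini = 2 * AUC - 1.
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc  <- auc_rank(p_hat, test$attrition)
gini <- 2 * auc - 1
```

Changing the 0.5 threshold changes the confusion matrix but not the AUC or Gini coefficient, which is why the two kinds of measure are usually reported together.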

For example, these are the logistic regression parameters for our data. Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability that an observation belongs to one of the classes.

Employee attrition is costly. Because of this, it has become increasingly popular to use data analysis methods and technology to understand and manage employee attrition.

In Part 1, we will perform some exploratory data analysis using RStudio. We will also perform more multivariate analysis on our test data and discuss useful algorithms and R packages for survival analysis. Before any data analysis can be done, we first need to load the employee turnover sample data into RStudio.

The sample data has 1, rows and 35 columns i. Collecting and processing the data is probably the most tedious and time-consuming part of a data analysis project. We would also need to transform the data through scaling, decomposition or aggregation, and feature engineering. Since we are using a pre-processed sample dataset, these tasks are out-of-scope for this blog post.

Take the time to get your data right. Remember: garbage in, garbage out. After processing and transforming your data, we can do some exploratory data analysis (EDA). The goal of EDA is to get to know your data, get some descriptive statistics, and investigate relationships that may exist.

A lot of interesting insights can be gleaned from EDA. Moreover, some of these insights can be very useful in the future when we get into the modeling phase of the project.

For example, we might want to check if there are different attrition rates across different types of jobs. Use the ggplot2 package and this code to create two plots that show attrition rates across job roles. As you can see above in one of the plots we created, the Sales Representative job role has significantly more attrition relative to other job roles.
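The plotting code referred to above is not reproduced here, so the following is only a sketch of what such plots might look like. The tiny data frame `hr` and its column names `JobRole` and `Attrition` are assumptions made for illustration, and the values are invented.

```r
# Hypothetical mini data set; column names and values are assumed.
hr <- data.frame(
  JobRole   = rep(c("Sales Representative", "Research Scientist", "Manager"),
                  each = 4),
  Attrition = c("Yes", "Yes", "No", "No",
                "Yes", "No",  "No", "No",
                "No",  "No",  "No", "No")
)

# Share of leavers within each job role.
hr$left <- hr$Attrition == "Yes"
rates   <- aggregate(left ~ JobRole, data = hr, FUN = mean)
names(rates)[2] <- "rate"
rates

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Plot 1: attrition rate per job role, sorted by rate.
  p1 <- ggplot(rates, aes(x = reorder(JobRole, rate), y = rate)) +
    geom_col() +
    coord_flip() +
    labs(x = "Job role", y = "Attrition rate")

  # Plot 2: stayers and leavers as proportions within each job role.
  p2 <- ggplot(hr, aes(x = JobRole, fill = Attrition)) +
    geom_bar(position = "fill") +
    coord_flip() +
    labs(x = "Job role", y = "Proportion")

  print(p1)
  print(p2)
}
```

Plotting proportions rather than raw counts keeps small job roles comparable with large ones.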

The next plot could also be very interesting. It shows how overtime as a proxy for hourly vs. As can be seen above, hourly employees have significantly higher attrition rates as compared to exempt employees with no overtime. Here are other interesting plots that show some interesting relationships and give clues on what variables affect attrition rates such as job involvement and business travel.

Another cool thing we can do in R is check the relationship across three variables based on a hypothesis. For example, in the plot below, we check how job role and business travel affect attrition.

Manipulate, visualize, and perform statistical tests on HR data.

HR analytics, people analytics, workforce analytics -- whatever you call it, businesses are increasingly counting on their human resources departments to answer questions, provide insights, and make recommendations using data about their employees. In this course, you'll learn how to manipulate, visualize, and perform statistical tests on HR data through a series of HR analytics case studies.

In this chapter, you will get an introduction to how data science is used in a human resources context. Then you will dive into a case study where you'll analyze and visualize recruiting data to determine which source of new candidates ultimately produces the best new hires. The dataset you'll use in this and the other chapters in this course is synthetic, to maintain the privacy of actual employees.

When employers make a new hire, they must determine what the new employee will be paid. If the employer is not careful, the new hires can come in with a higher salary than the employees that currently work at the same job, which can cause employee turnover and dissatisfaction. In this chapter, you will check whether new hires are really getting paid more than current employees, and how to double-check your initial observations.

In many industries, workplace safety is a critical consideration. Maintaining a safe workplace provides employees with confidence and reduces costs for workers' compensation and legal liabilities. In this chapter, you'll look for explanations for an increase in workplace accidents.

Gallup defines engaged employees as those who are involved in, enthusiastic about and committed to their work and workplace. There is disagreement about the strength of the connection between employee engagement and business outcomes, but the idea is that employees that are more engaged will be more productive and stay with the organization longer. In this chapter, you'll look into potential reasons that one department's engagement scores are lower than the rest.

Performance management helps an organization keep track of which employees are providing extra value, or below-average value, and compensating them accordingly. Whether performance is a rating or the result of a questionnaire, whether employees are rated each year or more often than that, the process is somewhat subjective.

An organization should check that ratings are being given with regard to performance, and not individual managers' preferences or even biases (conscious or subconscious).

Every year, a lot of companies hire a number of employees. The companies invest time and money in training those employees. Beyond this, there are training programs within the companies for their existing employees as well.

The aim of these programs is to increase the effectiveness of their employees. But where does HR analytics fit into this? HR Analytics. Human resource analytics (HR analytics) is an area in the field of analytics that refers to applying analytic processes to the human resource department of an organization in the hope of improving employee performance and therefore getting a better return on investment.

HR analytics does not just deal with gathering data on employee efficiency. Instead, it aims to provide insight into each process by gathering data and then using it to make relevant decisions about how to improve these processes. Attrition in HR. Attrition in human resources refers to the gradual loss of employees over time.

In general, relatively high attrition is problematic for companies. HR professionals often assume a leadership role in designing company compensation programs, work culture and motivation systems that help the organization retain top employees.

How does Attrition affect companies? We will discuss the first question here and for the second question we will write the code and try to understand the process step by step. Attrition affecting Companies. A major problem in high employee attrition is its cost to an organization.

Job postings, hiring processes, paperwork and new hire training are some of the common expenses of losing employees and replacing them. Additionally, regular employee turnover prevents your organization from increasing its collective knowledge base and experience over time.

This is especially concerning if your business is customer facing, as customers often prefer to interact with familiar people. Errors and issues are more likely if you constantly have new workers.

IBM HR Analytics Employee Attrition & Performance

Hope the basics made sense. For this exercise, we will try to study the factors that lead to employee attrition. This is a fictional data set created by IBM data scientists. We need to first check the data type of the features, why?

For this exercise, our aim is to predict employee attrition, and it is important to see which variables contribute the most to attrition. But before that, we need to know whether the variables are correlated. If they are, we might want to avoid those variables in the model building process.

There are many continuous variables. We could look at their distributions and create a grid of pair plots, but that would be too much code just to see the correlation, as there are a lot of variables. Rather, we can create a seaborn heatmap of the numeric variables and see the correlation. The variables which are not poorly correlated i. From the above heat map we can now see which variables are poorly correlated and which ones are strongly correlated. Now replace the other categorical variables with dummy values.
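The passage above uses seaborn, a Python library, for the heatmap. Since this page is about R, here is a rough base R sketch of the same two steps: a correlation matrix of numeric variables and dummy coding of a categorical one. The variable names, the simulated values and the 0.8 cutoff are assumptions for illustration.

```r
set.seed(7)
n <- 200

# Hypothetical numeric features; names are illustrative only.
age         <- round(runif(n, 20, 60))
total_years <- pmax(age - 20 - round(runif(n, 0, 5)), 0)  # tied closely to age
distance    <- round(runif(n, 1, 30))
num         <- data.frame(age, total_years, distance)

# Pairwise Pearson correlations of the numeric variables.
corr <- cor(num)
round(corr, 2)

# Base R heatmap of the correlation matrix, without reordering or rescaling.
heatmap(corr, Rowv = NA, Colv = NA, scale = "none")

# List variable pairs whose absolute correlation exceeds an assumed cutoff.
high <- which(abs(corr) > 0.8 & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
high_pairs

# Dummy (one-hot) coding of a hypothetical categorical variable.
dept    <- data.frame(department = factor(c("Sales", "HR", "R&D", "Sales")))
dummies <- model.matrix(~ department - 1, data = dept)
dummies
```

Restricting the search to the upper triangle of the matrix avoids listing each pair twice and skips the diagonal, where every variable correlates perfectly with itself.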

We have our final dataset. We now have to start modelling, that is, predicting the attrition. Are you also confused like me? We already have the attrition data, so what is there to predict? Well, most of the time in regression and classification problems, you run your model with the available values and check metrics like the accuracy of the model by comparing predicted values with true values.

Now you will realize how important the training phase is. We train the model so that it can predict almost correct results. We will train the model on training data and predict the results on test data.

Attrition is a common issue that every company has to deal with. The data set has instances and 10 features, without any missing values.

As a result, I used accuracy only to get a quick idea of the model performances. Skewness in input variables may impact the performance of machine learning techniques.

The skewness test detects strong positive skewness in the feature "time". Correcting the skew may improve the model performances. A power transform like a Box-Cox transform might be useful. To understand the distribution of each feature of the data set independently, I visualized the data with histograms.
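As a small illustration of the skewness check and the Box-Cox idea mentioned above, the sketch below uses a simulated right-skewed variable as a stand-in for "time". The simulated data and the choice of lambda = 0 (the log transform) are assumptions; in practice lambda is usually estimated from the data, for example with `MASS::boxcox`.

```r
set.seed(3)

# Hypothetical right-skewed feature standing in for "time".
# It is strictly positive, as the Box-Cox transform requires.
time <- rexp(1000, rate = 0.5) + 1

# Sample skewness: third central moment over the 1.5 power of the variance.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Box-Cox transform for a given lambda (lambda = 0 is the log transform).
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

skew_before <- skewness(time)
skew_after  <- skewness(box_cox(time, 0))
c(before = skew_before, after = skew_after)
```

A skewness close to zero after the transform suggests the feature is now roughly symmetric; whether that actually helps depends on the model, so it is worth comparing cross-validation scores with and without the transform.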

Correlation gives an indication of how related the changes are between two variables. Neither the Pearson correlation nor the correlation matrix showed any variables that are highly correlated with each other, which is good to know, because some machine learning algorithms do not perform well when highly correlated variables are present.

I chose to standardize the data to rescale it. I could also normalize the data or leave it unchanged and check cross-validation scores before making a choice.
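For reference, the two rescaling options mentioned above can be sketched in base R as follows. The data frame `features` and its values are invented for illustration.

```r
# Hypothetical numeric features on very different scales.
features <- data.frame(
  monthly_hours = c(120, 160, 200, 240, 280),
  satisfaction  = c(0.2, 0.4, 0.5, 0.7, 0.9)
)

# Standardize: subtract each column's mean and divide by its standard deviation.
standardized <- scale(features)

# Normalize instead: rescale each column to the [0, 1] range.
normalized <- as.data.frame(
  lapply(features, function(x) (x - min(x)) / (max(x) - min(x)))
)

colMeans(standardized)       # each column is centered at (numerically) zero
apply(standardized, 2, sd)   # and has unit standard deviation
```

When the data is split into training and test sets, the means and standard deviations are normally computed on the training set only and then reused for the test set.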

As for the ordinal feature, "salary", I used a mapping function to convert it to numeric values. I didn't yet know which algorithms would do well on the data set, but I did know I needed to find out which features indicate whether an employee leaves or not, so this is a typical binary classification problem. The algorithms I selected are listed here. The confusion matrix indicated that the random forest (rf) made fewer errors, and the classification report also showed better performance by rf in terms of precision, recall and F1 score.
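The ordinal mapping of "salary" mentioned above could be done along these lines in R. The level names low, medium and high are an assumption, since the text does not list the actual values of the feature.

```r
# Assumed salary levels; the text does not list them.
salary <- c("low", "high", "medium", "low", "medium")

# Option 1: a named lookup vector.
salary_map <- c(low = 1, medium = 2, high = 3)
salary_num <- unname(salary_map[salary])

# Option 2: an ordered factor, whose integer codes follow the level order.
salary_fct  <- factor(salary, levels = c("low", "medium", "high"), ordered = TRUE)
salary_num2 <- as.integer(salary_fct)

salary_num
salary_num2
```

Both options give the same codes. The factor version has the advantage that a value outside the declared levels becomes NA rather than being silently mis-coded.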

Last, I used rf to score each feature, where the larger the score, the more important the feature.

An issue that every company deals with is attrition. Sales is an especially high-attrition function, which makes this analysis paramount. Sales attrition is a result of several components, including unoptimized sales compensation, unrealistic quotas, ineffective mentoring, career-path ambiguity, training inefficacy or just bad recruiting.

Hence the ability to slice and dice sales attrition in many ways to understand trends and their root causes can seriously help sales leadership make the changes required to build a healthier, better-performing sales force. Numerically analyzing attrition is a bit tricky. This stems from the fact that the base of employees is continually in flux.

Every month new hires join the salesforce, some employees are involuntarily terminated, some voluntarily leave the company and some others go inactive without leaving the company, for example when they go on a long-term leave of absence. Additionally, the quality of attrition is important. Let's say two companies of about the same size lose about 25 salespeople a month. Are they experiencing the same problem? What if one company is losing more experienced salespeople whereas the other is losing mostly salespeople who have been with the company for only months?

These two companies have wildly different problems. The first one may have an issue like its sales compensation program not rewarding top performers enough, but the other one may have a recruiting issue, since new hires are probably finding that the job does not match what they were told during recruiting.

The rate of attrition or the inverse retention rate is the most commonly used metric while trying to analyze attrition. The attrition rate is typically calculated as the number of employees lost every year over the employee base. This employee base can be tricky however. Most firms just use a start of year employee count as the base. Some firms calculate it on a rolling 12 month basis to get a full year impact.

This ratio becomes harder to use if your firm is growing its employee base. For example, let's say on Jan 1st of this year there were employees in the firm. Over the next 12 months we've lost employees.
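To illustrate why the choice of employee base matters for a growing firm, here is a small worked sketch. All headcount figures are invented for illustration, and the average-headcount base is shown as one possible alternative, not as the method this article goes on to recommend.

```r
# Hypothetical headcounts; the figures are invented for illustration.
start_headcount <- 1000   # employees on Jan 1
end_headcount   <- 1300   # employees on Dec 31 (a growing firm)
leavers         <- 200    # employees lost over the 12 months

# Common approach: leavers over the start-of-year employee count.
rate_start_base <- leavers / start_headcount

# Possible alternative for a changing base: leavers over average headcount.
avg_headcount <- (start_headcount + end_headcount) / 2
rate_avg_base <- leavers / avg_headcount

# Retention is the complement of attrition on the same base.
retention_start_base <- 1 - rate_start_base

c(start_base = rate_start_base, average_base = rate_avg_base)
```

With these numbers the start-of-year base gives a higher attrition rate than the average-headcount base, because the growth in headcount during the year is ignored in the denominator.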

