Featuring Engineering in Python: What is a Variable?

Diogo Ribeiro, Apr 29

A variable is any characteristic, number, or quantity that can be measured or counted.

The following are examples of variables:

- Age (21, 35, 62, …)
- Gender (male, female)
- Income (GBP 20000, GBP 35000, GBP 45000, …)
- House price (GBP 350000, GBP 570000, …)
- Country of birth (China, Russia, Costa Rica, …)
- Eye color (brown, green, blue, …)
- Vehicle make (Ford, Volkswagen, …)

They are called ‘variables’ because the value they take may vary (and it usually does) in a population.
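As a rough illustration of values varying across a population, the sketch below stores a few of the variables above as columns of a small pandas DataFrame and counts the distinct values of each. The DataFrame `people` and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical toy data: each column is one variable, each row is one person.
people = pd.DataFrame({
    "age": [21, 35, 62, 35],
    "gender": ["male", "female", "female", "male"],
    "income_gbp": [20000, 35000, 45000, 35000],
})

# Number of distinct values that each variable takes in this small population.
distinct_values = people.nunique()
print(distinct_values)
```

A column with more than one distinct value is a variable in the sense used here: its value changes from one individual to the next.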

Most variables in a data set can be classified into one of two major types:

- Numerical variables
- Categorical variables

========================================

Numerical variables

The values of a numerical variable are numbers.

They can be further classified into discrete and continuous variables.

Discrete numerical variable

A variable whose values are whole numbers (counts) is called discrete.

For example, the number of items bought by a customer in a supermarket is discrete.

The customer can buy 1, 25, or 50 items, but not 3.7 items.

It is always a whole number.

The following are examples of discrete variables:

- The number of active bank accounts of a borrower (1, 4, 7, …)
- Number of pets in the family
- Number of children in the family

Continuous numerical variable

A variable that may contain any value within some range is called continuous.

For example, the total amount paid by a customer in a supermarket is continuous.

The customer can pay GBP 20.5, GBP 13.10, GBP 83.20, and so on.
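One simple way to tell the two kinds of numerical variable apart in data is to check whether every value is a whole number. The sketch below does that with a hypothetical helper, `looks_discrete`, applied to made-up values based on the supermarket examples. It is only a heuristic: a continuous quantity that happens to be recorded in rounded units would also pass the check.

```python
import pandas as pd


def looks_discrete(values: pd.Series) -> bool:
    """Return True if every non-missing value is a whole number."""
    present = values.dropna()
    # The remainder after dividing by 1 is zero only for whole numbers.
    return bool((present % 1 == 0).all())


# Hypothetical values: items bought (counts) and amount paid (GBP).
items_bought = pd.Series([1, 25, 50, 3])
amount_paid = pd.Series([20.5, 13.10, 83.20])

print(looks_discrete(items_bought))
print(looks_discrete(amount_paid))
```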

Other examples of continuous variables are:

- House price (in principle, it can take any value) (GBP 350000, 57000, 1000000, …)
- Time spent surfing a website (3.4 seconds, 5.10 seconds, …)
- Total debt as a percentage of total income in the last month (0.2, 0.001, 0, 0.75, …)

========================================

Real Life example: Peer to peer lending (Finance)

Lending Club

Lending Club is a peer-to-peer lending company based in the US.

They match people looking to invest money with people looking to borrow money.

When investors invest their money through Lending Club, this money is passed onto borrowers, and when borrowers pay their loans back, the capital plus the interest passes on back to the investors.

It is a win for everybody: borrowers typically get lower loan rates and investors typically get higher returns.

If you want to learn more about Lending Club follow this link.

The Lending Club dataset contains complete loan data for all loans issued from 2007 to 2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information.

Features include credit scores, number of finance inquiries, address including zip codes and state, and collections among others.

Collections indicates whether the customer has missed one or more payments and the team is trying to recover their money.

The file is a matrix of about 890 thousand observations and 75 variables.

More detail on this dataset can be found on Kaggle’s website.

Let’s go ahead and have a look at the variables!

========================================

To download the Lending Club loan book from Kaggle, go to this website.

Scroll down to the bottom of the page, click on the link ‘loan.csv’, and then click the blue ‘download’ button towards the right of the screen to download the dataset.

Unzip it, and save it to a directory of your choice.

Note that you need to be logged in to Kaggle in order to download the datasets.

If you save it in the same directory from which you are running this notebook, then you can load it the same way I will load it below.

========================================

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline

    # let's load the dataset with just a few columns and a few rows
    # to speed things up
    use_cols = [
        'loan_amnt', 'int_rate', 'annual_inc',
        'open_acc', 'loan_status', 'open_il_12m']

    data = pd.read_csv('loan.csv', usecols=use_cols).sample(
        10000, random_state=44)  # set a seed for reproducibility

    data.head()

Continuous Variables

Let’s look at the values of the variable loan_amnt. This is the amount of money requested by the borrower in US dollars:

    data.loan_amnt.unique()

Let’s make a histogram to get familiar with the distribution of the variable:

    # hist() returns a matplotlib Axes object, which we use to label the plot
    fig = data.loan_amnt.hist(bins=50)
    fig.set_title('Loan Amount Requested')
    fig.set_xlabel('Loan Amount')
    fig.set_ylabel('Number of Loans')

The values of the variable vary across the entire range of the variable.

This is characteristic of continuous variables.

The taller bars correspond to loan sizes of 10000, 15000, 20000, and 35000.

There are more loans disbursed for those loan amount values.

This indicates that most people tend to ask for these loan amounts.

Likely, these particular loan amounts are pre-determined and offered as such on the Lending Club website.

Less frequent loan values, like 23,000 or 33,000, could be requested by people who require a specific amount of money for a definite purpose.
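The tall bars can also be read off numerically by counting how often each amount occurs. The sketch below uses a short, made-up series of loan amounts in place of `data.loan_amnt`, so the counts shown are illustrative only.

```python
import pandas as pd

# Hypothetical loan amounts standing in for data.loan_amnt.
loan_amnt = pd.Series(
    [10000, 15000, 10000, 23000, 20000, 10000, 15000, 35000])

# value_counts() sorts by frequency, so the most common amounts come first.
amount_counts = loan_amnt.value_counts()
print(amount_counts.head(3))
```

On the real data, the same call on `data.loan_amnt` should list the amounts behind the tallest bars of the histogram.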

Let’s do the same exercise for the variable interest rate, which is charged by Lending Club to the borrowers.

    data.int_rate.unique()

Let's make a histogram to get familiar with the distribution of the variable:

    fig = data.int_rate.hist(bins=30)
    fig.set_title('Interest Rate')
    fig.set_xlabel('Interest Rate')
    fig.set_ylabel('Number of Loans')

Again, we see that the values of the variable vary continuously across the variable range.

And now, let’s explore the income declared by the customers, that is, how much they earn yearly.

    fig = data.annual_inc.hist(bins=100)
    fig.set_xlim(0, 400000)  # zoom in on incomes up to 400k
    fig.set_title("Customer's Annual Income")
    fig.set_xlabel('Annual Income')
    fig.set_ylabel('Number of Customers')

The majority of salaries are concentrated between 30k and 70k, with only a few customers earning higher salaries.

Again, the values of the variable vary continuously across the variable range.
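A statement such as "most salaries fall between 30k and 70k" can be checked directly instead of being read off the plot. The sketch below computes the share of values in that range on a made-up series of incomes. The numbers are hypothetical and do not come from the Lending Club file.

```python
import pandas as pd

# Hypothetical annual incomes standing in for data.annual_inc.
annual_inc = pd.Series(
    [32000, 45000, 58000, 64000, 70000, 95000, 150000, 400000])

# between() is inclusive on both ends by default and returns booleans;
# the mean of a boolean series is the fraction of True values.
in_range = annual_inc.between(30000, 70000)
share_in_range = in_range.mean()

print(share_in_range)
print(annual_inc.median())
```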

Discrete Variables

Let’s explore the variable “Number of open credit lines in the borrower’s credit file” (open_acc in the dataset).

This is the total number of credit items (for example, credit cards, car loans, mortgages, etc.) that are known for that borrower.

By definition, it is a discrete variable, because a borrower can have 1 credit card, but not 3.5 credit cards.

Let's inspect the values of the variable:

    data.open_acc.dropna().unique()

Let's make a histogram to get familiar with the distribution of the variable:

    fig = data.open_acc.hist(bins=100)
    fig.set_xlim(0, 30)
    fig.set_title('Number of open accounts')
    fig.set_xlabel('Number of open accounts')
    fig.set_ylabel('Number of Customers')

Histograms of discrete variables have this typically broken shape, as not all the values within the variable range are present in the variable.

As I said, the customer can have 3 credit cards, but not 3.5 credit cards.
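The broken shape can be reproduced on a small scale. In the sketch below, which uses made-up account counts, bins narrower than one unit necessarily leave every second bin empty, because no value falls between two whole numbers. A frequency table per value is shown as an alternative view.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of open accounts standing in for data.open_acc.
open_acc = pd.Series([3, 4, 4, 5, 5, 5])

# Bins of width 0.5: edges at 3.0, 3.5, 4.0, ..., 6.0.
bin_edges = np.arange(3, 6.5, 0.5)
hist, _ = np.histogram(open_acc, bins=bin_edges)
print(hist)  # bins that cover only fractional values stay empty

# One count per distinct whole value, ordered by value.
counts_per_value = open_acc.value_counts().sort_index()
print(counts_per_value)
```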

Let’s look at another example of a discrete variable in this dataset: Number of installment accounts opened in the past 12 months (open_il_12m in the dataset).

Installment accounts are those for which, at the moment of acquiring them, the lender and the borrower agree on a set period and a set amount of repayments.

An example of this is a car loan or a student loan.

The borrower knows that they are going to pay a certain, fixed amount over, for example, 36 months.

Let's inspect the variable values:

    data.open_il_12m.unique()

Let’s make a histogram to get familiar with the distribution of the variable:

    fig = data.open_il_12m.hist(bins=50)
    fig.set_title('Number of installment accounts opened in past 12 months')
    fig.set_xlabel('Number of installment accounts opened in past 12 months')
    fig.set_ylabel('Number of Borrowers')

The majority of the borrowers have none or 1 installment account, with only a few borrowers having more than 2.

A variation of discrete variables: the binary variable

Binary variables are discrete variables that can take only 2 values, hence the name binary.

In the next cells, I will create an additional variable, called defaulted, to capture whether each loan has defaulted.

A defaulted loan is a loan that a customer has failed to repay and the money is lost.

The variable takes the value 0 where the loans are ok and being repaid regularly, or 1 when the borrower has confirmed that they will not be able to repay the borrowed amount.

Let’s inspect the values of the variable loan status:

    data.loan_status.unique()

Let’s create one additional variable called defaulted.

This variable indicates whether the loan has defaulted, that is, whether the borrower failed to repay the loan and the money is deemed lost.

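    # 1 if the loan status is 'Default', 0 otherwise
    data['defaulted'] = np.where(data.loan_status.isin(['Default']), 1, 0)

    # the mean of a 0/1 variable is the fraction of defaulted loans
    data.defaulted.mean()

The new variable takes the value of 0 if the loan has not defaulted.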

    data.head()

The new variable takes the value 1 for loans that are defaulted.

    data[data.loan_status.isin(['Default'])].head()

A binary variable can take 2 values.

For example, the variable defaulted that we just created: either the loan is defaulted (1) or not (0).

    data.defaulted.unique()

Let’s make a histogram, although histograms for binary variables do not make a lot of sense:

    fig = data.defaulted.hist()
    fig.set_xlim(0, 2)
    fig.set_title('Defaulted accounts')
    fig.set_xlabel('Defaulted')
    fig.set_ylabel('Number of Loans')

As we can see, the variable shows only 2 values, 0 and 1, and the majority of the loans are ok.
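Since a histogram adds little for a binary variable, a plain count of each value may be easier to read. The sketch below rebuilds a 0/1 flag from a short, made-up series of loan statuses, in the same way as the defaulted variable, and then counts the two values. The statuses listed are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical loan statuses standing in for data.loan_status.
loan_status = pd.Series(
    ["Current", "Fully Paid", "Default", "Current", "Late"])

# 1 where the status is 'Default', 0 everywhere else.
defaulted = pd.Series(np.where(loan_status.isin(["Default"]), 1, 0))

# How many loans fall into each of the two values.
status_counts = defaulted.value_counts().sort_index()
print(status_counts)

# The mean of a 0/1 variable is the fraction of loans flagged as 1.
default_rate = defaulted.mean()
print(default_rate)
```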

DiogoRibeiro7/Medium-Blog (github.com): Some Jupyter Notebooks that were published in my Medium Blog.
