Modeling Youth Unemployment

Can a start-up understand, engage and address youth unemployment?

Some background

Unemployment is defined as a situation where workers who are capable of working and willing to work do not find work.

It is expressed as a ratio of the total number of unemployed persons to the total labour force.
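Written as a formula, the ratio looks as follows. The multiplication by 100 is an assumption that the ratio is reported as a percentage, as the rates quoted in this article are:

```latex
\text{Unemployment rate} = \frac{\text{number of unemployed persons}}{\text{total labour force}} \times 100
```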

In India, employment and unemployment are calculated by the National Sample Survey Organization (NSSO) and the Labour Bureau.

The NSSO provides estimates of unemployment using different approaches and reference periods that classify the activity status of an individual, which are used in the estimation process (Thind, 2013).

Note: The Usual status approach uses a reference period of 365 days preceding the date of survey.

The Current weekly status approach uses a reference period of 7 days preceding the date of survey, while the Current daily status approach uses each day of the preceding week as the reference period.

The problem of unemployment has been growing since Independence, and especially so for the youth.

In 2015, over 30% of India’s youth was neither employed nor in education nor training, one of the highest percentages in the world (OECD, 2017).

Several factors have contributed to the unemployment problem in India, including high population growth, slow rate of economic progress (and job creation), joint family system that makes several people depend on a few, as well as the onset of technology (School of Open Learning, 2019).

Mehta (2019) also attributes unemployment to the caste system, the prevalence of agriculture, the fall of cottage and small industries (making artisans unemployed), the slow growth of industrialization, low savings and hence low investment (and so fewer employment opportunities), and the shortage of electricity, coal and raw materials in India.

The Labour Bureau attributes a part of the unemployment problem in India to underemployment and poor wages.

They also state that among the minorities who had proper jobs in 2015, more than half earned less than Rs 10,000 per month.

Hence, any intervention in the job market needs to address both the quantity and the quality of jobs on offer (Labour Bureau, Sept 2016).

At the macro level, the economic factors that are correlated with unemployment include the gross domestic product (GDP) of the nation, population growth, government budget deficit/surplus, the share price index and the share of employees in research and development.

Lifelong learning and other education programs have a significant inverse relation with unemployment.

The demographic factors that affect unemployment include migration and years of work experience (Belen Villena Maria, 2013/14).

The job market of India is fast changing, with new industries and skill sets generating demand for labor.

The pace at which technology is transforming industry puts skilled labor at risk.

Hence, upskilling and vocational training programs are necessary to help workers retain their jobs.

Also, job providers need to be matched with jobseekers, with industry bodies and government organizations playing the role of a mediator that provides accurate information to match demand with supply.

The responsibility for reskilling and upskilling will have to be shared by the private sector through the provision of training programs as well as mentoring and counseling opportunities for workers (Kedia, 2018).

KPMG has estimated the potential for employment in different sectors in India.

Their study suggests that the number of people employed in the Agricultural Sector is expected to decrease, while the number of people employed in Building Construction & Real Estate, Textile & Clothing and Handloom & Handicraft is expected to increase in the next 3 years (KPMG, 2016).

Data collection

Several datasets are available online for the study of unemployment in India.

A detailed listing is provided in the Appendix.

However, most, if not all, of these datasets are aggregates of raw data.

We searched for granular data at the city level and the individual level, and selected the following two datasets:

City Data by MarketLine Advantage (2017)

This provides annual data on 142 Indian cities for the years 2000–2011, spread over 47236 observations and 339 attributes under 5 primary categories, including Demography, Education, Household and Employment.

The database is well structured and provides features in different units (for instance in the local currency and in USD), as well as variables related to unemployment (such as inflation, mean household size, mean household expenditure, etc.).

However, the dataset does not cover villages and towns, and hence rural India is missing.

Also, since the data is aggregated at the city level, it does not provide demographic data at the household level (e.g. household income) or at the individual level (gender, educational qualifications, annual income, etc.).

For this we turned to a second dataset.

5th Employment Unemployment Survey (Labour Bureau, 2019)

We obtained this dataset directly from the Ministry of Labour and Employment (Government of India).

It pertains to the survey conducted by the Government between April and December 2015, across 36 States and Union Territories.

The original survey covered over 700,000 individuals in over 150,000 households using a multi-stage stratified random sampling approach.

This is more granular than the MarketLine dataset and covers the following categories:

i. Age and demographics of individual
ii. Training and education details
iii. Duration of employment/unemployment
iv. Reasons for unemployment

However, several survey questions used by the NSSO in earlier surveys (including religion, land ownership and marital status) have been dropped by the Labour Bureau; these may have been important attributes for the purpose of our study.

Also, we were not provided with the entire dataset, and around 200,000 observations have been withheld (details provided in the appendix).

We have used both the datasets in our study, for identifying the causes of unemployment amongst youth in India.

Doing so, we have arrived at a business idea that will provide meaningful work of minimum 10 hours a week for 10,000 youth over the next decade.

At the City Level

MarketLine provides data for 141 cities and 335 attributes, making it a total of 47235 observations.

Each observation has actual data from 2000 to 2011, along with forecasted data from 2012–2025.

This study considered only the actual data from 2000–2011.

Data was transformed such that each row presents all 335 attributes for a given city and year.
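The reshaping step can be sketched in plain Python. This is a minimal illustration, not the code used in the study: the record layout (`city`, `attribute`, `values`) and the sample numbers are hypothetical, and only the actual years 2000–2011 are kept.

```python
from collections import defaultdict

# Years with actual (not forecasted) data, as described in the text.
ACTUAL_YEARS = range(2000, 2012)


def to_city_year_rows(observations):
    """Turn one-record-per-(city, attribute) data into one row per (city, year)."""
    rows = defaultdict(dict)
    for obs in observations:
        for year in ACTUAL_YEARS:
            if year in obs["values"]:
                rows[(obs["city"], year)][obs["attribute"]] = obs["values"][year]
    # Each output row carries every attribute available for that city and year.
    return [
        {"city": city, "year": year, **rows[(city, year)]}
        for city, year in sorted(rows)
    ]


# Hypothetical sample: two attributes for one city, plus one forecast year.
observations = [
    {"city": "Mumbai", "attribute": "Unemployment Rate (%)",
     "values": {2000: 7.0, 2001: 6.8, 2012: 6.0}},
    {"city": "Mumbai", "attribute": "Mean household size",
     "values": {2000: 4.8, 2001: 4.7}},
]
wide_rows = to_city_year_rows(observations)
```

The forecast value for 2012 is dropped because it falls outside `ACTUAL_YEARS`.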

A snapshot of the transformed data is shown below:

Exploratory data analysis

An exploratory data analysis was conducted for the years 2000–2011.

The following are the key observations from the analysis:

Labor force participation is lowest in the North, East and North-Eastern states (indicated by red and orange dots), while it is highest in Central and South India (indicated by green dots); however, it has increased substantially across all cities since the year 2000.

Net labour force participation rate in India (2000–2011)

Growth in employment in agriculture, industry and services reveals that most cities had a negative growth rate in the number of people employed in agriculture, while employment in industry and services has grown in almost all cities (green dots).

Annual growth in employment in Agriculture, Industry and Services in India (2000–2011)

Average household size is negatively correlated with average household income.

Increasing the size of the household by one member decreases the mean income of the household by approx. 26000 a year, holding everything else constant.

The red dots indicate a net increase in unemployment over the period in question, while the green dots indicate a net decrease in unemployment.

However, we find that as the average size of a household increases by one member, the unemployment rate decreases by 0.


This may point to the prevalence of family-run enterprises in India, where the entire family lives and works together.

Taken together, the graphs may indicate that as the size of the household increases, members have a higher chance of finding employment, but the average income per person in the household actually decreases.

Gurgaon, Lucknow, Raipur and Ghaziabad have recorded the highest average increase in unemployment between 2000–2011.

Gurgaon witnessed an increase in annual unemployment despite the IT boom.

Kolhapur, Ratnagiri, Kannur and Mumbai have recorded a decrease in unemployment between 2000–2011.

A scatter graph of unemployment growth and inflation for 2000–2011 shows a positive correlation between the two variables.

For every 1 unit increase in the CPI index across the cities, growth in unemployment decreases by an average 1.82% holding everything else constant.

A positive correlation is seen between unemployment and the annual growth rate in secondary education.

For every 1% increase in the number who have completed secondary education, unemployment is seen to increase by an average of 0.03%, holding everything else constant.

A positive correlation is seen between unemployment and gross value added by the public services sector.

For every 1% increase in GVA (Public Services) in a city, unemployment increases by 0.02%, holding everything else constant.

A negative correlation is seen between the proportion of youth in the city and average unemployment rate in the years 2000–2011.

This may indicate that the younger the population in a city, the higher the chances of finding work, everything else held constant.

The red dots indicate a net increase in unemployment between 2000–2011, while the green dots indicate a net decrease in unemployment.

Using the above visualizations, a dashboard was built in Tableau to give the Municipal Administration a snapshot of unemployment-related variables in a city.

This may help city officials take decisions in real time, in order to improve employability in a city.

The dashboard can be viewed here: https://public.



devatha#!/

Analytical models on City-level data

Based on the above analysis, the following models were deployed on the MarketLine data:

i. Principal Component Analysis to reduce the dimensionality of the data (raw data has 335 attributes), followed by K-Means Clustering to cluster the cities
ii. Random Forest Regressor to identify the important features that determine the level of unemployment in a city for subsequent models
iii. Multiple linear regression to identify the causes of unemployment at the State level
iv. A logistic regression model to determine the factors that lead to an increase (or decrease) in the annual unemployment rate
v. Classification models (Logistic Regression, Decision Tree, Random Forest and Naïve Bayes Classifier) that attempt to classify whether unemployment will increase or decrease, based on the values of relevant attributes

After iterating through the above, we finalized the following approach for analyzing this dataset:

i. Use Random Forest Regressor to obtain the important attributes (among the 337 attributes in the original data)
ii. Apply Multiple Linear Regression to determine the impact of these factors on the unemployment rate
iii. Use Logistic Regression to classify whether unemployment will increase or decrease, based on the annual values of these features

Variable selection

The Random Forest Regressor model is good at handling tabular data with numerical features, or categorical features with fewer than hundreds of categories.

Unlike linear models, random forests can capture non-linear interaction between the features and the target.

It is a type of additive model that makes predictions by combining decisions from a sequence of base models.

More formally we can write this class of models as follows (TURI, 2019):

$$g(x) = f_0(x) + f_1(x) + f_2(x) + \dots$$

where the final model $g$ is the sum of simple base models $f_i$.

Here, each base classifier is a simple decision tree.

This broad technique of using multiple models to obtain better predictive performance is called “model ensembling”.

In random forests, all the base models are constructed independently using a different subsample of the data (Wikipedia, The Free Encyclopedia, 2019).

The objective of using Random Forest Regressor on this data is to find the important features that may be used in the regression models.

Although PCA gives the important components, it does not rank the features.

We used K-fold cross-validation to ensure unbiased sampling, and combined the important features from all the folds for later use in the regression models.

The model is fit on Unemployment Rate (%) as the response variable.

A total of 33 variables come out as the important attributes (out of a total of 337 features), as listed below:

However, this model suffered from multicollinearity, with the age variables having a variance inflation factor (VIF) >= 40, while all the other variables have a VIF <= 2.

Hence, we re-fitted the model after removing the age-related variables.
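For reference, the variance inflation factor of predictor $j$ is conventionally defined from the $R^2$ obtained by regressing that predictor on all the other predictors. The article does not say which software routine computed it, so the formula below is the textbook definition rather than a description of the exact procedure used:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, a VIF of 40 corresponds to $R_j^2 = 1 - 1/40 = 0.975$, meaning the other predictors explain 97.5% of the variance in the age variable, while a VIF of 2 corresponds to $R_j^2 = 0.5$.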

The final regression output is shown below.

The above features were used to build a model that would predict the level of unemployment in a State.

An accuracy of 90% was obtained on the validation dataset.

However, the impact of each of these variables on the unemployment rate is not available through this model.

As a result, we turned to regression models.

Model A: Identifying Causes of Unemployment at the City-level

The features obtained from the Random Forest Regressor, as well as features from domain knowledge, were used in a multiple linear regression model with the unemployment rate as the dependent variable.

Data from 2000–2011 for all 141 cities was selected, and an Ordinary Least Squares method was used to fit the model.

The following variables turned out to be significant:

The model has an adjusted R-square of 74.90% and a significant F-statistic.

The coefficients indicate that with each 1% increase in the number of females, the unemployment rate in a given State decreases by an average of 6%, holding all other variables constant.

We also see that for each 1% increase in the proportion who have completed further education, unemployment rate increases by an average of 3% holding everything else constant.

This may be due to the growth in the number of universities, but insufficient growth in demand for labour.

A plot of the residuals shows they are randomly distributed.

There are several influential observations but most of them appear to be genuine cases, and we chose not to remove them from our analysis.

Also, the residuals follow a normal distribution except for fat tails, which confirms that there are outliers in the data.

MODEL B: Classifying change in Unemployment at the City-level

The change in the annual unemployment rate can be represented as a binary variable (where increase=1 and decrease=0), and machine learning can then be used to classify whether a given city will witness an increase or decrease in unemployment based on the current year.
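The binary target described above can be derived from a city's annual rates as in the following sketch. The function name and the sample rates are hypothetical, and treating an unchanged rate as 0 is an assumption, since the text only defines increase and decrease.

```python
def label_annual_change(rates_by_year):
    """Map each year to 1 if the rate rose versus the previous year, else 0."""
    years = sorted(rates_by_year)
    return {
        year: 1 if rates_by_year[year] > rates_by_year[previous] else 0
        for previous, year in zip(years, years[1:])
    }


# Hypothetical unemployment rates (%) for one city.
rates = {2000: 6.0, 2001: 6.4, 2002: 6.1, 2003: 6.1}
labels = label_annual_change(rates)
```

The first year has no label because there is no earlier year to compare it with.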

We used a confusion matrix to test the performance of the classifiers.

The accuracy measures can be interpreted as below:

We tested 4 different machine learning algorithms for the purpose of classification.

Results obtained on training and test data are shown below:

In order to take preventive measures, the sensitivity of the classifier is important.

Decision Tree and Random Forest classifiers have the highest sensitivity (100%), but more data may be necessary in order to validate these results.
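As a reminder of how these measures follow from a confusion matrix, here is a small sketch using the standard definitions. The labels are made up for illustration and are not taken from the study; class 1 stands for "unemployment increased".

```python
def classifier_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical true and predicted labels for eight city-years.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classifier_metrics(y_true, y_pred)
```

Sensitivity is the share of actual increases that the classifier catches, which is why it matters most when the goal is to act before unemployment rises.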

For the remaining two classifiers (Logistic Regression and Naïve Bayes), we constructed ROC curves in order to visualize their performance, as shown below.

The foregoing analysis was conducted on city level data aggregated at an annual level, over a 12-year time period.

Next, we modeled more granular data at the individual level.

At the individual level

The Ministry of Labour and Employment (Government of India) conducted the 5th Employment and Unemployment survey between April and December 2015, across all 36 States and Union Territories of India.

The original survey covered over 700,000 individuals in over 150,000 households using a multi-stage stratified random sampling approach.

This is more granular than the MarketLine dataset and covers the following categories:

i. Age and demographics of individual
ii. Training and education details
iii. Duration of employment/unemployment
iv. Reasons for unemployment

The raw Labour Bureau data is spread across 5 “blocks”, where each block represents one part of the survey undertaken.

The survey collected data on over 90 attributes.

We first replaced the survey labels with the actual labels for ease of interpretation, as obtained from the survey questionnaire, as well as from the NIC (Central Statistical Organization, 2008) and NCO code lists (Ministry of Labour & Employment, 2004).

A unique code for an individual was derived as a concatenation of the unique household code and the serial number of the individual surveyed.

Using this as the primary key, blocks 1, 2, 3A, 3B, 4A and 5A were merged in Python for the purpose of this study.
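The key construction and merge can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the field names (`household_id`, `serial_no`), the two-digit zero padding of the serial number, and the sample rows are all hypothetical.

```python
def person_id(household_id, serial_no):
    """Concatenate the household code and the individual's serial number."""
    # Zero padding is an assumption; it keeps keys such as H001+1 and H0011 distinct.
    return f"{household_id}{int(serial_no):02d}"


def merge_blocks(household_block, person_blocks):
    """Merge individual-level blocks on person_id, then attach household fields."""
    merged = {}
    for block in person_blocks:
        for row in block:
            key = person_id(row["household_id"], row["serial_no"])
            merged.setdefault(key, {"person_id": key}).update(row)
    for record in merged.values():
        record.update(household_block.get(record["household_id"], {}))
    return merged


# Hypothetical rows standing in for a household block and two person blocks.
household_block = {"H001": {"state": "Kerala", "household_size": 4}}
demographics_block = [
    {"household_id": "H001", "serial_no": 1, "age": 24},
    {"household_id": "H001", "serial_no": 2, "age": 31},
]
activity_block = [{"household_id": "H001", "serial_no": 1, "work_status": "seeking work"}]
merged = merge_blocks(household_block, [demographics_block, activity_block])
```

Individuals missing from a block simply lack that block's fields, which mirrors a left-style merge on the individual key.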

Exploratory data analysis

A total of 156,564 households were surveyed over 6 months.

The number of households selected from each state followed a multi-stage stratified random sampling approach.

The total number of households surveyed in each State is shown here.

The average size of the households surveyed is shown below, where the color represents the number of members over the age of 15 (the lighter the color, the larger the number).

We see that the households surveyed in Meghalaya and Lakshadweep are the largest, while those surveyed in Andhra Pradesh and Tamil Nadu are the smallest.

Uttar Pradesh has the largest number of households included in the survey, followed by Maharashtra.

The average annual earnings of the individuals surveyed across the different social groups are presented below.

The income bracket of the largest proportion of each social group has been highlighted for reference.

Our concern pertains to youth in the 15–35 age bracket, so we subset the data for this age group and explored further.

The educational level of youth surveyed is shown here.


20–23% of the surveyed youth have secondary and higher secondary education.

Vocational Training experience amongst the surveyed youth is given below.

We see that those with undergraduate and post graduate degrees are more likely to undertake vocational training to augment their skills.

Income distribution of the surveyed youth reveals that in each income bracket, females receive a lower share than their male counterparts, as shown below:

A total of 14 work statuses are used by the Labour Bureau in the survey.

The work status of youth surveyed is given below.

We see that most of the youth surveyed “attended education”.

Over 50% of females surveyed attended to domestic duties.

Also, 5.27% of youth did not work and have been classified as “seeking work”, “unable to work” and “others” (as highlighted).

Amongst the surveyed youth looking for work, 63% are male and 37% are female.

The different ways in which they look for work are shown here.

Amongst the surveyed youth, the reasons for not finding a job are given below.

Of all those who did not work because of child commitments at home, 57% are female and 42% are male.

Also, a large proportion of these youth have an “Arts and Humanities” background (65%) with no vocational training experience (82%).

The survey has also queried relevant female youth on their reasons for staying out of the workforce, under 6 categories: child at home, financially well off, no job nearby, social problems, transport unavailable and other reasons.

The proportions in each category are given below, along with the significant proportions in red color font.

Analytical Models using individual-level data

Based on the above analysis, the following models are deployed on the Labour Bureau dataset:

i. Machine learning algorithms to classify an individual as unemployed or not based on his/her demographic factors
ii. Logistic Regression to determine some of the causes of unemployment among the youth
iii. Multiple Linear Regression to determine the monetary effect of finding work

We consider a youth as being unemployed if he/she falls in the following categories: Did not work but seeking/available for work & Others (begging, prostitution, etc.).

After subsetting for youth, we get a dataset of 303549 individuals surveyed across 36 States and Union Territories.

Of these, only 14289 individuals are unemployed as per the above categorization, giving an unemployment rate of 4.94% amongst those surveyed.

This is a skewed dataset, and we have balanced it with over-sampling for the purpose of the classification algorithms.

Doing so, the Unemployed class increases to 24.699% of the dataset.

This is divided into training and test datasets in the 75:25 ratio, and the resulting datasets are used in the subsequent models.
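The balancing and splitting steps can be sketched as below. This is a minimal illustration assuming simple random over-sampling with replacement, which the article does not confirm; the function names, the `unemployed` field and the toy data are hypothetical.

```python
import math
import random


def oversample_minority(rows, label_key, target_share, seed=0):
    """Duplicate minority rows (label 1) at random until they reach target_share."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    if not minority:
        raise ValueError("no minority rows to sample from")
    # Smallest minority count m such that m / (m + len(majority)) >= target_share.
    needed = math.ceil(target_share * len(majority) / (1 - target_share))
    extra = [rng.choice(minority) for _ in range(max(0, needed - len(minority)))]
    return majority + minority + extra


def split_rows(rows, train_fraction=0.75, seed=0):
    """Shuffle and split rows into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data: 5 unemployed out of 100, raised to roughly a quarter of the dataset.
rows = [{"unemployed": 0} for _ in range(95)] + [{"unemployed": 1} for _ in range(5)]
balanced = oversample_minority(rows, "unemployed", target_share=0.25)
train, test = split_rows(balanced)
```

The sketch follows the order described in the text (over-sample, then split). Splitting first and over-sampling only the training set is a common alternative, because it keeps duplicated rows out of the test set.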

MODEL A: Classifying youth as unemployed

The following attributes were selected for the classification exercise: Social Group (backward class, scheduled class/tribe, other), Age (in years, 15–35 age group only), Gender (male, female, transgender), Vocational training (binary variable, yes/no), Up skill (binary variable, yes/no), State (categorical variable), Education years (number of years of education experience), Unemployed (class variable, 1=unemployed, 0=employed).

The survey conducted by the Labour Bureau considers education as a categorical variable.

We have converted this into a numeric variable as follows:

We trained six supervised learning algorithms for classifying youth based on their demographic factors:

1. Logistic regression as a classifier
2. K-Nearest Neighbor
3. Decision Tree
4. Support Vector Machines
5. Deep Neural Networks

The above models were trained on both the datasets (original Labour Bureau dataset, as well as the over-sampled dataset), and their performance was validated on the corresponding test datasets.

In order to evaluate performance, we used the confusion matrix, with the following accuracy measures:

We are primarily concerned with the sensitivity measure, that is, correctly classifying a given individual as “unemployed”.

Logistic regression and Support Vector Machine classifiers have performed the poorest on this measure, while the Decision Tree and Deep Neural Network classifiers perform with over 98% accuracy on the validation dataset.

The accuracy measures obtained above indicate that by using supervised machine learning, the government can classify a youth as being unemployed solely on his/her demographic factors.

Some of these factors are beyond an individual’s control, but several others can be influenced to reduce the chances of being unemployed.

For instance, by offering vocational training programs, increasing the number of years of free education, and making it easier for youth to migrate to other States, the government may be able to reasonably address youth unemployment in the country.

At the same time, if a social entrepreneur knows which are the factors that lead to unemployment, then he/she may be able to arrive at a business solution to the same problem.

For this, we turn to logistic regression models.

MODEL B: Factors that cause unemployment amongst youth

We defined two logistic regression models to determine the causes of unemployment:

Model 1: Unemployed ~ Age + Gender + Education + Social Group + Up skill + State + VT
Model 2: Unemployed ~ Age + Gender + Education + Social Group + Up skill + State + VT_Field

The regression models were run on the original datasets (without over-sampling), and diagnosed accordingly.

Results are shown below:

The above indicates that both the models are good fits for the data, as shown by the Hosmer and Lemeshow goodness of fit test.

There is also no multicollinearity among the independent variables, as the highest variance inflation factor is that of State, at 4.14 and 4.3 in the two models respectively (Kassambara, 2018).

Details of some of the coefficients from Model 1 are presented below (all State variables are insignificant):

The dependent variable is “Unemployed” expressed as a binary variable.

The results show that amongst the surveyed youth (15–35 years), with each passing year in age, the probability of being unemployed decreases by 4% ceteris paribus.

It also shows that each year of education decreases the probability of being unemployed by 12%, everything else being held constant.

The regression results also show that compared to the reference category of a youth who has no vocational training, ceteris paribus, the chances of being unemployed reduce by 91% with vocational training.

This shows that government can give more emphasis to the schemes relating to vocational training and ensure that more youth get access to these programs.

We also revisit this later with our business idea.
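In a logistic regression, a coefficient is usually turned into a percentage effect by exponentiating it into an odds ratio. The sketch below shows that conversion; it assumes the percentages quoted above were obtained this way, which the article does not state, and strictly speaking the result is a change in the odds of being unemployed rather than in the probability.

```python
import math


def percent_change_in_odds(beta):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0


def beta_for_percent_change(percent):
    """Inverse mapping: the coefficient implied by a given percentage change in odds."""
    return math.log(1.0 + percent / 100.0)


# A 91% reduction in odds corresponds to an odds ratio of 0.09.
implied_beta = beta_for_percent_change(-91.0)
round_trip = percent_change_in_odds(implied_beta)
```

Under this reading, the vocational training effect would correspond to a coefficient of about -2.41, and the 12% reduction per year of education to an odds ratio of 0.88.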

Given that certain demographic factors affect the employment status of the youth, we used multiple linear regression models to determine the monetary effect of employment.

MODEL C: Monetary effect of finding work

We defined two regression models as given below.

Both models consider the monetary effect of finding employment on the annual income of the individual.

The difference between the two models is that in Model 1 all employment types are considered, while in Model 2, the employment status is replaced by a binary variable indicating unemployment.

Model 1: Average individual earnings ~ Age + Gender + Education years + Social Group + Vocational Training (Y/N) + State + Employment type
Model 2: Average individual earnings ~ Age + Gender + Education years + Social Group + Vocational Training (Y/N) + State + Unemployed (Y/N)

The regression models were run on the full datasets and diagnosed accordingly.

Results are shown below:

By replacing the different employment types with a binary variable (Unemployed=1, employed=0), the R-square of the model has decreased.

Both models pass all other diagnostic tests.

We choose Model 1 for further analysis as it separates the different employment types.

Details of some of the coefficients from this model are presented below.

The coefficients above indicate that on average, after accounting for the other variables, the average annual income of a youth increases by INR 61 a year.

It also shows that the average annual income of female youth is lower than that of their male counterparts, among those surveyed, holding everything else constant.

Considering the number of years of education, the data suggests that, ceteris paribus, average annual income of an individual can be expected to increase by INR 121 for each year of educational experience that he/she has.

Among the employment statuses, the reference category is “Others (beggars, prostitute, etc.)”.


The coefficients show that on average, holding everything else constant, having any kind of paid work has a significant positive impact on the annual average income of an individual.

The coefficients also show that those attending to domestic duty, attending to education and those seeking work have an annual income less than that of the reference category.

The two categories that are not significant in this model are “Rentier, pensioner” and “Unable to work due to disability”, both of which have the same annual average income as the reference category.

Also, the coefficients for the self-employed categories are the highest of all employment types, indicating that any business solution for finding work for the unemployed youth is likely to have the most impact if they are empowered to run their own business.

The coefficients for the State variable are shown below:

The reference state is Delhi; the coefficients above indicate that, holding everything else constant, the youth surveyed in Delhi have the highest annual income in the country.

Hence, if the unemployed youth are trained to run their own business, then the youth in the capital city would become a viable market for selling their goods/services.

Business Idea for a social enterprise

The exploratory data analysis of MarketLine data revealed that between 2000 and 2011, labor force participation was lower in North India than in the South.

Also, the number of people employed in Agriculture is fast decreasing, as opposed to Services and Industry.

This could indicate the growing demand for goods and services (as compared to farm and agricultural produce), and a profitable business idea would tap into this growing market.

The regression models on the same dataset showed a negative correlation between secondary education and unemployment.

This was also confirmed by the logistic regression model on the 5th EUS dataset, which indicates that with everything else held constant, each year of formal education reduces the chances of being unemployed by 12%.

The same model also shows that those who have received vocational training are on average, 91% less likely to be unemployed ceteris paribus.

Hence, education and vocational training must be part of any plan that attempts to address unemployment.

A business initiative that provides vocational training as part of its model can thus have a positive impact on the unemployed & underemployed youth of the country.

Also, a cross section of the cities in 2011 revealed a positive correlation (35.43%) between unemployment and gross value added by the public service sector.

Although the dataset pertains solely to urban India, this correlation indicates the need to move away from public services, and towards private initiatives for addressing unemployment.

The exploratory data analysis on the Labour Bureau dataset revealed that 0.08% of the surveyed youth were socially outcast (“beggars, prostitutes, etc.”).

The proportion of the surveyed youth in this category in each State is shown in the table.

Over 23% of the surveyed youth in Tamil Nadu were beggars, prostitutes, etc., followed by Uttar Pradesh at over 15%.

There are several government run rehabilitation centers for these youth, but quality audits have found that these are of poor standards, and worse than prisons and jails (Azad India Foundation, 2019).

Proportion of surveyed youth in the “Other (beggars, prostitutes, etc.)” category

At the same time, literature review (KPMG, 2016) suggests that the handloom and handicrafts industry is growing.

This is a $400 billion industry, of which India has less than 2% market share (India CSR Network, 2013).

This sector also has a large pool of skilled artisans, and a low cost of distribution via e-commerce.

There are also several government schemes and initiatives aimed at reviving this sector (Development Commissioner (Handicrafts), 2019).

Based on the above, we believe a worthwhile business would target the handicraft sector as the area of intervention, and the socially outcast as suppliers of labor.

We propose that youth beggars are adopted for rehabilitation, and select youth are trained by seasoned artisans as part of the rehabilitation process.

Those who excel can be mentored to set up their own micro enterprise and sell their wares in the global marketplace.

They can also be encouraged to build systems that will employ the other disconnected youth, with direct or indirect employment.

They will need coaching to set up a sustainable business, and get trained in basic business skills, micro financing & government funding.

They would also need to be evaluated and upskilled every 3 years as the case may be.

The success of such a social enterprise can be measured by the number of lives that are touched through the ecosystem.

A financial plan to scale up such an initiative to 10000 youth over the next ten years was constructed.

Our model shows that a well-run business in this sector can generate over INR 50 crores over ten years in profit before tax.

The plan rehabilitates, trains and engages up to 10,000 youth.

Using a conservative estimate of INR 1 lakh in sales per youth, as well as a 70% commission to youth on every item sold, we arrive at an income expense schedule as given below.

The net present value of future cash flows (assuming a risk-free rate of 10%) amounts to INR 27.52 crores.
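The arithmetic behind such a plan can be sketched in a few lines of Python. This is a minimal illustration, not the financial model used in the project: the function names, the yearly headcount ramp and the omission of operating costs are assumptions made here for illustration. Only the INR 1 lakh in sales per youth, the 70% commission and the 10% rate come from the text, so the result will not reproduce the INR 27.52 crores figure.

```python
def retained_margin(youth, sales_per_youth=100_000, commission=0.70):
    """Revenue (INR) the enterprise keeps after paying youth their commission."""
    return youth * sales_per_youth * (1 - commission)


def npv(cash_flows, rate=0.10):
    """Net present value of end-of-year cash flows, year 1 first."""
    return sum(cf / (1 + rate) ** year
               for year, cf in enumerate(cash_flows, start=1))


# Hypothetical ramp-up: 1,000 more youth engaged each year, reaching 10,000 in year 10.
headcount = [1_000 * year for year in range(1, 11)]

# Margin before operating costs (training, rehabilitation and overheads are NOT modeled).
margins = [retained_margin(n) for n in headcount]

CRORE = 10_000_000
print(f"Undiscounted margin: INR {sum(margins) / CRORE:.1f} crores")
print(f"NPV at 10%:          INR {npv(margins) / CRORE:.1f} crores")
```

Subtracting a yearly cost line from each entry of `margins` before calling `npv` would turn this into a profit-based estimate closer in spirit to the schedule described above.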

Conclusion

Unemployment among the youth is a cause of concern in India today.

Rapid urbanization and the slow rate of job growth have made unemployment an area that needs proactive intervention.

In this project, we studied the problem of youth unemployment in India in three stages. We began with a literature review and exploratory data analysis to understand the domain. We then used machine learning algorithms to determine whether unemployment is rising or declining in a city, and whether a youth is unemployed based on his or her demographics. Finally, we used regression models to determine the causes of youth unemployment in India and to evaluate the monetary impact of unemployment.

This analysis was conducted first at a city-wide level (using the MarketLine City Advantage dataset), and then at the individual level (using the 5th Employment-Unemployment Survey (EUS) dataset).

Based on the findings, we arrived at a business plan for a social enterprise in the handicraft sector, which would rehabilitate disconnected youth and provide vocational training for selected youth in reviving the dying handicrafts of India.

In so doing, we address 3 Sustainable Development Goals, viz. no poverty, education and well-being, and decent work and economic growth.

The models used in this project can be reused when new data becomes available from the Labour Bureau and MarketLine City Advantage.

The data will need to be cleaned and preprocessed before running the models on them.

Also, if the survey questionnaire is updated, then the models will need to be adjusted accordingly.
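One way to catch such questionnaire changes early is to compare the columns of a new data release with the columns the models were trained on, before re-running anything. The sketch below assumes the data arrives as a CSV file; the column names and the function name are hypothetical placeholders, not the actual fields of the Labour Bureau survey.

```python
import csv
import io

# Hypothetical column names; replace with the fields the models were trained on.
EXPECTED_COLUMNS = {"age", "gender", "education", "state", "employment_status"}


def compare_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names found in the CSV header."""
    header = next(csv.reader(csv_file), [])
    found = {name.strip().lower() for name in header}
    return sorted(expected - found), sorted(found - expected)


# Example with an in-memory file standing in for a new survey release.
sample = io.StringIO(
    "Age,Gender,Education,State,Marital_Status\n21,F,Graduate,TN,Single\n"
)
missing, unexpected = compare_columns(sample)
if missing or unexpected:
    print("Adjust preprocessing and models before reuse.")
    print("Missing columns:", missing)
    print("New columns:", unexpected)
```

A check like this only compares names; a question whose coding scheme changed while its column name stayed the same would still need a manual review of the questionnaire.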

Photo by Eric Ward on Unsplash

Challenges faced

We have used tools and techniques from Statistical Analysis, Data Mining, Business Fundamentals, Data Collection, Text Analysis, Data Visualization and Deep Learning, using R, Python, Microsoft Office, Canva and Tableau, in the course of this project.

Some of the challenges that we faced:

– The MarketLine dataset is freely available but is small and aggregated at the city level. Hence, we were not able to use this dataset for analyzing unemployment at the individual level. Also, the dataset presents only 11 years of actuals (2000–2011), making it too small to use for forecasting.

– The Ministry of Labour and Employment has not shared the entire dataset with us, and has withheld 2 lakh observations. Having the entire dataset would make the models more accurate. However, we assumed that the data provided is a random sample of the surveyed population, and proceeded accordingly.

– The Labour Bureau dataset is large, and some of the classification algorithms took a very long time to converge (viz. the SVM algorithm in R).

References

Azad India Foundation. (2019, Jan 1). Retrieved April 7, 2019, from Youth Ki Awaaz: https://www.[...]com/2008/04/beggary-in-india/

Belen Villena Maria, C. Statistical Analysis of Unemployment in Europe. Technische Hochschule Nürnberg Georg Simon Ohm, Nürnberg.

BI India Bureau. (2019, Mar 26). Job creation in India slows down by 6.9% in January: ESIC payroll data. Retrieved April 22, 2019, from Business India Insider: https://www.[...]cms

BI India Bureau. (2019, Feb 6). Unemployment in India is so bad that over 150 MBAs and engineering graduates applied for a handful of sanitation jobs. Retrieved April 22, 2019, from Business India Insider

Central Statistical Organization. National Industrial Classification (all economic activities). Ministry of Statistics and Programme Implementation. New Delhi: Government of India.

Development Commissioner (Handicrafts). (2019, March 20). [...]Textiles, Producer, & Government of India) Retrieved April 7, 2019, from Development Commissioner Handicrafts: http://www.[...]in/

Economic Times. (2014, Nov 18). India has world’s largest youth population: UN report. (Economic Times, India Times) Retrieved April 26, 2019, from Economic Times: https://economictimes.[...]cms

Frey C. & Osborne, M. (Sept 2013).

(2019, Jan 1). List of Identified as Endangered Craft. Retrieved April 2019, from http://handicrafts.[...]pdf

HULT Prize. Retrieved April 23, 2019, from HULT Prize: http://www.[...]org/challenge/

India CSR Network. (2013, May 22). The Declining Legacy of India - Rural Artisans: Report. Retrieved April 07, 2019, from IndiaCSR: https://indiacsr.[...]

(2018, March 11). Multicollinearity Essentials and VIF in R. Retrieved April 21, 2019, from STHDA - Statistical tools for high-throughput data analysis: http://www.[...]com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/

Kedia, S. (2018, Oct 18). 5 ways tech can transform the future of work in India. Retrieved April 23, 2019, from World Economic Forum: https://www.[...]

(2016, Jan 1). KPMG Environmental Scan 2016. Retrieved from http://www.[...]pdf

Labour Bureau. (2019, January). 5th Employment Unemployment Survey Data. Chandigarh, India.

Labour Bureau. (Sept 2016). Report on Fifth Annual Employment-Unemployment Survey (2015–16). Ministry of Labour and Employment. Chandigarh: Government of India.

Labour Bureau. (Sept 2016). Report on Youth Unemployment-Employment Scenario Vol 2. Ministry of Labour & Employment, 5th Annual Employment Unemployment survey 2015–16. Chandigarh: Government of India.

MarketLine Advantage. (2017, June 1). City Data. MarketLine Databases. London, United Kingdom.

Mehta, P. (2019, Jan 1). Main Causes of Unemployment in India. Retrieved Feb 14, 2019, from Economics Discussion: http://www.[...]net/articles/main-causes-of-unemployment-in-india/2281

Ministry of Labour and Employment. (2019, March). Round 5 - Schedule B. Annual Employment-Unemployment Survey 2015. Government of India.

Ministry of Labour & Employment. [...], 1968. Government of India.

Misra, S. (2014, June 1). Estimating Employment Elasticity of Growth for the Indian Economy. Retrieved April 22, 2019, from Reserve Bank of India: https://www.[...]

OECD Economic Surveys India.

Picarelli, S. (2017, Oct 6). India’s workforce is growing - how can job creation keep pace? Retrieved April 26, 2019, from World Economic Forum: https://www.[...]org/agenda/2017/10/india-workforce-skills-training/

Ruthven, O. (2017, 09 11). How to ‘Skill India’ When the Jobs are Bad. Retrieved 02 14, 2019, from The Wire: https://thewire.in/economy/skill-india-narendra-modi-jobs-in-india-unemployment

School of Open Learning. (2019, Jan 1). Meaning of unemployment. [...]Learning, Producer, & University of Delhi) Retrieved April 23, 2019, from Economics: https://sol.[...]php?id=1266&chapterid=933

Thind, M. (2013, July 2). Unemployment measurement in India. [...]Service, Producer) Retrieved April 23, 2019, from Arthpedia: http://www.[...]

(2019, Jan 1). Random Forest Regression. Retrieved April 20, 2019, from https://turi.[...]html

Wikipedia, The Free Encyclopedia. (2019, March 14). Ensemble learning. Retrieved April 21, 2019, from Wikipedia, The Free Encyclopedia: https://en.[...]org/wiki/Ensemble_learning

Sources of data

If you are interested in exploring this topic further, a few sources of data are given below.

Due to formatting limitations, we are pasting this as 2 screenshots (and the URLs will not be clickable).
