8 Top Books on Data Cleaning and Feature Engineering

Data preparation is the transformation of raw data into a form that is more appropriate for modeling.

It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.

Nevertheless, there are common data preparation tasks across projects.

It is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more.

Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.

Even though it is a challenging topic to discuss, there are a number of books on the topic.

In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.

Let’s get started.

The focus here is on data preparation for tabular data, e.g. data in the form of a table with rows and columns as it looks in an Excel spreadsheet.

Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.

Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic.

We will focus on these books.

I have gathered all the books I can find on the topic of data preparation, selected what I think are the best or better books, and organized them into three groups: data cleaning, data wrangling, and feature engineering.

I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.

Data cleaning refers to identifying and fixing errors in the data prior to modeling, including, but not limited to, outliers, missing values, and much more.
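To make the missing value and outlier cases concrete, here is a minimal sketch in plain Python. It is not taken from any of the books below. The `ages` column, the helper names `impute_median` and `iqr_outliers`, and the choice of median imputation and the 1.5 x IQR rule are all assumptions made for illustration; other strategies may suit your data better.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical column with one missing value and one implausible entry.
ages = [23, 25, None, 31, 29, 27, 240, 26]

cleaned = impute_median(ages)      # fill the gap first
outliers = iqr_outliers(cleaned)   # then flag suspicious values
```

On this toy column, the missing entry is filled with the median of the observed values (27) and the value 240 is the only one flagged as an outlier. Whether a flagged value is an error or a genuine extreme is still a judgment call for the project at hand.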

The top books on data cleaning include “Bad Data Handbook” (2012), “Best Practices in Data Cleaning” (2012), and “Data Cleaning” (2019).

Let’s take a closer look at each in turn.

The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.

Bad data is described not only as corrupt data but any data that impairs the modeling process.

It’s tough to nail down a precise definition of “Bad Data.”

Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats.

Sure, that’s part of the picture, but Bad Data is so much more.

[…] Bad Data is data that gets in the way.

— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.

It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.

I like this book a lot; it is full of valuable practical advice.

I highly recommend it!

The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.

This is a more general textbook on data preparation for computational-based social sciences rather than machine learning specifically.

Nevertheless, it contains a ton of useful advice.

My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses.

I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).

— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.

I think this is a great reference guide for general data preparation techniques, perhaps with better coverage than most “machine learning” focused books given the stronger statistical focus.

The book “Data Cleaning” was written by Ihab Ilyas and Xu Chu and was published in 2019.

As the name suggests, the book is focused on data cleaning techniques that fix errors in raw data prior to modeling.

Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data.

Rather than focus on a particular data cleaning task, in this book, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views.

— Page ixx, “Data Cleaning,” 2019.

It is more of a textbook than a practical book and is a good fit for academics and researchers looking for both a review of methods and references to the original research papers.

Data wrangling is a more general or colloquial term for data preparation that might include some data cleaning and feature engineering.

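The top books on data wrangling include “Data Wrangling with Python” (2016), “Principles of Data Wrangling” (2017), and “Data Wrangling with R” (2016).

Let’s take a closer look at each in turn.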

The book “Data Wrangling with Python: Tips and Tools to Make Your Life Easier” was written by Jacqueline Kazil and Katharine Jarmul and was published in 2016.

The focus of this book is the tools and methods that help you get raw data into a form ready for modeling.

Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.

— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.

This is a beginner’s book for those making their first steps into Python for data preparation and modeling, e.g. current Excel users.

This book is for folks who want to explore data wrangling beyond desktop tools.

If you are great at Excel and want to take your data analysis to the next level, this book will help!

— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.

This is the book to get if you are just starting out with Python for data loading and organization.

The book “Principles of Data Wrangling: Practical Techniques for Data Preparation” was written by Tye Rattenbury, et al. and was published in 2017.

Data wrangling is used to describe all of the tasks related to getting data ready for modeling.

The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data.

— Page ix, “Principles of Data Wrangling: Practical Techniques for Data Preparation,” 2017.

It’s a good book, but very high level.

Perhaps it is better suited to the manager than the practitioner.

For example, I don’t think I saw a single line of code.

The book “Data Wrangling with R” was written by Bradley Boehmke and was published in 2016.

As its name suggests, this book is focused on data preparation with R.

In this book, I will help you learn the essentials of preprocessing data leveraging the R programming language to easily and quickly turn noisy data into usable pieces of information.

— Page v, “Data Wrangling with R,” 2016.

This is a practical book.

It has lots of small, focused chapters with code examples on specific problems you will encounter during data preparation.

It’s a welcome change compared to many of the other high-level books in this round-up.

I’m a fan of this book, and if you are using R, you need a copy.

A downside is that there is a little too much of the R basics in this book.

I would rather these were left out and the reader directed to an introductory R book, lifting the requirements on the reader slightly.

Feature engineering refers to creating new input variables from raw data, although it also refers to data preparation more generally.

Top books on feature engineering include “Feature Engineering and Selection” (2019) and “Feature Engineering for Machine Learning” (2018).

Let’s take a closer look at each in turn.

The book “Feature Engineering and Selection: A Practical Approach for Predictive Models” was written by Max Kuhn and Kjell Johnson and was published in 2019.

This book describes the general process of preparing raw data for modeling as feature engineering.

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering.

— Page xi, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.

The examples in the book are demonstrated using R, which is important, as the author Max Kuhn is also the creator of the popular caret package.

An important perspective taken in the book is that data preparation is not just about meeting the expectations of modeling algorithms; it is required to best expose the underlying structure of the problem, requiring iterative trial and error.

This is the same perspective that I take in general and it’s refreshing to see in a modern book.

… we often do not know the best re-representation of the predictors to improve model performance.

Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations.

Moreover, we may need to search many alternative predictor representations to improve model performance.

— Page xii, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.

I think this is a must-own book, even if R is not your primary language.

The breadth of the methods discussed is worth the sticker price alone.

The book “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists” was written by Alice Zheng and Amanda Casari and was published in 2018.

I think this book has the most direct definitions up front of all of the books I looked at, describing a feature as a numerical input to a model and feature engineering as getting useful numerical features from the raw data.

Very crisp!

A feature is a numeric representation of an aspect of raw data.

Features sit between data and models in the machine learning pipeline.

Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model.

— Page vii, “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists,” 2018.
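As a rough illustration of that definition, here is a small sketch in plain Python that turns one raw record into numeric features. It is not an example from the book: the record fields (`signup_date`, `city`), the function name `engineer_features`, and the particular features chosen are hypothetical and only meant to show the idea.

```python
from datetime import date


def engineer_features(record, known_cities):
    """Turn one raw record (strings) into a dict of numeric features."""
    signup = date.fromisoformat(record["signup_date"])
    features = {
        # Numeric views of a raw date string.
        "signup_month": signup.month,
        "signup_is_weekend": int(signup.weekday() >= 5),
    }
    # One-hot encode the categorical city field against a fixed vocabulary.
    for city in known_cities:
        features[f"city_{city}"] = int(record["city"] == city)
    return features


# Hypothetical raw record, as it might arrive from a CSV file.
raw = {"signup_date": "2020-06-27", "city": "Perth"}
features = engineer_features(raw, known_cities=["Perth", "Sydney"])
```

Every value in `features` is a number, so the record can now sit between the raw data and a model in the sense of the quote. Which features are actually useful still has to be found by trying them on the problem.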

The examples are in Python and focus on using NumPy and Pandas, and there are lots of worked examples, which are great.

I think this is a good sister book or Python equivalent to the above “Data Wrangling with R” or “Feature Engineering and Selection,” although perhaps with less coverage.

I like the book.

I guess I would prefer to drop the math and direct the reader to a textbook.

I would also prefer the examples to focus on the machine learning modeling pipeline rather than standalone transforms.

But I’m being picky and pushing hard for directly useful code on a given project.

You have to pick the book that is right for you, based on your needs, e.g. code or textbook, Python or R.

I own all of these books, but the two I recommend are:

The reason is I like practical books and I like the R and Python perspectives when I’m figuring out what to try.

A close follow-up would be:

The first is super practical; the second is full of super helpful (yet super specific) advice.

For textbooks, needed for their references by most researchers, I’d probably recommend:

In this post, you discovered the top books on data cleaning, data preparation, feature engineering and related topics.

Did I miss a good book on data preparation? Let me know in the comments below.

Have you read any of the books listed? Let me know what you think of it in the comments.
