Machine Learning Beyond Predefined Recipes

Darwin automates three major steps in the data science process: cleaning, feature generation, and the construction of either a supervised or unsupervised model.

Each step is performed as a single generation of Darwin’s evolutionary process, which contains dozens of model architecture candidates.

At the end of each generation Darwin keeps the best performers, analyzes their architectural characteristics, and spawns a new generation of models based on these features.

This way, Darwin automatically generates thousands of models that evolve and improve with each generation to more accurately reflect the relationships in your data.

Cleaning

First, Darwin needs to convert data sets into a usable form for algorithmic development.

This includes representing categorical data as numeric and extracting features that preserve temporal relationships in date/time information.

Numeric features are also scaled to a common range so they can be compared to one another.
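As a rough illustration of these cleaning steps, the sketch below one-hot encodes a categorical column, turns the hour of a timestamp into sine/cosine features so that late-night and early-morning readings stay close together, and standardizes a numeric column. The sample records and all function names are hypothetical, and this is not Darwin's actual implementation; it only shows the kind of transformation being described.

```python
import math
from datetime import datetime

def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector (sorted category order)."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories

def cyclical_time_features(timestamps):
    """Encode hour-of-day as sine/cosine so 23:00 and 00:00 end up close together."""
    feats = []
    for ts in timestamps:
        angle = 2 * math.pi * ts.hour / 24
        feats.append([math.sin(angle), math.cos(angle)])
    return feats

def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # guard against constant columns
    return [(x - mean) / std for x in column]

# Hypothetical sensor records: (state, timestamp, temperature)
rows = [
    ("idle", datetime(2019, 1, 1, 0, 0), 20.0),
    ("running", datetime(2019, 1, 1, 6, 0), 35.0),
    ("running", datetime(2019, 1, 1, 12, 0), 50.0),
    ("fault", datetime(2019, 1, 1, 18, 0), 95.0),
]
states, times, temps = zip(*rows)

encoded, categories = one_hot(states)
time_feats = cyclical_time_features(times)
scaled = standardize(list(temps))

# One fully numeric feature row per record.
cleaned = [e + t + [s] for e, t, s in zip(encoded, time_feats, scaled)]
```

Each row in `cleaned` is now purely numeric, which is the form a model-building step would expect.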

Feature Generation

Once data has been cleaned, data scientists often manipulate that data to generate more appropriate features to solve a particular problem.

One of the biggest challenges in handling dynamic time series data is determining how to window the time steps for feature generation.

Darwin automates this windowing process using one-dimensional convolutional neural networks (CNN).

CNNs are a class of deep neural networks, a variation on multilayer perceptrons, designed to need only minimal preprocessing.

The network instead automatically learns the filters that traditionally would need to be engineered by hand.
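To make the windowing idea concrete, here is a minimal sketch of what a single one-dimensional convolution does to a time series: a small filter slides across the signal, and each output value summarizes one window of time steps. The filter weights below are picked by hand purely for illustration; in a CNN they would be learned during training. The function and variable names are assumptions, not part of Darwin.

```python
def conv1d(series, kernel, stride=1):
    """Slide `kernel` over `series` (no padding) and return one response per window."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, series[i:i + k]))
        for i in range(0, len(series) - k + 1, stride)
    ]

def relu(values):
    """Keep positive responses and zero out the rest."""
    return [max(0.0, v) for v in values]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]

edge_filter = [-1.0, 1.0]            # responds to a rise between adjacent steps
smooth_filter = [1 / 3, 1 / 3, 1 / 3]  # three-step moving average

rising_edges = relu(conv1d(signal, edge_filter))
smoothed = conv1d(signal, smooth_filter)
```

Here `rising_edges` is non-zero only at the window where the signal steps up, while `smoothed` is a moving average. A trained network would discover whichever filters are useful for the prediction task instead of relying on hand-picked ones like these.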

Feature Selection and Model Building

Once automated cleaning and feature generation have taken place, the data set is ready to be used to build a model.

Through neuroevolution, Darwin is capable of building both supervised learning and normal behavior models.

These methods differ in how they work and the problems they solve.

For supervised learning problems, the goal of Darwin is to ingest the cleaned input data and automatically produce a highly optimized machine learning model which can accurately predict a target of interest specified by the user.

Darwin accomplishes this using a patented evolutionary algorithm which simultaneously optimizes and compares various machine learning methodologies, most heavily favoring deep neural networks.

Darwin begins by analyzing the characteristics of the input dataset and the specified problem, and then applying past knowledge to construct an initial population of machine learning models which are likely to produce accurate predictions on the problem.

Then, traits from the best-performing models are combined to yield even better models over many generations.

This ensures a final model that is highly optimized to the specified problem.
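The generational loop described here can be sketched in a few lines. This is a generic evolutionary search under heavy simplifying assumptions, not Darwin's patented algorithm: each candidate is just a small dictionary of hypothetical hyperparameters, and `fitness` is a stand-in scoring function where a real system would train and validate a model.

```python
import random

# Hypothetical search space; real architecture descriptions would be far richer.
SEARCH_SPACE = {
    "layers": [1, 2, 3, 4, 5, 6],
    "units": [8, 16, 32, 64, 128],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
}

def fitness(genome):
    """Stand-in score (higher is better); a real system would train and validate a model."""
    return -(
        abs(genome["layers"] - 3)
        + abs(genome["units"] - 64) / 16
        + abs(genome["dropout"] - 0.2) * 10
    )

def random_genome(rng):
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def crossover(a, b, rng):
    """Combine traits: each gene is taken from one of the two parents."""
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}

def mutate(genome, rng, rate=0.2):
    """Occasionally replace a gene with a random value from the search space."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in genome.items()
    }

def evolve(generations=20, population_size=30, survivors=6, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    history = []  # best score seen in each generation
    for _ in range(generations):
        elite = sorted(population, key=fitness, reverse=True)[:survivors]
        history.append(fitness(elite[0]))
        children = []
        while len(children) < population_size - survivors:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = elite + children  # best performers carry over unchanged
    return max(population, key=fitness), history

best, history = evolve()
```

Because the best performers are carried into the next generation unchanged, the best score in `history` never gets worse from one generation to the next, which is the property the text relies on when it says models improve over many generations.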

In the same way that Darwin uses an evolutionary algorithm to solve supervised problems, it is also capable of identifying relationships in data that drift over time using a technique called normal behavior modeling.

Darwin does normal behavior modeling through an autoencoder, which is a neural network-based approach that performs dimensionality reduction.

Autoencoders compress data to reduce the feature set to the smallest size possible, and then decompress it with as little loss as possible.

Like any other neural network, autoencoders have numerous hidden layers, a defined latent space, and different activation functions in their encoding/decoding process.

Darwin automates the creation of this network topology, and then performs backpropagation with dropout to reduce the output loss via weight optimization.

When deployed in production, the model’s ability to reconstruct data over time helps to identify shifting relationships in data.
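The sketch below shows the principle on a deliberately tiny scale. It is an assumption-laden toy, not Darwin's network: a linear autoencoder with a single latent value and tied encoder/decoder weights, trained by plain gradient descent, with no hidden layers or dropout. It learns the relationship in "normal" data (the second signal is roughly twice the first) and then flags a reading whose reconstruction error exceeds the worst error seen in training. The data, names, and threshold rule are all hypothetical.

```python
import random

def encode(x, w):
    """Compress a 2-D point to a single latent value."""
    return x[0] * w[0] + x[1] * w[1]

def reconstruct(x, w):
    """Decode the latent value back to 2-D using the same (tied) weights."""
    z = encode(x, w)
    return [z * w[0], z * w[1]]

def reconstruction_error(x, w):
    x_hat = reconstruct(x, w)
    return (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2

def train(data, epochs=300, lr=0.05):
    """Batch gradient descent on the mean squared reconstruction error."""
    w = [0.5, 0.5]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in data:
            z = encode(x, w)
            r = [x[0] - z * w[0], x[1] - z * w[1]]  # residual x - x_hat
            rw = r[0] * w[0] + r[1] * w[1]
            for j in range(2):
                # d/dw_j of ||x - (w.x) w||^2
                grad[j] += -2.0 * (r[j] * z + rw * x[j]) / len(data)
        w = [w[j] - lr * grad[j] for j in range(2)]
    return w

rng = random.Random(0)

# "Normal" behavior: the second signal is roughly twice the first, plus noise.
normal = []
for i in range(50):
    t = -1.0 + 2.0 * i / 49
    normal.append([t + rng.gauss(0, 0.05), 2 * t + rng.gauss(0, 0.05)])

w = train(normal)

# Simplistic threshold: the worst reconstruction error seen on normal data.
threshold = max(reconstruction_error(x, w) for x in normal)

typical = [0.5, 1.0]   # follows the learned relationship
drifted = [0.5, -1.0]  # relationship between the two signals has changed
is_anomaly = reconstruction_error(drifted, w) > threshold
```

The `typical` reading reconstructs almost perfectly, while the `drifted` one does not, so its error rises above the threshold. In practice the threshold would be chosen more carefully, for example from a held-out set of normal data.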

Darwin uses this approach to build models that go beyond a traditional “risk index” and can identify anomalous operations and systems failures.

How good is this process? Read Darwin’s Efficacy Report to learn more.

Carlos Pazos is a Product Marketing Manager at SparkCognition responsible for automated model building and natural language processing solutions.
