Building a carbon molecule autoencoder

Unlabeled data is perfect for unsupervised learning, where the input does not need to come with a corresponding set of labels. Using the unlabeled catalog of carbon-based molecules on PubChem, I built a deep autoencoder whose latent layer captures what the neural network (NN) considered the most important features of the unlabeled SMILES dataset.

Project Carbon Coded

Autoencoders are a type of neural network architecture in which the output is intentionally trained to be as similar to the input as possible. If you input a picture of a flower, the output should look as close to the original flower as possible.

The architecture of a vanilla autoencoder

The architecture has an aesthetically pleasing, symmetrical shape to it. Data is compressed into the latent layer, colored red, much like zipping a file. Autoencoders, however, aren't necessarily better than other compression algorithms. So we have an algorithm that outputs something very similar to its input and that isn't really better at compression than the alternatives; what are autoencoders good for, then? The answer lies in a specific part of the architecture: the hidden layer. Also known as the latent layer, it is essentially a condensed, concentrated representation of the data's most important features. This makes autoencoders perfect for dimensionality reduction and noise reduction.

The project is broken down into 4 parts (illustrative sketches of these steps appear at the end of the post):

1. Importing and normalizing the SMILES string data
2. Translating the normalized strings into one-hot vectors
3. Building the deep NN model
4. Compiling the model and fitting the data

The dataset is a list of over 12,000 carbon-based molecules. A single element was chosen so that the autoencoder's latent layer could learn the features that make carbon molecules unique compared to other elements. Carbon was also chosen because it is one of the most versatile elements, which means a larger dataset with greater variety, a benefit in reducing overfitting.

Just a couple of examples of carbon-based molecules in the dataset

The autoencoder is trained to minimize the loss between its input and output, so that the model learns to replicate the given input as closely as possible. The input layer is composed of 63 nodes, which pass through two more layers of size 32 and 14. Each layer roughly halves the number of nodes until reaching the bottleneck of 7 nodes in the latent layer. Once encoding is complete, the operation is inverted and the number of nodes per layer in the decoder increases symmetrically to the encoder. The trained model therefore has a firm grasp of the unique structure of carbon molecules. One can think of it as a bottle of the magic essence that makes carbon so versatile.

Once trained, the latent layer can be used in a generative model, turning the vanilla autoencoder into a variational autoencoder (VAE). In theory, this VAE would be able to generate new carbon molecules, some of which may be useful in materials science, nanotechnology, or biotechnology.
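The post doesn't include the code for parts 1 and 2, so here is a minimal sketch of how the SMILES strings might be normalized and one-hot encoded. The character set construction, padding length, and helper names are my own assumptions, not the project's actual implementation.

```python
# Minimal sketch of parts 1-2: padding SMILES strings to a fixed length
# and one-hot encoding them. Charset and padding scheme are illustrative.
import numpy as np

def build_charset(smiles_list):
    """Collect every character that appears in the dataset, plus a padding token."""
    chars = sorted(set("".join(smiles_list)))
    return [" "] + chars  # index 0 is reserved for padding

def one_hot_encode(smiles_list, charset, max_len):
    """Pad each SMILES string to max_len and encode it as a (max_len, len(charset)) matrix."""
    char_to_idx = {c: i for i, c in enumerate(charset)}
    encoded = np.zeros((len(smiles_list), max_len, len(charset)), dtype=np.float32)
    for i, smiles in enumerate(smiles_list):
        padded = smiles.ljust(max_len)[:max_len]
        for j, c in enumerate(padded):
            encoded[i, j, char_to_idx[c]] = 1.0
    return encoded

# Example usage with two carbon-based molecules (ethanol and benzene):
smiles = ["CCO", "c1ccccc1"]
charset = build_charset(smiles)
X = one_hot_encode(smiles, charset, max_len=40)
print(X.shape)  # (2, 40, len(charset))
```

The result is one binary matrix per molecule, with one row per character position and one column per character in the vocabulary.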

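For parts 3 and 4, here is a minimal Keras sketch of the dense autoencoder the post describes, with the 63-32-14-7 bottleneck and a mirrored decoder. The activations, optimizer, loss, epochs, and batch size are assumptions, and since the post doesn't spell out how the one-hot encoded SMILES map onto the 63 input nodes, the training data is treated simply as an (n_samples, 63) array.

```python
# Minimal sketch of parts 3-4: building the 63 -> 32 -> 14 -> 7 -> 14 -> 32 -> 63
# autoencoder and compiling/fitting it. Hyperparameters are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 63   # input layer size from the post
latent_dim = 7   # bottleneck / latent layer size from the post

# Encoder: each layer roughly halves the number of nodes.
inputs = keras.Input(shape=(input_dim,))
x = layers.Dense(32, activation="relu")(inputs)
x = layers.Dense(14, activation="relu")(x)
latent = layers.Dense(latent_dim, activation="relu", name="latent")(x)

# Decoder: mirror image of the encoder.
x = layers.Dense(14, activation="relu")(latent)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(input_dim, activation="sigmoid")(x)

autoencoder = keras.Model(inputs, outputs, name="carbon_autoencoder")
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Placeholder array standing in for the ~12,000 encoded carbon molecules.
X = np.random.rand(12000, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=50, batch_size=256, validation_split=0.1)

# A separate encoder model exposes the 7-dimensional latent representation.
encoder = keras.Model(inputs, latent)
```

Once fit, `encoder.predict(X)` returns the 7-dimensional latent vectors, the "bottled essence" described above, which could later serve as the starting point for a VAE.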