Deep Learning for Data Integration

Nikolay Oskolkov, May 12

Figure: Synergistic effects of data integration with Deep Learning (image source).

This is the third article in the series Deep Learning for Life Sciences.

In the previous two posts, I showed how to use Deep Learning on Ancient DNA and Deep Learning for Single Cell Biology.

Now we are going to discuss how to utilize multiple sources of biological information, OMICs data, in order to achieve more accurate modelling of biological systems by Deep Learning.

Over the last decade, biological and biomedical research has benefited tremendously from technological progress delivering DNA sequence (genomics), gene expression (transcriptomics), protein abundance (proteomics) and many other levels of biological information, commonly referred to as OMICs.

Although individual OMICs layers are capable of answering many important biological questions, their combination, and the synergistic effects arising from their complementarity, promises new insights into the behavior of biological systems such as cells, tissues and organisms.

Therefore OMICs integration represents a contemporary challenge in Biology and Biomedicine.

In this article, I will use Deep Learning with Keras and show how integrating multi-OMICs data reveals hidden patterns not visible in individual OMICs.

Single Cells make Big Data

The problem of data integration is not entirely new for Data Science.

Imagine we know that a person looks at certain images, reads certain texts and listens to certain music.

Image, text and sound are very different types of data; however, we can try to combine them in order to build, e.g., a better recommender system that captures the interests of the person more accurately.

In Biology and Biomedicine, the idea of data integration has arrived only recently; however, it has been actively developed from the biological angle, resulting in several interesting methodologies such as mixOmics, MOFA, Similarity Network Fusion (SNF), OnPLS/JIVE/DISCO, Bayesian Networks, etc.

Figure: Integrative OMICs methods.

One problem that all the integrative OMICs methods listed above face is the curse of dimensionality, i.e. the inability to work in a high-dimensional space with a limited number of statistical observations, which is a typical setup for biological data analysis.

This is where Single Cell OMICs technologies are very helpful: they deliver hundreds of thousands and even millions of statistical observations (cells), as we discussed in the previous article, and thus provide truly Big Data ideal for integration.

Figure: Single cell multi-OMICs technologies (image source).

It is very exciting that multi-OMICs single cell technologies such as CITEseq and scNMTseq provide two and three levels of biological information, respectively, from exactly the same cells.

Integrating CITEseq data with Deep Learning

Here we will perform unsupervised integration of single cell transcriptomics (scRNAseq) and proteomics (scProteomics) data from CITEseq, 8 617 cord blood mononuclear cells (CBMC), using an Autoencoder, which is ideally suited for capturing the highly non-linear nature of single cell OMICs data.

We covered advantages of using Autoencoders for Single Cell Biology in the previous post, but briefly they are related to the fact that single cell analysis is essentially unsupervised.

We start by downloading the CITEseq data from here, reading them with Pandas and log-transforming them, which acts as a mild normalization.

As usual, rows are cells, columns are mRNA or protein features, and the last column corresponds to the cell annotation.
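As a minimal sketch of this preprocessing step, the snippet below builds a tiny made-up table in the layout described above (cells in rows, features in columns, annotation in the last column) and log-transforms the features. The gene names, the `cluster` column and the helper `split_and_log` are hypothetical and only illustrate the idea; for the real data one would read the downloaded files with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a CITEseq table: rows are cells, columns are
# features, and the last column holds the cell annotation.
rng = np.random.default_rng(0)
n_cells = 6
scRNAseq = pd.DataFrame(
    rng.poisson(5, size=(n_cells, 4)),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
scRNAseq["cluster"] = [0, 1, 0, 2, 1, 2]


def split_and_log(df):
    """Separate the annotation column and log-transform the features."""
    labels = df.iloc[:, -1]
    features = df.iloc[:, :-1]
    # log(1 + x) keeps zero counts at zero and compresses large values
    return np.log1p(features), labels


X_scRNAseq, Y_scRNAseq = split_and_log(scRNAseq)
print(X_scRNAseq.shape)
print(Y_scRNAseq.tolist())
```

The same helper could be applied to the protein table, so that both OMICs end up as cell-by-feature matrices with a shared cell order.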

Now we are going to build an Autoencoder model with 4 hidden layers using the Keras functional API.

The Autoencoder has two inputs, one for each layer of information, i.e. scRNAseq and scProteomics, and two corresponding outputs which aim to reconstruct the inputs.

The two input layers are separately linearly transformed in the first hidden layer (analogous to PCA dimensionality reduction) before they are concatenated in the second hidden layer.

Finally, the merged OMICs are processed through the bottleneck of the Autoencoder, after which the dimensions are gradually reconstructed to the initial ones according to the “butterfly” symmetry typical for Autoencoders.

Figure: Unsupervised integration of CITEseq data.

In the code for the Autoencoder, it is important to note that the first hidden layer imposes a severe dimensionality reduction on the scRNAseq data, from 977 genes down to 50 dimensions, while it leaves the scProteomics data almost untouched, i.e. reduces its dimensions from 11 to 10.

The bottleneck further reduces the total 60 dimensions after concatenation down to 50 latent variables which represent combinations of both mRNA and protein features.
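The architecture described above can be sketched with the Keras functional API roughly as follows. This is not necessarily the exact code behind the figures: the layer sizes (977 and 11 inputs, 50 and 10 encoder units, a 50-unit bottleneck) come from the text, while the layer names, the activations, the optimizer and the function name `build_citeseq_autoencoder` are assumptions made for illustration.

```python
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model


def build_citeseq_autoencoder(n_genes=977, n_proteins=11):
    # One input per OMIC
    in_rna = Input(shape=(n_genes,), name="scRNAseq")
    in_prot = Input(shape=(n_proteins,), name="scProteomics")

    # First hidden layer: separate linear reduction of each OMIC
    enc_rna = Dense(50, activation="linear", name="encoder_scRNAseq")(in_rna)
    enc_prot = Dense(10, activation="linear", name="encoder_scProteomics")(in_prot)

    # Second hidden layer: concatenation, 50 + 10 = 60 dimensions
    merged = Concatenate(name="merge")([enc_rna, enc_prot])

    # Bottleneck: 60 -> 50 latent variables mixing mRNA and protein signal
    bottleneck = Dense(50, activation="linear", name="bottleneck")(merged)

    # Mirror image of the merge step: 50 -> 60 (activation is an assumption)
    unmerged = Dense(60, activation="elu", name="merge_inverse")(bottleneck)

    # One decoder per OMIC, reconstructing the original dimensions
    out_rna = Dense(n_genes, activation="linear", name="decoder_scRNAseq")(unmerged)
    out_prot = Dense(n_proteins, activation="linear", name="decoder_scProteomics")(unmerged)

    model = Model(inputs=[in_rna, in_prot], outputs=[out_rna, out_prot])
    # One reconstruction loss per output; both are MSE in this sketch
    model.compile(optimizer="adam", loss=["mse", "mse"])
    return model


autoencoder = build_citeseq_autoencoder()
print(autoencoder.get_layer("merge").output.shape)
print(autoencoder.get_layer("bottleneck").output.shape)
```

Because both decoders branch off the same shared layers, the gradients from the scRNAseq and scProteomics reconstruction errors both flow through the bottleneck.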

A very handy thing here is that we can assign different loss functions to OMICs coming from different statistical distributions, e.g. when combining categorical and continuous data we can apply categorical cross entropy and mean squared error, respectively.
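To make the idea of per-OMIC losses concrete, here is a small NumPy sketch that computes a mean squared error for a continuous output and a categorical cross entropy for a one-hot output, and then adds them into one objective. All arrays are made-up toy values. In Keras, the analogous step is passing one loss per output to `compile`, which then minimizes their (optionally weighted) sum.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def categorical_cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    # Clip to avoid log(0); average the per-sample cross entropy
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1)))


# Continuous OMIC (e.g. log-transformed expression): true vs reconstructed
expr_true = np.array([[1.0, 2.0], [0.0, 1.0]])
expr_pred = np.array([[1.5, 2.0], [0.0, 0.0]])

# Categorical OMIC: one-hot truth vs predicted class probabilities
cat_true = np.array([[1, 0, 0], [0, 1, 0]])
cat_prob = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])

mse = mean_squared_error(expr_true, expr_pred)        # (0.25 + 0 + 0 + 1) / 4 = 0.3125
cce = categorical_cross_entropy(cat_true, cat_prob)   # -log(0.5), about 0.6931
total_loss = mse + cce                                # joint objective over both OMICs
print(mse, cce, total_loss)
```

Since the two terms can live on different scales, weighting them (for example via the `loss_weights` argument of `compile`) may be needed so that one OMIC does not dominate the training.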

Another great thing about data integration via Autoencoders is that all OMICs know about each other as the weights for each node / feature are updated through back propagation in the context of each other.

Finally, let us train the Autoencoder and feed the bottleneck into tSNE for visualization.

Figure: Effect of CITEseq data integration: to see patterns invisible in individual OMICs.

Comparing the tSNE plots obtained using individual OMICs with the tSNE on the bottleneck of the Autoencoder that combines the data, we can immediately see that the integration somewhat averages and reinforces the individual OMICs.
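The training and visualization step might look roughly like the sketch below. It uses small random matrices instead of the real CITEseq data and a deliberately shrunken network so that it runs in seconds; the sizes, the number of epochs and the tSNE settings are all assumptions. The key point is that a second `Model` sharing the trained layers exposes the bottleneck activations, which are then passed to tSNE.

```python
import numpy as np
from sklearn.manifold import TSNE
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

# Random stand-in data (hypothetical sizes), only to make the sketch runnable
rng = np.random.default_rng(0)
n_cells = 200
X_rna = rng.random((n_cells, 100)).astype("float32")
X_prot = rng.random((n_cells, 11)).astype("float32")

# Small two-input Autoencoder with the same overall shape as in the text
in_rna = Input(shape=(100,), name="scRNAseq")
in_prot = Input(shape=(11,), name="scProteomics")
merged = Concatenate()([Dense(20)(in_rna), Dense(10)(in_prot)])
bottleneck = Dense(15, name="bottleneck")(merged)
hidden = Dense(30, activation="elu")(bottleneck)
out_rna = Dense(100)(hidden)
out_prot = Dense(11)(hidden)

autoencoder = Model(inputs=[in_rna, in_prot], outputs=[out_rna, out_prot])
autoencoder.compile(optimizer="adam", loss=["mse", "mse"])

# An Autoencoder is trained to reproduce its own inputs
autoencoder.fit([X_rna, X_prot], [X_rna, X_prot],
                epochs=5, batch_size=32, verbose=0)

# The encoder reuses the trained layers and stops at the bottleneck
encoder = Model(inputs=[in_rna, in_prot], outputs=bottleneck)
latent = encoder.predict([X_rna, X_prot], verbose=0)

# 2D embedding of the integrated latent space for plotting
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(latent)
print(latent.shape, embedding.shape)
```

The two columns of `embedding` can then be scattered and colored by the cell annotation to compare against the tSNE plots of the individual OMICs.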

For example, the purple cluster would be hard to discover using the scRNAseq data alone, as it is not distinct from the blue cell population; after integration, however, the purple group of cells is easily distinguishable.

This is the power of data integration!

Integrating scNMTseq data with Deep Learning

While CITEseq includes two single cell levels of information (transcriptomics and proteomics), another fantastic technology, scNMTseq, delivers three OMICs from the same biological cells: 1) transcriptomics (scRNAseq), 2) methylation pattern (scBSseq), and 3) open chromatin regions (scATACseq).

The raw data can be downloaded from here.

Figure: scNMTseq data integration with Autoencoder.

The architecture of the Autoencoder is analogous to the one used for CITEseq with only one peculiarity: Dropout regularization is used on the input layers.

This is due to the fact that we have only ~120 cells sequenced while the dimensionality of the feature space is tens of thousands, so we need to apply regularization to overcome the curse of dimensionality.

Note that this was not necessary for CITEseq, where we had ~8K cells and ~1K features, i.e. exactly the opposite situation.

Overall, scNMTseq is not an easy case for data integration. I firmly believe, though, that this is just the beginning of the single cell multi-OMICs era and that many more cells will arrive soon from this exciting technology, so it is better to be prepared.
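A three-input variant with Dropout applied directly to the inputs could be sketched as below. The feature counts, the layer sizes, the dropout rate and the function name `build_scnmt_autoencoder` are hypothetical placeholders rather than the values used for the figures; the only element taken from the text is that each OMIC passes through a Dropout layer before its encoder.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Concatenate
from tensorflow.keras.models import Model


def build_scnmt_autoencoder(n_features, encoding_dim=20,
                            bottleneck_dim=10, drop_rate=0.2):
    """n_features maps an OMIC name to its number of features (hypothetical sizes)."""
    inputs, encoded = [], []
    for name, n in n_features.items():
        x_in = Input(shape=(n,), name=name)
        # Dropout on the input randomly masks features during training,
        # which regularizes the model when cells are few and features many
        x = Dropout(drop_rate, name=f"dropout_{name}")(x_in)
        x = Dense(encoding_dim, activation="linear", name=f"encoder_{name}")(x)
        inputs.append(x_in)
        encoded.append(x)

    merged = Concatenate(name="merge")(encoded)
    bottleneck = Dense(bottleneck_dim, activation="linear", name="bottleneck")(merged)
    unmerged = Dense(encoding_dim * len(encoded), activation="elu",
                     name="merge_inverse")(bottleneck)

    # One decoder per OMIC, restoring the original number of features
    outputs = [Dense(n, activation="linear", name=f"decoder_{name}")(unmerged)
               for name, n in n_features.items()]

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer="adam", loss=["mse"] * len(outputs))
    return model


model = build_scnmt_autoencoder(
    {"scRNAseq": 2000, "scBSseq": 3000, "scATACseq": 3000}
)
print(model.get_layer("merge").output.shape)
```

Dropout is only active during training, so the bottleneck extracted at prediction time sees the full, unmasked input.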

Figure: Combining transcriptomics with epigenetics information for scNMTseq.

Here, out of curiosity, I fed the bottleneck of the Autoencoder that combines the three scNMTseq OMICs into Uniform Manifold Approximation and Projection (UMAP), a non-linear dimensionality reduction technique which seems to outperform tSNE in terms of scalability for large amounts of data.

We can immediately see that the blue cluster, which is homogeneous in terms of gene expression, splits into two clusters when scRNAseq is combined with epigenetics information from the same cells (scBSseq and scATACseq).

Therefore it seems that we have captured a new heterogeneity between cells which was hidden when looking only at the scRNAseq gene expression data.

Can this be a new way of classifying cells across populations by using the whole complexity of their biology? If so, then the question arises: what is a cell population or cell type? I do not know the answer to this question.

Summary

Here we have learnt that multiple sources of molecular and clinical information are becoming common in Biology and Biomedicine thanks to the recent technological progress.

Therefore data integration is a logical next step which provides a more comprehensive understanding of the biological processes by utilizing the whole complexity of the data.

The Deep Learning framework is ideally suited for data integration due to its truly “integrative” updating of parameters through back propagation, whereby multiple data types learn information from each other.

I showed that data integration can result in discoveries of novel patterns in the data which were not previously seen in the individual data types.

As usual, let me know in the comments if you have a specific favorite area in Life Sciences which you would like to address within the Deep Learning framework.

Follow me on Medium at Nikolay Oskolkov, on Twitter @NikolayOskolkov, and check out the code for this post on my github.

I plan to write the next post about Bayesian Deep Learning for patient safety in clinical diagnostics; stay tuned.
