Deep Learning on Ancient DNA

With the current advances in Next Generation Sequencing (NGS) we can extract DNA from ancient bones, sequence it and understand a lot about the past through various types of statistical and Population Genetics analyses.

However, modern contamination is a big issue in the Ancient DNA research area.

Svante Pääbo, the founder of paleogenetics, admitted to having sequenced mostly modern DNA (presumably his own) in his famous work on DNA from an Egyptian mummy, the very work that gave rise to the field.

Indeed, by just looking at modern and ancient sequences you would never guess which one came from the past.

Therefore researchers in this area use advanced statistical analyses such as deamination pattern inference, where the mapDamage tool is very handy.

However, the problem with deamination analysis is that it is based on averaging across thousands or millions of aligned sequences. It is therefore only feasible if you have a deeply sequenced sample (a lot of ancient DNA is needed) and a reference genome to align the sequences to, which is not always available.

This is exactly where we can utilize the power of Machine and Deep Learning for detecting DNA motifs typical for ancient and modern sequences.

Deep Learning for Genomics

Since a DNA sequence is essentially a “biological text”, it can be analyzed using approaches from Natural Language Processing or Time Series data analysis.

Artificial Neural Networks (ANNs) are widely used in both areas and show state-of-the-art performance for Genomics as well.

For example, Convolutional Neural Networks (CNNs) are the workhorses of functional Genomics; see e.g. the excellent review by Angermueller et al., Molecular Systems Biology (2016) 12, 878, where transcription factor binding sites and splice sites are detected with high accuracy.

For a practical implementation of CNNs for Genomics I recommend the fantastic “A Primer on Deep Learning in Genomics” from Zou et al., Nature Genetics 51, pages 12–18 (2019).

The figure below from the paper describes a typical Deep Learning workflow in Genomics.

CNN for Genomics, from Zou et al., Nature Genetics 51, pages 12–18 (2019)

Here I show how to build a Convolutional Neural Network (CNN) based classifier for per-sequence prediction of the ancient status of a DNA sequence, without mapping to a reference genome.

But let us start by looking at Ancient DNA sequences.

Looking at Ancient DNA Sequences

For demonstration purposes I will be using the draft Neanderthal genome, a low-coverage sequencing effort from 2010.

The file is an alignment (bam-file) of the DNA sequences to the human genome hg18, but it does not have to be an alignment; we could equally well use the raw sequences (fastq-file).

We can read the sequences in Python using the handy pysam module and plot the distribution of sequence lengths.
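A minimal sketch of this step might look as follows (the bam-file name is an assumption; the file is streamed without requiring an index):

```python
import pysam
import matplotlib.pyplot as plt

# Stream the reads from the bam-file (file name is hypothetical)
bam = pysam.AlignmentFile("neanderthal_hg18.bam", "rb")
lengths = [read.query_length for read in bam.fetch(until_eof=True)]
bam.close()

# Plot the distribution of read lengths
plt.hist(lengths, bins=50)
plt.xlabel("Read length (nt)")
plt.ylabel("Number of reads")
plt.show()
```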

It is obvious that there is a wide spread of DNA sequence lengths, from approximately 20 nucleotides (nt) to 140 nt long sequences (called reads in bioinformatics jargon).

This is a known fact which tells us that Ancient DNA is degraded, i.e. it breaks down over time into small fragments.

Therefore the length of a sequence can be very informative for inferring whether it is ancient or not: the shorter the sequence, the higher the likelihood that it is ancient.

However, in this analysis I am specifically interested in ancient DNA motifs, regardless of the length of the sequence.

Therefore I will perform the analysis on equally long modern and ancient sequences and leave the sequence length information as a pleasant bonus which everyone can use in addition to the ancient pattern information.

Since I am going to use a modern French DNA sample which was sequenced in 2010 with a read length equal to 76 nt, I will select only 76 nt long Neanderthal sequences and then an equal amount of modern sequences for training my CNN.

In total we got almost half a million reads, 50% of them ancient. For simplicity I selected only reads from the first 10 chromosomes.
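A minimal sketch of this selection step; the file names and the hg18-style chromosome naming are assumptions:

```python
import pysam

# Chromosome naming "chr1".."chr10" is an hg18-style assumption
CHROMOSOMES = {f"chr{i}" for i in range(1, 11)}

def select_reads(bam_path, read_length=76):
    """Collect sequences of a fixed length from the first 10 chromosomes."""
    selected = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.query_length == read_length and read.reference_name in CHROMOSOMES:
                selected.append(read.query_sequence)
    return selected

# Hypothetical file names for the ancient and modern samples
ancient = select_reads("neanderthal_hg18.bam")
modern = select_reads("french_hg18.bam")[:len(ancient)]  # equal amounts

sequences = ancient + modern
labels = [1] * len(ancient) + [0] * len(modern)  # 1 = ancient
```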

This amount of data is truly big; compare it with the ~60 000 examples in the canonical MNIST and CIFAR10 Machine Learning data sets widely used for practicing image recognition with CNNs.

Preparing Genomics Data for CNN

Next we are going to perform the steps standard for all one-dimensional CNNs, such as one-hot encoding of the sequences and splitting the data into training and testing sub-sets.

Here I will follow the protocol from Zou et al., Nature Genetics 51, pages 12–18 (2019):
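A minimal sketch of the one-hot encoding and the train/test split, assuming the sequences and labels lists from the selection sketch above (the split proportion is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed inputs: 'sequences' is a list of 76 nt strings,
# 'labels' is a 0/1 list (1 = ancient Neanderthal read)
mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_encode(seq):
    # Encode a sequence as a 76 x 4 matrix, one row per nucleotide
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, nucleotide in enumerate(seq):
        if nucleotide in mapping:  # skip ambiguous bases like 'N'
            encoded[i, mapping[nucleotide]] = 1.0
    return encoded

X = np.array([one_hot_encode(s) for s in sequences])
y = np.array(labels)

# Hold out a test set; stratify keeps the 50/50 ancient/modern balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```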

Now everything is ready for running the CNN, so let us do it!

Construct and Run 1D CNN Classifier

Finally, it is time to start training a simple binary (ancient vs. non-ancient) 1D CNN classifier.

I will create a simple one-block VGG-like architecture with two Convolutional layers followed by one Max Pooling layer.

I will also apply a small L1 norm weight regularization to prevent overfitting.
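A minimal Keras sketch of such an architecture; the numbers of filters, kernel sizes and the regularization strength are assumptions chosen for illustration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense
from tensorflow.keras.regularizers import l1

# One VGG-like block (Conv -> Conv -> MaxPooling), then a dense head
model = Sequential([
    Conv1D(32, kernel_size=5, activation='relu',
           kernel_regularizer=l1(1e-4), input_shape=(76, 4)),
    Conv1D(32, kernel_size=5, activation='relu',
           kernel_regularizer=l1(1e-4)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # binary: ancient vs. non-ancient
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, validation_split=0.25,
                    epochs=10, batch_size=64)
```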

VGG-like architecture of the 1D CNN classifier

The training was performed on ~300 000 ancient and modern sequences and took approximately 5 hours on my laptop with 4 cores.

Let us check how the loss and accuracy curves behave (a minimal plotting sketch is shown below). The model is clearly overfitting but still reaches an impressive 84% accuracy on the validation data set (~100 000 sequences).
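Assuming the history object returned by model.fit() above, the curves can be plotted like this (the metric key 'accuracy' depends on the Keras version):

```python
import matplotlib.pyplot as plt

# Loss and accuracy for the training and validation sets
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history['loss'], label='training')
ax1.plot(history.history['val_loss'], label='validation')
ax1.set_xlabel('Epoch'); ax1.set_ylabel('Loss'); ax1.legend()
ax2.plot(history.history['accuracy'], label='training')
ax2.plot(history.history['val_accuracy'], label='validation')
ax2.set_xlabel('Epoch'); ax2.set_ylabel('Accuracy'); ax2.legend()
plt.show()
```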

The oscillating behavior of the validation curves is typical for L1 norm regularization.

To make it more stable, Dropout would be preferable; we could also increase the batch size or reduce the learning rate.

Let us now perform the final evaluation of the model on the hold-out test data set with approximately 133 000 sequences. The final evaluation of the model again shows 84% accuracy in predicting the ancient status.

The confusion matrix confirms this number.
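A minimal evaluation sketch, assuming the model and the X_test/y_test split from above:

```python
from sklearn.metrics import confusion_matrix

# Accuracy on the hold-out test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# Threshold the sigmoid output at 0.5 to get class labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print(confusion_matrix(y_test, y_pred))
```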

What about interpretation of the CNN? Saliency maps are an elegant method to demonstrate the feature importances of the CNN, i.e. which nucleotides are most informative for the prediction of the ancient status of each sequence. We clearly see that the ancient vs. non-ancient signal seems to come from the ends of the sequences.
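For reference, a vanilla saliency map can be sketched with a gradient computation like the following (assuming the TensorFlow/Keras model above):

```python
import tensorflow as tf

# Gradient of the prediction with respect to the one-hot input
x = tf.convert_to_tensor(X_test[:1])  # one example sequence
with tf.GradientTape() as tape:
    tape.watch(x)
    prediction = model(x)
grads = tape.gradient(prediction, x)

# Per-position importance: maximal absolute gradient over the 4 channels
saliency = tf.reduce_max(tf.abs(grads), axis=-1).numpy().ravel()
```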

This makes a lot of sense, as it is known that the ends of ancient sequences are exposed to deamination, and thus degradation, which leads to an increased frequency of C/T and G/A substitutions at the ends of the reads compared to the null hypothesis of a uniformly distributed frequency of all other substitutions.

This is what mapDamage captures.

However, in contrast to mapDamage, we were able to detect this hidden deamination pattern without comparing the ancient sequences to the reference genome.

Moreover, we now have a model which predicts “ancient or non-ancient” for each read, while mapDamage uses statistical inference via averaging across thousands of reads.

Usually, it is very difficult to extract a lot of Ancient DNA, so every read matters.

In other words, we cannot afford to calculate statistics across a number of reads; we want an ancient status for each and every valuable read.

In addition, analysis of the ancient microbiome (reads originating from ancient microbes rather than the human/animal host) often turns up ancient viruses, which typically have a dozen reads aligned to the very short viral genome.

In this case it is extremely hard to perform deamination analysis with mapDamage due to the lack of statistical power.

Summary

Here I demonstrated the importance of Deep Learning for the Ancient DNA research area.

I showed how to create a simple Convolutional Neural Network (CNN) classifier of the ancient vs. non-ancient status of each DNA sequence without a reference genome.

I hope you will find this post useful and get inspired to develop Deep Learning for the exciting Ancient DNA research area.

Follow me on Twitter @NikolayOskolkov and check out the complete Jupyter notebook on my GitHub.
