“Deep learning” about transcription factor-DNA binding

“Very well” or “not very much” are probably not the best labels.

And ChIP-seq comes to the rescue!

ChIP-seq stands for chromatin immunoprecipitation (ChIP) assays combined with next-generation sequencing.

If you have no idea what this means, here’s a refresher.

High profile sequencing tells us what genes are active, and how much they are transcribed

Basically, ChIP sequencing (ChIP-seq) is a powerful method for identifying genome-wide DNA binding sites for transcription factors and other proteins.

It tells us where on the DNA a given protein binds.

Notice the “peaks” or spikes on the graph = a lot of activity in that area

Now, the peak regions are the black/colored parts on the graph. They essentially mean that a lot of sequence reads, and therefore a lot of protein binding, come from that specific area.

Each peak has a signal value (the number of sequence reads that come from that genomic location).

The signal value quantifies how well the TF will bind to the DNA.

What can we do with ChIP-seq?

If we compare ChIP-seq peak regions of diseased and non-diseased cells, we can pinpoint genetic variants that may cause or contribute to that disease!

Another way we can use it is to determine how gene expression changes from cell type to cell type. This could give us a huge insight into how we can differentiate induced pluripotent stem cells!

How will we use machine learning?

What if we could predict, given a specific genetic sequence, the signal value for a specific transcription factor?

Basically:

Input: raw DNA sequence (with no prior knowledge of variants)

Output: real-valued ChIP-seq signal value

The machine is going to learn this through deep learning.

So today, we’re going to be applying a convolutional neural network (CNN).

You’ve seen CNNs excel at image classification; now see them excel at genomic data analysis.

If images are just matrices of numbers, then technically genomic data is just a matrix of As, Ts, Cs, and Gs.

And CNNs are excellent at pattern recognition.

Overview of the deep learning model

First, raw DNA sequences are inputted into the model.

Each input sequence is converted into a one-hot matrix with 4 rows and 300 columns.

The 4 rows represent the four bases (A, C, T, G), while the 300 columns are the positions of the sequence, which is 300 bp (nucleotides) long.
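To make the one-hot idea concrete, here is a minimal sketch in plain Python. The row order (A, C, T, G) just follows the order listed above, and treating any other symbol (such as N) as an all-zero column is my assumption, not something taken from the DeFine code.

```python
# Row order follows the order the bases are listed in the text (an assumption).
BASES = "ACTG"
SEQ_LEN = 300


def one_hot(seq, length=SEQ_LEN):
    """Return a 4 x length matrix (a list of 4 rows) for a DNA string."""
    if len(seq) != length:
        raise ValueError(f"expected {length} bases, got {len(seq)}")
    seq = seq.upper()
    # matrix[r][c] is 1.0 when position c holds base BASES[r], else 0.0.
    # Any other symbol (such as N) leaves its column all zeros (an assumption).
    return [[1.0 if base == row_base else 0.0 for base in seq]
            for row_base in BASES]


matrix = one_hot("ACGT" * 75)
print(len(matrix), len(matrix[0]))  # prints: 4 300
```

Each column has a single 1.0 marking which base sits at that position, which is exactly the kind of numeric grid a convolutional filter can slide over.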

Because DNA is a double helix, and TFs can recognize either strand of DNA at a given location, the model is fed both the forward sequence and the reverse complementary sequence.

It would be a more accurate representation if one of the spidermen were flipped upside down.

Again, because the TFs can recognize either strand of DNA at a given location, convolutional layers for both sequences share the same set of filters.
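Here is a small sketch of why both strands matter. It builds the reverse complement of a sequence and scans both strands with the same filter. The toy filter (it scores the motif GAT) and the helper names are hypothetical, purely for illustration; in the real model the filter weights are learned.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq):
    """Read the opposite strand: reverse the sequence and swap each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def scan(seq, filter_scores):
    """Slide one filter along seq and return a score for every position.

    filter_scores holds one dict per filter column, mapping base -> weight.
    """
    width = len(filter_scores)
    return [
        sum(filter_scores[i].get(seq[start + i], 0.0) for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


# Hypothetical toy filter that responds to the motif "GAT".
toy_filter = [{"G": 1.0}, {"A": 1.0}, {"T": 1.0}]

forward = "CCATCCC"
reverse = reverse_complement(forward)

# The same filter is applied to both strands (shared weights).
forward_scores = scan(forward, toy_filter)
reverse_scores = scan(reverse, toy_filter)
print(reverse)                                   # prints: GGGATGG
print(max(forward_scores), max(reverse_scores))  # prints: 2.0 3.0
```

The full motif only shows up on the reverse strand, so a model that looked at the forward sequence alone would miss it.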

Using the architecture proposed in DeFine [1], the model looks something like this:

The convolutional neural network architecture in DeFine.

First, we have two convolutional layers, which automatically extract features from the data.

They’re followed by ReLU layers, which pass on only the filter responses above a threshold (set, in effect, by the biases learned during training).

Then they simultaneously go through max pooling and average pooling layers.

Max pooling outputs the most prominent activation signal in a sequence for each filter, while average pooling considers the whole sequence context by taking the average of the filter scanning results over every position in the sequence.

Both outputs are combined into one vector.

It then goes through batch normalization.

Then a fully connected layer.

A dropout layer (with a probability of 0.5) is then applied to help mitigate overfitting.

Then a final fully connected layer.

And finally a regression layer, which predicts a ChIP-seq signal intensity.

Now let's go back to how we preprocessed the data.

(I know, everyone’s favourite part about machine learning).
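Before we get to the data, here is a minimal sketch of the ReLU and pooling steps from the architecture, in plain Python. The activation numbers are made up, and the function names are mine; this only illustrates the idea, it is not the DeFine implementation.

```python
def relu(values):
    """Keep positive activations and set everything else to zero."""
    return [max(0.0, v) for v in values]


def pool_features(filter_outputs):
    """Combine max pooling and average pooling into one feature vector.

    filter_outputs holds one list of per-position activations per filter.
    """
    max_part = [max(acts) for acts in filter_outputs]              # strongest hit
    avg_part = [sum(acts) / len(acts) for acts in filter_outputs]  # whole context
    return max_part + avg_part


# Two hypothetical filters scanned over four positions each.
activations = [
    relu([-1.0, 0.5, 3.0, 0.5]),
    relu([-2.0, -1.0, 1.0, 0.0]),
]
features = pool_features(activations)
print(features)  # prints: [3.0, 1.0, 1.0, 0.25]
```

The first half of the vector says how strong the best match was for each filter, and the second half says how active each filter was across the whole sequence.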

Data preprocessing!

The awesome ENCODE (Encyclopedia of DNA Elements) project compiled an open-source database of many cell lines with all kinds of sequencing data.

They have a couple of cell lines that have both ChIP-seq and whole-genome sequencing data.

I will be using the K562 cell line (with 79 transcription factors).

The whole-genome sequencing comes from here.

The reference genome is GRCh37.

Note: the model must be trained separately on each transcription factor.

To prepare the data for training, the peak regions (signal values) of each transcription factor were extracted from peak calling results.

So now we have the signal values of the TF for each region where it binds.

The top 1% of signal values for each TF were discarded (outliers with extremely high signal values).

The remaining signal values were then log transformed and normalized by min-max scaling to between 0 and 1.

We do this to make our numbers easier to work with!

Then, the genomic sequences of the peaks were extracted from the reference genome according to the peak regions.

Peak sequences were fixed to 300 bp (which we can then input into our model) by adding Ns (N stands for any nucleotide) or deleting nucleotides.

Finally, to augment the data with negative examples, we do a final step:

Randomly take sequences from regions with no known TF binding and set their signal values to zero.

The number of randomly selected sequences with a zero value should equal the number of ChIP-seq peak sequences with a signal intensity value.

Lastly, the data was split 70% for training, 15% for validation and hyperparameter tuning, and 15% for testing.
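Here is a rough sketch of those preprocessing steps in plain Python. A few details are my assumptions rather than anything stated above: I use log(1 + x) for the log transform, I pad and trim symmetrically around the centre of the peak, and the function names and the random seed are made up. Sampling the zero-valued background sequences is left out.

```python
import math
import random


def normalize_signals(signals, drop_fraction=0.01):
    """Drop the top 1% of signals, log-transform the rest, scale to [0, 1].

    Returns the indices of the peaks that were kept and their scaled values.
    """
    n_drop = int(len(signals) * drop_fraction)
    by_value = sorted(range(len(signals)), key=lambda i: signals[i])
    dropped = set(by_value[len(signals) - n_drop:])
    kept = [i for i in range(len(signals)) if i not in dropped]
    # log1p(x) = log(1 + x); the exact log transform is an assumption.
    logged = [math.log1p(signals[i]) for i in kept]
    low, high = min(logged), max(logged)
    span = (high - low) or 1.0  # avoid dividing by zero if all values match
    return kept, [(v - low) / span for v in logged]


def fix_length(seq, length=300):
    """Trim long sequences or pad short ones with N to exactly `length`."""
    if len(seq) >= length:
        start = (len(seq) - length) // 2  # centred trimming is an assumption
        return seq[start:start + length]
    pad = length - len(seq)
    left = pad // 2
    return "N" * left + seq + "N" * (pad - left)


def split_dataset(examples, seed=0):
    """Shuffle, then split 70% / 15% / 15% into train, validation, test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


kept, scaled = normalize_signals([float(v) for v in range(1, 201)])
print(len(kept), min(scaled), max(scaled))  # prints: 198 0.0 1.0
print(len(fix_length("ACGT")))              # prints: 300
```

Returning the kept indices alongside the scaled values makes it easy to keep each signal matched to its peak sequence after the outliers are removed.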

Evaluating Accuracy

With this model, a simple percentage accuracy would not cut it, because the output is a real-valued signal rather than a class label.

We’re going to bring in some statistical correlation measurements!

This model will be evaluated based on both the Pearson correlation coefficient and the Spearman correlation coefficient.

They both measure how two sets of numbers (here, the predicted and the measured signal values) are associated with each other: Pearson looks at the linear relationship, Spearman at how well the rankings agree.

The higher the correlation, the stronger the relationship between the two!

Here’s a link for a quick recap of the two terms.
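If you would rather see the two coefficients as code, here is a small sketch in plain Python. The measured and predicted values are made up for illustration; in practice you would likely reach for a statistics library instead of writing these by hand.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, where tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up signal values, purely for illustration.
measured = [0.1, 0.2, 0.4, 0.8]
predicted = [0.01, 0.04, 0.16, 0.64]
print(round(pearson(measured, predicted), 3))
print(round(spearman(measured, predicted), 3))
```

In this toy example the predictions keep the peaks in the right order but not on a straight line, so Spearman comes out at 1 while Pearson is a bit lower.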

You just “deep learned” about how to predict transcription factor-DNA binding!

Just to recap…

Gene expression controls everything that happens in our body.

Gene expression is regulated by transcription factors binding to regulatory elements (DNA).

Deep learning can help us predict how likely a transcription factor is to bind to DNA, by predicting a ChIP-seq intensity value.

Comparing ChIP-seq values (or binding affinity of TFs to DNA in general) can help us understand the different variants that cause disease, help us understand stem cell differentiation, and generally teach us more about gene expression!

The code behind this project will be posted on my GitHub soon, stay tuned!

Have any questions?

If you have any questions, feel free to reach out at the following:

Email: gracelyn@gracelynshi.com

LinkedIn: https://www.linkedin.com/in/gracelynshi/

Twitter: https://twitter.com/GracelynShi

Website: gracelynshi.com

Sources

[1] M. Wang, C. Tai, L. Wei, DeFine: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants (2018), https://academic.oup.com/nar/article/46/11/e69/4958204
