Annotating gene coordinates and gene lists — The python way

Here, I’ll describe how toCreate a source dataset from GENCODE Gene Set release (section A)Find genes in a list of input genomic coordinates using the source dataset (section B)Find genomic coordinates of a gene list using the source dataset (section C)Creating the GENCODE source dataset and using it to annotate lists of genes and coordinatesSection A: Creating a source dataset from GENCODE Gene Set releaseSTEP 01: Read the gff3 file into a pandas dataframegencode = pd.read_table(<path to .gff3>, comment="#", sep = " ", names = ['seqname', 'source', 'feature', 'start' , 'end', 'score', 'strand', 'frame', 'attribute'])gencode.head() gencode.info()There are nine columns in a gff3 file..They are seqname, source, feature, start, end, score, strand, frame, attribute..Comments in gff3 files start with “#”.STEP 02: Extract Genes in the gff3 file “feature = gene”gencode_genes = gencode[(gencode.feature == "gene")][['seqname', 'start', 'end', 'attribute']].copy().reset_index().drop('index', axis=1)This code shows how to extract GENCODE genes..However, if you need to extract other GENCODE feature types like transcripts, exons, CDSs and UTRs, this can be changed accordingly..For example, if you need to access transcripts, you can replace “gene” with “ transcript” (gencode.feature == “transcript”).STEP 03: Extract gene_name, gene_type, gene_status, level of each genedef gene_info(x):# Extract gene names, gene_type, gene_status and level g_name = list(filter(lambda x: 'gene_name' in x, x.split(";")))[0].split("=")[1] g_type = list(filter(lambda x: 'gene_type' in x, x.split(";")))[0].split("=")[1] g_status = list(filter(lambda x: 'gene_status' in x, x.split(";")))[0].split("=")[1] g_leve = int(list(filter(lambda x: 'level' in x, x.split(";")))[0].split("=")[1]) return (g_name, g_type, g_status, g_leve)gencode_genes["gene_name"], gencode_genes["gene_type"], gencode_genes["gene_status"], gencode_genes["gene_level"] = zip(*gencode_genes.attribute.apply(lambda x: gene_info(x)))This function reads GENCODE “attribute” column and access all key-value pairs..Then the filter command extracts relevant values using the gene_name, gene_type, gene_status and level keywords..Depending on the previous step (step 02), you can change the code..For example, you might need to access transcript_type, transcript_status and tag if you extracted all the transcripts in step 02..Similarly, you can change step 04 and 05 accordingly.STEP 04: Extract all known protein_coding genesgencode_genes = gencode_genes[gencode_genes['gene_status'] == 'KNOWN'].reset_index().drop('index', axis=1)gencode_genes = gencode_genes[gencode_genes['gene_type'] == 'protein_coding'].reset_index().drop('index', axis=1)STEP 05: Remove duplicates — Prioritize verified and manually annotated loci over automatically annotated locigencode_genes = gencode_genes.sort_values(['gene_level', 'seqname'], ascending=True).drop_duplicates('gene_name', keep='first').reset_index().drop('index', axis=1)Level attribute in the GENCODE gff3 file provides an assessment of the reliability of a given feature..Here, level 01 indicates that the element is manually annotated and experimentally verified (Eg. genes and transcripts by RT-PCR-seq and pseudogenes by three different methods)..Level 02 means that the feature is annotated manually by HAVANA and merged with features predicted by ENSEMBL automated pipeline..Lastly, level 03 indicates that the element is predicted only by the ENSEMBL automated pipeline.As described above, step 04 identified all the known protein-coding genes.. More details

Leave a Reply