Classic Computer Science Problems in ̶P̶y̶t̶h̶o̶n̶ Scala — Trivial Compression

Classic Computer Science Problems in ̶P̶y̶t̶h̶o̶n̶ Scala — Trivial CompressionBrian SchliningBlockedUnblockFollowFollowingMar 18I recently purchased the book Classic Computer Science Problems in Python to help me brush up on my python.

As I’m working through the Python code in the book, I’m taking some time to also port the Python to Scala.

Below is an example the illustrates trivial compression for efficiently storing genes.

A little science refresher, genes are typically made of Five nucleobases — adenine (A), cytosine ©, guanine (G), thymine (T), and uracil (U).

Uracil (U) and thymine (T) are very similar except uracil is found in RNA, thymine (T) replaces uracil in DNA.

Below I’ll show the example code from the book for compressing the in-memory representation of a DNA sequence, which can have bases of A, C, G, or T.

For those new to computer science, compression techniques allow you to use less memory to store information.

Since we only have four possible nucleotide values we can encode them using 2 bits per character: 00, 01, 10, or 11.

This will compress the same DNA information into at least 1/4 of the original memory, depending on the character set used for your String encoding.

PythonIn this code, each nucleotide is converted to its corresponding 2-bit representation and prepended to bit_string .

To decode it, we reverse the mapping; we decode each 2-bit chunk of the bit_string back to the corresponding character.

Python makes this possible as its integer type just keeps expanding and can be as large as you need it to be.

We can effectively add bits to it until our computer runs out of memory.

# Usagefrom sys import getsizeoforiginal: str = "TAGGGATCTAACCGTTATATATATATAGCCATGGATCGATTATATAGGG" * 100print("original is {} bytes".

format(getsizeof(original)))compressed: CompressedGene = CompressedGene(original)print("compressed is {} bytes".

format(getsizeof(compressed.

bit_string)))print(compressed)print("original and decompressed are same size: {}".

format(original == compressed.

decompress()))ScalaIn Scala, I’m using essentially the same type of compression.

Instead of a magically expanding integer value, like Python has, Scala provides a BitSet that stores arbitrary sequences of bits.

A BitSet is an interesting class.

You flip a bit by using its add method.

For example, bitSet.

add(1) flips the bit at position 1 in the BitSet.

You then can query if a bit is flipped or not with the apply method, e.

g.

bitSet(1) == true.

Note that in order to decode the bit set we have to track the number of nuleotides in our original DNA sequence.

I use a case class to store the BitSet and nucleotide length with each other.

Disclaimer: I’ve used an imperative style below rather than a more recursive or functional style of programming.

This was done to keep the code as conceptually-similar to the python as possible.

// Usageval original = "TAGGGATCTAACCGTTATATATATATAGCCATGGATCGATTATATAGGG" * 100val compressed = CompressedGene(original)println(compressed).

. More details

Leave a Reply