Recreational programming: a zero dependency Zipf’s Law adventure

It’s a handful of words and a hyperlink; I don’t think cross-browser concerns should be part of this site, especially not to the point of importing a whole library.

The “next word…” hyperlink doesn’t even have an event listener or an href value, so clicking it literally just refreshes the page.

That library is another 20 KB.

So that’s 270 KB of imports for a site with zero functionality that just renders a tiny bit of static content.

I don’t know about you, but I think that’s a smidgeon excessive.

Lady Gaga’s costumes are more conservative than this site’s dependencies.

With that site fresh in my memory when I came up with the idea for this project, I decided to see if I could build it with zero module/package dependencies.

I wanted to see what we could get out of raw HTML, JavaScript and CSS on the front end, and raw Python on the back end.

The only things I used were Node 8 and Python 3.

(As a side note, if you want to see the other side of the coin then have a look at this video, where a Kickstarted NES project called “Micro Mages” had to fit their entire game into just 40 KB.

They employed all sorts of ingenious tactics to help reduce the memory footprint of their game.

Definitely puts things in perspective, and should encourage you to respect the extra disk space, RAM and internet bandwidth we have now.

There’s a difference between leveraging our technology’s capabilities and just being lazy and wasteful.)

Project review

I’m not going to go through all the source code here; you can view it all on GitHub here.

If you want to give the webapp or the Zipf’s Law Python script a try yourself, have a look through the README.md; it’ll tell you what you need to do.

Ultimately I wrote around 600 lines of code for this project, with another 150 or so for the pure CSS loader I borrowed (stole) from a sample online.

The entire code base was collectively 24 KB unminified and uncompressed.

The most time-consuming part was definitely the Node server.

I was unaware of just how much Express abstracted away until I had to start manually sending static files and making sure every tiny little error didn’t completely crash the server.
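To give a feel for what that involves, here’s a minimal sketch (not the repo’s actual code, and skipping proper path sanitisation) of serving static files with nothing but Node’s built-in modules, including the kind of error handler that stops a bad request from taking the whole process down:

```javascript
// Hypothetical sketch: static file serving with only Node built-ins.
const http = require('http');
const fs = require('fs');
const path = require('path');

const MIME_TYPES = {
  '.html': 'text/html',
  '.js': 'application/javascript',
  '.css': 'text/css',
};

http.createServer((req, res) => {
  // Map '/' to index.html and resolve against a public directory.
  const urlPath = req.url === '/' ? '/index.html' : req.url;
  const filePath = path.join(__dirname, 'public', path.normalize(urlPath));

  const stream = fs.createReadStream(filePath);
  stream.on('open', () => {
    const mime = MIME_TYPES[path.extname(filePath)] || 'application/octet-stream';
    res.writeHead(200, { 'Content-Type': mime });
    stream.pipe(res);
  });
  // Without this handler, a request for a missing file would throw an
  // unhandled error and kill the process.
  stream.on('error', () => {
    res.writeHead(404);
    res.end('Not found');
  });
}).listen(8080);
```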

The two Python scripts (one for the webapp backend, and another for looking at multiple text files) were much easier, being only 60–80 lines each (25% of which were comments).

The most foreign experience was definitely the application script itself.

Almost all the front-end work I’ve ever done has used some kind of framework (predominantly React).

But now I found myself manually creating, adding and removing elements with vanilla JavaScript.

Doing that definitely helped explain why jQuery was so popular.
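For a taste of what that manual element juggling looks like, here’s a hypothetical sketch (not code from the project, and it assumes a ‘word-list’ element exists on the page):

```javascript
// Hypothetical sketch of manual DOM work with no framework.
const list = document.getElementById('word-list');

function addWordRow(word, count) {
  const row = document.createElement('li');
  row.textContent = word + ': ' + count;
  list.appendChild(row);
}

function clearRows() {
  // There's no declarative re-render to lean on; remove children by hand.
  while (list.firstChild) {
    list.removeChild(list.firstChild);
  }
}
```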

For the loading animation that shows while waiting for an uploaded file to be processed, I simply Googled some pure CSS loading examples and picked a cool-looking one, which can be found here.

I modified it slightly so that instead of 0s and 1s falling into the hole it was various words.

Simple as that, 3 KB of CSS and you have a cool, unique loader.

No need for 100 KB loading GIFs involving multi-coloured spinners and dancing bananas.

As for the graph, I just imported ChartJS….

NO! No I didn’t.

We ain’t using dependencies here, no siree.

All I needed was some horizontal bars, for which I could just use coloured-in divs.

So how do we do this? Well, with a rudimentary calculation in the Python script we can work out what percentage each word is of the total count, and then normalize those values between 0 and 100 so that the most common word is 100% and the occurrence ratios between words are preserved.

Then, that value can simply be used as the CSS ‘width’ property in the front end.
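Sketched out, the idea looks something like this (the real calculation lives in the Python script; this hypothetical front-end version just shows the principle):

```javascript
// Hypothetical sketch: normalize counts so the most common word maps to
// 100, then use each value directly as a div's CSS width.
function renderBars(container, wordCounts) {
  // wordCounts: [{ word, count }] for the top N words.
  const maxCount = Math.max(...wordCounts.map(entry => entry.count));
  wordCounts.forEach(({ word, count }) => {
    const bar = document.createElement('div');
    bar.className = 'bar';
    bar.textContent = word;
    // The most common word gets width: 100%; the rest keep their ratios.
    bar.style.width = (count / maxCount * 100) + '%';
    container.appendChild(bar);
  });
}
```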

Let’s have a look at the final product for the webapp, showing top 10 results from a sample Gutenberg text file.

For funsies I made it look like a site from the ’90s, which means plenty of disgusting neon colours!

Beautiful :’)

Zipf result

You probably saw the middle column in the above picture, which looks like the position of the word in the list followed by some decimal number.

Well, that brings us to the thrilling conclusion of this rollercoaster of an article: the results from my Zipf investigation!

The first number is obviously just the word’s position in the list, and the decimal number is the ratio of the most common word’s count to that word’s count.

According to Zipf’s Law, a word’s frequency is inversely proportional to its rank, so the two numbers should be roughly equal.

Now in the above picture they aren’t at all, but that was just from a single small text file (just a few hundred KB in size), done through the webapp.
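The check itself is trivial. Here’s a hypothetical sketch, written in JavaScript for consistency with the other snippets even though my actual script was Python:

```javascript
// Hypothetical sketch of the rank-vs-ratio check: under Zipf's Law,
// topCount / count(rank n) should be roughly n, so the printed ratio
// should sit close to each word's position in the list.
function zipfTable(sortedCounts) {
  // sortedCounts: [{ word, count }] sorted by count, descending.
  const topCount = sortedCounts[0].count;
  sortedCounts.forEach(({ word, count }, index) => {
    const position = index + 1;
    const ratio = topCount / count;
    console.log(position, ratio.toFixed(2), word);
  });
}
```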

So let’s break out the second Python script, which looks at multiple files, and start off with 35 MB worth of text files:

(The right column is the total frequency of that word, equivalent to the purple number in the webapp screenshot.)

So we can see that there were just over 6 million words counted all up.

However, there seems to be a curious pattern developing.

After the first couple of words, the ratio seems to be roughly half of what it should be.

Maybe my sample size wasn’t big enough… So I tried again with 254 MB of text files, but the result persisted:

There are over 40 million words here, and I made a reasonable effort to filter out any default Gutenberg text, punctuation, or other things which might interfere with the result.

Finally, just to confirm, I went up to 584 MB:

There are just under 95 MILLION words in this sample corpus, across thousands of text files, but the pattern is still there.

Weird.

However, after doing this larger sample size I printed off the top 20,000 results, rather than top 20 or so, and started scrolling through them.

I noticed that the distance between the position and the ratio was slowly narrowing, from ½ to ⅔ to ¾.

I kept scrolling and, lo and behold, I found the reconciliation point:

At around the 8700th word, it starts to behave as it should, i.e. the position roughly matches the ratio.

Success!

…right? Well, let’s scroll to the end:

Ah fiddlesticks.

It seems that it’s gone off in the opposite direction.

Now it’s 50% MORE than what it should be.

Also, I learnt a new word: hoo.

It’s allowed in Scrabble!

Anyway, it seems that within the first couple of words the ratio quickly drops to around half of what it should be, and from then on, over thousands of words, it gradually increases, surpassing where it should be and continuing upwards.

The increase almost seems linear too.

I could get a few gigabytes of files and look at the top 100,000 words to see where it ends up, but we’ve already got our answer for an initial investigation.

It doesn’t really seem that Zipf’s Law was holding up here.

There were still ways I could improve the word counting algorithm.

For example, if a token was entirely numbers, it was still treated as a word and put in the list:

But there was nothing major I could see that would account for the result.
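To be fair, filtering those out would be a one-liner; a hypothetical sketch, again in JavaScript:

```javascript
// Drop tokens that are entirely digits before they reach the counter.
function filterNumericTokens(tokens) {
  return tokens.filter(token => !/^\d+$/.test(token));
}

console.log(filterNumericTokens(['in', '1949', 'orwell', 'published', '1984']));
// -> ['in', 'orwell', 'published']
```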

So, why wasn’t Zipf’s Law showing itself?

Is it that my sample corpus wasn’t sufficient or appropriate? It was mostly books, so maybe they don’t represent natural enough language to show Zipf’s Law.

Or maybe almost 95 million words still wasn’t a big enough sample size.

Could it be an issue with my algorithm? The calculations were quite simple and there wasn’t much involved, though I could have made a logic error, or misinterpreted Zipf’s Law when converting it to code.

Is Zipf’s Law just a conspiracy, invented by the Illuminati (in collusion with big pharma, obviously), designed to distract us from their creation of The New World Order? Possibly.

I’m not going to rule it out.

Ultimately, I’m curious about my results and will likely follow them up, maybe making a post somewhere to seek help in explaining my figures.

I might even try applying the same algorithm to other data sets that purport to exhibit Zipfian distribution.

For now, it was fun to create a zero dependency project looking into something I found interesting, even if it led to an anticlimactic article.

What I learned

It doesn’t take much to do a little recreational project like this.

All in all, investigating Zipf’s Law and writing the code took me around 12 hours total, including some commenting and refactoring I did for the benefit of anyone wanting to have a quick look through it.

Recreational projects can be lots of fun, particularly when you make sure not to take them too seriously.

Pick something you saw in a YouTube video or heard mentioned in a podcast and see if you can explore it by hacking something together using the languages/tools of your choice.

Old habits die hard.

I found I often slipped back into excessive refactoring and commenting when doing this, even though it wasn’t required.

Projects like this can be great learning experiences.

Using swathes of libraries and modern tools, as many web developers do, means that lots of fundamental stuff is abstracted away from you.

Doing everything vanilla can help you gain a deeper understanding of the technologies you work with.

Using no dependencies helps remind you that you’re more capable than you give yourself credit for, and that any imposter syndrome you may suffer from is most likely in your head.

If you’re not careful, web development can devolve into stapling together a bunch of modules rather than writing your own code. Projects like this can remind you just how much you can accomplish without dependencies, and encourage you to do things yourself whenever possible rather than adding external modules willy-nilly.

Dependencies have their place.

On one hand they’re often overused, but on the other, doing everything vanilla can be slow, and there are definitely factors like scaling and extensibility that my project code sorely lacks (and which many modules have baked in).

As with all things, moderation is key.

Dependencies should be for things that are too large in scope for you to practically do, or things that you are just plain not capable of doing.
