Optimizing Source-Based-Language-Learning using Genetic Algorithm

Most would choose option #2 — because a sentence is closer to what I call a ‘cashable result’.

Let’s look at some stats of this naive approach:

Learning top 100 most occurring Root Words in the Quran:
Total % of Corpus Covered: ~64% of the Quran
# of Verses Completed: 91
# of Words Covered: 520
Avg Length of Completed Verses: 5.71 words per verse

So, though we will have covered ~64% of the Quran, we will only have 91 verses (tangible, cashable rewards) under our belt after learning the 100 different root words.
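To make the naive baseline concrete, here is a minimal sketch of how such stats could be computed. The toy `verses` data and the function names are hypothetical, and it assumes each verse is simply a list of root words and that "words covered" means the words inside completed verses; the real corpus is more involved.

```python
from collections import Counter

# Hypothetical toy corpus: each verse is a list of root-word IDs.
verses = [
    ["a", "b"],
    ["a", "c", "d"],
    ["b", "b", "a"],
    ["e", "f"],
    ["a", "e"],
]

def top_k_roots(verses, k):
    """Return the k most frequent roots across the whole corpus."""
    counts = Counter(root for verse in verses for root in verse)
    return {root for root, _ in counts.most_common(k)}

def coverage_stats(verses, known):
    """Corpus coverage and completed-verse counts for a set of known roots."""
    total_words = sum(len(v) for v in verses)
    covered_words = sum(1 for v in verses for r in v if r in known)
    # A verse counts as completed only if every root in it is known.
    completed = [v for v in verses if all(r in known for r in v)]
    return {
        "corpus_covered": covered_words / total_words,
        "verses_completed": len(completed),
        "words_in_completed": sum(len(v) for v in completed),
    }

known = top_k_roots(verses, k=2)
stats = coverage_stats(verses, known)
print(sorted(known), stats)
```

Even in this tiny example the two frequent roots cover more than half of the words but complete only two of the five verses, which is the gap between coverage and cashable results described above.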

I think we can do better.

The question becomes: which 100 root words can I learn so as to maximize the number of complete verses I can understand?

No problem, let’s try all the combinations, calculate how many verses they complete, and then pick the one that covers the most verses.

Even with a reduced set of only 1250 root words, we have a total of 8.987762e+149 combinations (1250 choose 100).

Yes problem.
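The size of that search space can be checked directly. This is a small sketch using Python's standard `math.comb`; the variable names are illustrative only.

```python
import math

n_roots = 1250   # size of the reduced root-word set
n_pick = 100     # number of root words to learn

# Exact number of ways to choose 100 roots out of 1250.
combinations = math.comb(n_roots, n_pick)

print(f"{combinations:.6e}")    # scientific notation
print(len(str(combinations)))   # number of decimal digits
```

The printed value should agree with the figure quoted above. Scoring each candidate set one by one is clearly out of reach at this scale.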

Cue Genetic Algorithm.

GA Solution (and how a GA works)

GAs are great for this kind of crazy combinatorial stuff.

A more detailed overview of what a GA is and how a GA can be programmed can be found here.

For now, I will skip the technical details and give a generic overview of what we are doing.

1. We randomly create a population of multiple possible solutions (each solution is called a chromosome): we pick out various combinations of 100 Root Words (RWs) out of the corpus. Each chromosome essentially has a different gene structure (i.e. a different combination of RWs). So if a solution has a root word X, we say the chromosome has gene X ‘selected/activated’.

2. We calculate each solution’s fitness (think evolution theory). This is a score of how ‘good’ a solution is given a specific objective. In our case, it is relative to how many verses this specific set of 100 RWs will complete. Verses with more words will be weighted higher.

3. Select the ‘best’ solutions (those with the highest fitness scores) as parents and breed by crossover. This is exactly as you would imagine it: you take 2 parent chromosomes, split both of them at some random point, and join the first part of one with the second part of the other. This creates children chromosomes. We now have new possible solutions.

4. To add some variety, we then mutate the children by randomly activating or de-activating random genes (i.e. randomly add or remove any RW from a solution).

5. After crossover (step 3) and mutation (step 4), we may no longer have exactly 100 RWs in a given solution. To fix this, we go through a repair step: if a solution has more or fewer than 100 RWs, we add/remove random RWs until it has exactly 100.

6. Repeat steps 2–5 multiple (like thousands of) times and slowly, you will start breeding better and better chromosomes (i.e. higher fitness scores).

In this way, even without exploring all the combinations, we will gradually start moving towards high performing solutions.

Recall that our fitness scoring is based on how many verses the combination of RWs completes; therefore, we will end up with combinations that are optimized to complete as many verses as possible.

Through this process, the GA will favour selecting root words that lead to completed verses.
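The process described above can be sketched in a few dozen lines. This is a toy illustration, not the code behind the reported results. Several details are assumptions: fitness is taken as the total word count of completed verses, parents are the fittest half of the population, and the toy corpus, function names, and parameters such as `rate` are all hypothetical.

```python
import random

def fitness(chromosome, verses):
    """Total length of fully covered verses (longer verses weigh more)."""
    return sum(len(v) for v in verses if all(r in chromosome for r in v))

def crossover(parent_a, parent_b, all_roots, rng):
    """Single-point crossover over the ordered list of genes (roots)."""
    point = rng.randrange(1, len(all_roots))
    head, tail = all_roots[:point], all_roots[point:]
    return {r for r in head if r in parent_a} | {r for r in tail if r in parent_b}

def mutate(chromosome, all_roots, rate, rng):
    """Flip each gene (selected / not selected) with probability `rate`."""
    flipped = {r for r in all_roots if rng.random() < rate}
    return set(chromosome) ^ flipped

def repair(chromosome, all_roots, size, rng):
    """Add or remove random roots until exactly `size` roots are selected."""
    chromosome = set(chromosome)
    while len(chromosome) > size:
        chromosome.remove(rng.choice(sorted(chromosome)))
    missing = [r for r in all_roots if r not in chromosome]
    rng.shuffle(missing)
    while len(chromosome) < size:
        chromosome.add(missing.pop())
    return frozenset(chromosome)

def run_ga(verses, size, pop_size=50, generations=100, rate=0.02, seed=0):
    rng = random.Random(seed)
    all_roots = sorted({r for v in verses for r in v})
    # Random initial population of `size`-root chromosomes.
    population = [frozenset(rng.sample(all_roots, size)) for _ in range(pop_size)]
    for _ in range(generations):
        # Score every chromosome and keep the fittest half as parents.
        ranked = sorted(population, key=lambda c: fitness(c, verses), reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, all_roots, rng)
            child = mutate(child, all_roots, rate, rng)
            children.append(repair(child, all_roots, size, rng))
        population = parents + children
    return max(population, key=lambda c: fitness(c, verses))

# Hypothetical toy corpus: each verse is a list of root-word IDs.
verses = [["a", "b"], ["a", "c", "d"], ["b", "b", "a"],
          ["e", "f"], ["a", "e"], ["c", "d"], ["c"]]
best = run_ga(verses, size=3)
print(sorted(best), fitness(best, verses))
```

Because the parents are carried over unchanged into the next generation, the best fitness in the population never decreases from one generation to the next. Other selection schemes (tournament, roulette wheel) would fit the same skeleton.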

What results does a process like this yield?

After running for 1.5 hours (though you could leave it running for days even), with a population size of 5000 chromosomes, we get the following stats:

Learning 100 Root Words in the Quran selected by GA:
Total % of Corpus Covered: ~55% of the Quran
# of Verses Completed: 226
# of Words Covered: 1262
Avg Length of Completed Verses: 5.58 words per verse

We sacrifice total corpus coverage (which is a superficial metric of reward) for a whopping total of 226 completed verses (cashable reward full of capital) spanning 1262 total words from a mere 100 unique root words. Compare that to 91 verses spanning 520 total words in our naive solution.

We’ve already made some headway — creating a set of 100 beginner RWs to learn to maximize our reward — but this is still not proper DIY-style self-learning because we’d have to learn 100 words to reap the rewards… not something that can be done in a single sitting.

And in proper fashion, each sitting must give rewards to keep us motivated: we are impatient, greedy, hungry amateurs after all.

Using GA Solution to create Optimized Lesson Plans

We can use the above GA approach to create optimized lesson plans: each lesson gives us some homework (learning RWs) designed to reap as many rewards as possible (# of completed verses).

We simply run the GA incrementally, selecting 10 RWs at a time.
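One way to picture the incremental runs is the sketch below. Each lesson picks new roots that complete the most verse content given everything learned in earlier lessons. To keep the block short, a plain random search stands in for the GA, so `pick_lesson` is a placeholder rather than the optimizer used for the results here. The toy corpus and all names are hypothetical.

```python
import random

def completed_words(known, verses):
    """Total words in verses whose roots are all known."""
    return sum(len(v) for v in verses if all(r in known for r in v))

def pick_lesson(known, verses, lesson_size, rng, trials=200):
    """Stand-in optimizer (random search); a GA run would take its place."""
    candidates = sorted({r for v in verses for r in v} - known)
    k = min(lesson_size, len(candidates))
    best, best_score = set(), -1
    for _ in range(trials):
        lesson = set(rng.sample(candidates, k))
        # Score the lesson together with everything already learned.
        score = completed_words(known | lesson, verses)
        if score > best_score:
            best, best_score = lesson, score
    return best

def build_lessons(verses, lesson_size, n_lessons, seed=0):
    """Build lessons one at a time, each conditioned on the previous ones."""
    rng = random.Random(seed)
    known, lessons = set(), []
    for _ in range(n_lessons):
        lesson = pick_lesson(known, verses, lesson_size, rng)
        if not lesson:   # no unlearned roots left
            break
        lessons.append(lesson)
        known |= lesson
    return lessons

# Hypothetical toy corpus: each verse is a list of root-word IDs.
verses = [["a", "b"], ["a", "c", "d"], ["b", "b", "a"],
          ["e", "f"], ["a", "e"], ["c", "d"], ["c"]]
lessons = build_lessons(verses, lesson_size=2, n_lessons=3)
for number, lesson in enumerate(lessons, start=1):
    print(number, sorted(lesson))
```

The key design point is that each lesson is scored on `known | lesson`, so later lessons are rewarded for finishing verses that earlier lessons left partly covered.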

Sample Set of 10 Lesson Plans (w/ 10 RW to learn in each)

I’ll present a beginner set of 10 Lesson Plans generated from incrementally running the GA for some hours.

Each optimized lesson requires you to learn 10 RWs.

Ideally, each lesson can be completed in a single sitting (though it will require some time and perhaps review until you move on to the next one).

Total % of Corpus Covered (after all lessons): ~51%
Total # of Verses Completed: 228

% of Corpus Covered per Lesson:

Lesson   % of Corpus Covered
1        12.03%
2        8.14%
3        7.85%
4        2.01%
5        6.83%
6        3.08%
7        3.55%
8        1.79%
9        2.75%
10       3.20%

Interestingly, incremental runs of the GA do not decrease our performance significantly (total coverage drops from 55% to 51%, but total # of completed verses goes from 226 to 228), hinting at well-optimized solutions.
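As a quick sanity check on the per-lesson figures, the snippet below accumulates them lesson by lesson. The numbers are copied from the list above; nothing else is assumed.

```python
# Per-lesson corpus coverage in percent, lessons 1 through 10.
per_lesson = [12.03, 8.14, 7.85, 2.01, 6.83, 3.08, 3.55, 1.79, 2.75, 3.20]

running = 0.0
for lesson, pct in enumerate(per_lesson, start=1):
    running += pct
    print(f"Lesson {lesson:2d}: +{pct:5.2f}%  cumulative {running:6.2f}%")

total = round(sum(per_lesson), 2)
first_three = round(sum(per_lesson[:3]), 2)
```

The total comes to 51.23%, matching the ~51% reported, and the first three lessons alone contribute 28.02%, which is more than half of it.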

Lesson Plan Overview Stats

Root Words to Learn per Lesson

Clicking the link under the Details column will direct you to corpus.Quran, which presents the meaning/context of the root word in order to learn/study it.

Completed Verses (Rewards) per Lesson

This table displays the verses that should be ‘accessible’ after the RWs in each lesson have been learned. Note that there are 2 different links that show the verses. Unfortunately, the links under corpus.Quran, though more detailed and annotated, do not have all the verses up yet and will often fail (redirect to 1:1).

Concluding Remarks (and how to learn from this)

The GA does what a computer does best, but it is still left to individuals to actually learn whatever it is we want to learn.

This could also serve as a resource for teachers to formulate curriculums.

Effort is obviously required to follow through.

The only reassurance we have after this is that our effort shall not (hopefully) be wasted, misdirected, or lost in an abyss of information.

As I mentioned earlier, the lack of a long-term overarching structure is partially what defines this approach.

Some final remarks on how one could approach self-learning with these guidelines:

There are prefixes and suffixes that often attach to the root words to define their meaning; these are not explicit ‘root words’ on their own.

The idea is that when you learn the basics, and then approach trying to understand the verses, you will pick up the grammar, prefix, and suffix rules over time through pattern recognition (even if you don’t formally know them).

In fact, you will probably search up the grammar rules at some point but it will be on a need-to-know basis.

5-minute YouTube videos on the basics of Quranic Arabic are probably a good preface before you dive into the described DIY approach.

Processes of pattern recognition will also help you to understand the forms of the sentences and structures of phrases.

This works well with this approach as we generally start with smaller verses with simple patterns.

As you learn more words, you approach longer verses which are made up of those simple patterns.

Patterns will repeat often until you’re confident you understand the rule.

There are always exceptions, but they are not to fret over.

A DIY approach begins with understanding the often-recurring.

Making sense of exceptions can come on a later day when you’ve picked up enough of the generic stuff that obscurity doesn’t seem so overwhelming.

It should be fun — from the momentum of ‘moving forward’.

If you’re not getting results, find another way.

Self-learning is too virtuous of a desire to contaminate with monotonous feelings.
