How Computers See: Image Recognition and Medieval Pole Arms

The answer, for most neural networks, is a process called “back-propagation”.

It’s called this because it involves the final layer, the one that predicts the class, “feeding back” information through the earlier layers.

The first layer is made up of a set of very simple operators, which are called “neurons”.

They get this name because the way they operate mimics, at a basic level, the way the brain’s neurons function.

But don’t be fooled by the sophisticated-sounding name — their operation is incredibly simple.

The neurons look at the pixels in the image, and based on those pixels’ values, they pass on a single signal of their own.

When the neural network is first created, the neurons choose their thresholds completely at random — they make a guess.

The network makes a prediction for a whole set of training data, and then it checks how well it did.

The neurons which provided useful information get to keep their values, but those that led the network astray have their values adjusted.

Over very many iterations, the neurons are slowly trained to discern what information to keep, and what to discard.

By the end of the training process, they have learned what features to pass on to the final layer to maximise its chances of predicting the right class.
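If it helps to see that in code, here is a deliberately crude sketch of a single neuron and the kind of adjustment training applies to it. It’s a caricature rather than real back-propagation, and the names and sizes (1,600 inputs, for a 40-by-40-pixel image like those described later) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single "neuron": a weighted sum of its inputs, passed through a
# simple activation. The weights start out completely random.
weights = rng.normal(size=1600)   # one weight per pixel of a 40x40 image
bias = rng.normal()

def neuron(pixels):
    # Combine the pixel values and pass on a single signal.
    activation = np.dot(weights, pixels) + bias
    return max(0.0, activation)   # only fire if the signal is positive

def nudge(pixels, error, learning_rate=0.01):
    # Training: shift the weights slightly in the direction that
    # reduces the error. Real back-propagation computes exact
    # gradients through every layer; this is just the flavour of it.
    global weights, bias
    weights -= learning_rate * error * pixels
    bias -= learning_rate * error
```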

To achieve this for our pole arms, we’ll need a very large set of images of pole arms, each labeled with the correct name.

We also need a wide variety of different pole arms of each class.

We want to make sure that the network is learning the general principles of what distinguishes, say, a fauchard from a bardiche, and not just learning to recognise details of the particular images we chose.

Gathering this dataset turned out to be a huge amount of work, not just because I had to painstakingly search, crop, and filter dozens of images by hand, but because it turns out that there’s no consensus on what any of these things are called.

The average medieval peasant, it seems, was more concerned with staying alive than with the correct nomenclature for the weapon that helped them do so.

A “fauchard” then, is confidently defined by one source as a cousin of the glaive, with the addition of a rear-facing spike or hook.

Another, equally authoritative source claims the same weapon is a modified scythe — a forward-curving blade on the end of a pole.

I needed an authoritative source, and as I have done many times before, I turned to the Advanced Dungeons and Dragons 2nd Edition Player’s Handbook (Revised).

Here is how that august tome defines the seven classes of weapon my model will classify:

Bardiche: One of the simplest of pole arms, the bardiche is an elongated battle axe.

A large curving axe-head is mounted on the end of a shaft 5 to 8 feet long.

Bec de corbin: An early can-opener designed specifically to deal with plate armor.

The pick or beak is made to punch through plate, while the hammer side can be used to give a stiff blow.

The end is fitted with a short blade for dealing with unarmored or helpless foes.

Fauchard: An outgrowth of the sickle and scythe, the fauchard is a long, inward curving blade mounted on a shaft six to eight feet long.

Glaive: One of the most basic pole arms, the glaive is a single-edged blade mounted on an eight- to ten-foot-long shaft.

Guisarme: Thought to have derived from a pruning hook, this is an elaborately curved heavy blade.

Glaive-guisarme: Another combination weapon, this one takes the basic glaive and adds a spike or hook to the back of the blade.

Halberd: Fixed on a shaft five to eight feet long is a large axe blade, angled for maximum impact.

The end of the blade tapers to a long spear point or awl pike.

On the back is a hook for attacking armor or dismounting riders.

Those definitions will infuriate many military historians, I am sure, but for me they will suffice.

As well as collecting dozens of examples of each weapon type, I expanded my dataset in another way, by “synthesising” extra images.

That meant taking my existing images, flipping and stretching them, shifting them left and right, and speckling them with random noise.

This meant that each single image I collected could be included many times in my dataset.

By using stretched, manipulated, and especially flipped images, we help the algorithm focus on general shapes and relationships in the images, rather than specific details.
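In Keras, that augmentation recipe is only a few lines. The parameter values below are illustrative guesses rather than the exact settings I used:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def add_speckle(image):
    # Speckle the image with a little random noise.
    noise = np.random.normal(0.0, 0.05, image.shape)
    return np.clip(image + noise, 0.0, 1.0)

# Flips, shifts, stretches, and noise, applied on the fly.
datagen = ImageDataGenerator(
    horizontal_flip=True,        # mirror images left-to-right
    width_shift_range=0.1,       # shift up to 10% left or right
    height_shift_range=0.1,      # ...and up or down
    zoom_range=0.2,              # stretch and squash
    preprocessing_function=add_speckle,
)

# `images` (shape: n x 40 x 40 x 1, values 0-1) and `labels` are
# assumed from earlier; each pass over the data yields new variants.
# for batch_images, batch_labels in datagen.flow(images, labels): ...
```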

Finally, the images are desaturated (all colour is removed), and shrunk to only 40 pixels on each side.

This reduces the amount of data the algorithm has to consider, and dramatically increases the speed at which it can learn.
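That step, sketched here with the Pillow library (the `preprocess` helper is my own, not from any particular framework):

```python
import numpy as np
from PIL import Image

def preprocess(path):
    # Desaturate to a single grey channel and shrink to 40x40 pixels.
    image = Image.open(path).convert("L").resize((40, 40))
    # Scale the pixel values from 0-255 down to 0-1 for the network.
    return np.asarray(image, dtype=np.float32) / 255.0
```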

Like many things in machine learning, it takes a bit of tinkering to make it work right.

Neural networks take a range of parameters and settings, governing several esoteric aspects of their operation.

A network can have more than two layers, for example, boiling down the raw data into an ever-richer soup of meaningful features.

I will spare you the details of this.

Choosing these values is still a somewhat arcane practice, more akin to alchemy than chemistry.
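For the curious, though, here is roughly what such a network definition looks like in Keras. Every number here is an illustrative choice, not a recommendation:

```python
from tensorflow.keras import layers, models

# An illustrative small convolutional network for 40x40 greyscale
# images, classifying seven pole arm classes.
model = models.Sequential([
    layers.Input(shape=(40, 40, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),    # the "hidden layer" whose
                                            # outputs we reuse later
    layers.Dense(7, activation="softmax"),  # one probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```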

But the testing process is exactly the same as for any other classification algorithm.

Before we train the model, we set aside a proportion of the images, hiding them from our model.

These hold-outs don’t need to go through the process of stretching, speckling, and flipping.

Instead, we use them to check the accuracy of our neural network.

After learning what it can from a set of training images, can it correctly predict the classes of a set of images it has never seen before?

For each image, we ask the network to guess its class.

It returns a list of probabilities: for each class, how likely the network thinks it is that the image belongs to it.
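A sketch of that whole loop, assuming the `images`, `labels`, and `model` from the earlier sketches:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hold back 20% of the images as a test set the network never sees.
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels)

model.fit(x_train, y_train, epochs=20)        # learn from the training set

probabilities = model.predict(x_test)         # one row of 7 probabilities
predicted = np.argmax(probabilities, axis=1)  # ...per test image
accuracy = np.mean(predicted == y_test)
```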

Predicted classes for each of 25 test images — the image is shown next to the predicted probabilities of it belonging to each class.

The correct class is in blue, and the label is red if the class was incorrectly predicted.

In most cases, the model predicts the correct class with close to 100% probability.

Our algorithm is extremely accurate!

It makes a correct prediction for all but one of our forty or so test images — identifying a bec de corbin as a halberd.

With simple classes such as these, and such clear and small images, the recognition task is very straightforward for our algorithm.

With such good accuracy, it’s difficult to trust that the network has really learnt to identify pole arms, and is not just memorising some trivial facet of the images we’ve supplied to it.

Is there a way we can dig into the internal workings of the algorithm, to better understand how it arrives at its predictions?

You’ll recall from a previous essay, about extracting meaning from text, that we could use a series of mathematical operations to turn a collection of words describing a movie into a set of numerical values which contained some representation of the “meaning” of that movie.

With our neural network we’ve done something very similar — taken an image of a pole arm, and turned that into numerical representations of some meaningful information about that image.

It’s just a series of numbers, but it contains some information about the shapes in the image.

It turns out that these numerical representations of our pole arms have some very interesting properties, and they can help us understand more about how the network makes its classifications, and what exactly it has learned.

Just like with our movies, we can calculate similarity between our pole arm representations (or “embeddings”).

We can measure the difference between the numerical values, and use that to find the most or least similar images.

That reveals some interesting stuff.

If we simply measure the difference between the pixels in the image, we find images which are superficially similar, but which can represent quite different pole arms.

By contrast, finding similar embeddings can find images that are quite different, but which represent a similar design of weapon.
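Here is a sketch of both comparisons, assuming the model from the earlier sketch. The embeddings come from the penultimate layer, and `most_similar` is a hypothetical helper of my own:

```python
import numpy as np
from tensorflow.keras.models import Model

# A truncated model that stops at the penultimate layer, so its
# output is the embedding rather than the class probabilities.
embedder = Model(inputs=model.input, outputs=model.layers[-2].output)
embeddings = embedder.predict(images)

def most_similar(index, vectors):
    # Euclidean distance from one vector to every other vector.
    distances = np.linalg.norm(vectors - vectors[index], axis=1)
    distances[index] = np.inf           # ignore the image itself
    return int(np.argmin(distances))

flat_pixels = images.reshape(len(images), -1)
pixel_match = most_similar(0, flat_pixels)     # superficially similar
embedding_match = most_similar(0, embeddings)  # similar in "meaning"
```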

In the above image, we’ve taken a typical guisarme, and found the most similar images based on image similarity (i.e. how many pixels they have in common), and based on embedding similarity (i.e. the difference between their numerical representations).

Based on image similarity, the closest match is not a guisarme at all, but rather a fauchard, which happens to occupy a similar space in the frame.

But the embedding similarity has found another guisarme.

Interestingly, it’s found this guisarme despite the fact that it’s facing the opposite direction to the original.

That demonstrates something called reflection invariance.

Because we trained our model on flipped, stretched, and speckled transformations of our source images, it learnt to ignore those factors — it has learnt that a guisarme is a guisarme regardless of whether it faces left or right.

Another thing we can do with these embeddings of our images is to calculate averages.

For example, we can take all the embeddings for our glaives, and take the average of them.

That gives us a new embedding which represents the “glaiviest” possible glaive — the “ur-glaive”.

But we can’t really see what that glaive looks like — it’s just a string of numbers.

What we can do is find the glaives that are closest to that ideal.

We can take the glaives from our test set and order them from most to least “glaivy”.
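In code, both steps are short. `GLAIVE` here stands in for whatever integer labels the glaive class, and `embeddings` and `labels` are assumed from the earlier sketches:

```python
import numpy as np

# Average the embeddings of every glaive to get the "ur-glaive".
glaive_vectors = embeddings[labels == GLAIVE]
ur_glaive = glaive_vectors.mean(axis=0)

# Order the glaives from most to least "glaivy": closest to the
# average embedding first.
distances = np.linalg.norm(glaive_vectors - ur_glaive, axis=1)
ranking = np.argsort(distances)
```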

I’ve done that in the image below, with the most similar on the left, and the least on the right.

The left-hand glaive is extremely simple, and has no unusual features.

By contrast, the glaive on the right has all kinds of strange features — hooks and spikes.

It could almost pass as a voulge or bardiche.

You’ll notice though that the order of the images doesn’t quite match how you or I might sort them.

The fourth glaive, for example, looks to me very similar to the first, though the embeddings are apparently not so similar.

This is an important reminder — the network works in mysterious ways.

Though its results might sometimes coincide with our expectations, it can also sometimes confound those expectations.

It is not a human, and we should not expect human-like conclusions from it.

We can do other maths on these embeddings.

We can add them together.

If we take the “glaiviest” glaive, and the “guisarmiest” guisarme, and add their embeddings together, we can then find the image that best represents a combination of glaive and guisarme.

Happily, the result of this operation is a glaive-guisarme — a glaive that has been embellished with a guisarme-like hook.
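As a sketch of that arithmetic (with `ur_guisarme` computed in the same way as `ur_glaive` above):

```python
import numpy as np

# Add the average glaive and average guisarme embeddings together...
combined = ur_glaive + ur_guisarme

# ...then find the image whose embedding lies closest to that sum:
# in our case, a glaive-guisarme.
distances = np.linalg.norm(embeddings - combined, axis=1)
best_match = int(np.argmin(distances))
```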

What have we learnt?

We’ve learnt to respect the inventiveness of medieval peasants, and we’ve also learnt a little about neural networks.

By use of a “hidden layer”, neural networks are able to extract meaningful data from very complex sources — simple images in this case, but also sound, film, and — as we’ll see in a future essay — text.

They can use that meaningful information to make (often very accurate) classifications of new data.

Facial recognition software, more sophisticated recommendation algorithms, document classification, and many other systems all leverage neural networks in this way.

But the embeddings that neural networks generate are also very useful, and power a whole host of other applications: Google’s image search uses — in part — information extracted from images by a neural network.

Chatbots employ neural networks to encode the meaning of questions and answers.

It is tempting to ascribe human-like qualities to the intelligence of neural networks.

After all, they are in some ways imitations of human brain structures.

It would not be entirely surprising if they were able to mimic human-like behaviours.

But it’s important to remember that these systems have a very narrow field of knowledge — they’re trained to do one thing only — and they don’t care how they do it.

Our pole arm recogniser is very good at its job, and our experiments with the embeddings it generates show that, to some extent, it recognises the same features in the images that we do.

But we also saw that some of its results were quite surprising.

It made choices a human would not.

It is just as likely to pay attention to minute details in the images as to the larger structures; there’s nothing in its programming telling it what features it’s “supposed” to recognise in the images, so it only cares about what works.

That’s a trait that’s common to all artificial intelligences, and it’s a quality that makes them both fascinating and sometimes frightening.

By performing human-like tasks in a sometimes very unusual way, they can feel like an insight into a truly alien way of thinking — if they can be said to “think” at all.

Thanks for reading!

The previous essay in this series, “Ten Thousand First Dates: Reinforcement Learning Romance”, is available here.

All the code for this essay is available from my GitHub, here.

The next essay in this series will be published next month.

