Using Machine Learning and CoreML to control ARKit

Dan Wyszynski, Jan 9

Combining image classification and augmented reality to create new experiences

So far in our AR journey we’ve gone through the basics of putting objects in AR, controlling them via animations, detecting planes and placing items with hit detection, and have explored various ways of visualizing algorithms in 3D space.

Let’s now delve (albeit lightly) into the large and complex world of machine learning.

A popular form of experimenting with image classification has been to recognize hand gestures.

We’re going to take that concept a bit further today.

We’ll begin by training a simple model, importing it to CoreML and the Vision framework to classify a couple of hand poses, and then use that to control a 3D model in ARKit.

We’re going to touch on a single area of interest when building our machine learning model, namely, image classification.

While there are several ways of training models such as TensorFlow, Keras, and even Xcode’s own training interface, we’re going to use a free online service that Microsoft has created called Custom Vision.

It’s easy to get started with, and it lets us prototype ideas quickly without much prep work or code.

Go ahead and create an account on https://customvision.ai and explore the interface.

What we’ll need to do before we create our project in the Custom Vision dashboard is to begin taking pictures of our hand.

We’ll need to take 3 sets of images, one for an open hand, one for a closed fist, and the last being no hands at all (the model works best when it can differentiate between each set).

Microsoft recommends at least 50 images per tag, so we’ll need to take a lot of photos.

Separate your images into 3 different folders (hand_fist, hand_open, hand_none).

Samples of the 3 sets of images used

Common practice is to take our images and downsample them to a lower resolution.

The exported CoreML model expects images at a size of 227×227, so what we’re going to do is take all our images (I ended up with 87 fist, 79 open hand, and 54 no hand images), and scale them down.

All my images were portrait since I will use a portrait orientation in the AR app, so I scaled all the images down to 300×400.

This will make the uploading much quicker, the training faster, and the conversion in the app happen in realtime.

One quick way to resize the images is to select them all, then right-click, and select “Open with Preview.”


Next, select all the images using ⌘-A or Edit->Select All.

Finally, under the Tools menu, select Adjust Size.

Set the width of the images to 300, making sure the Scale Proportionally checkbox is checked.

Then hit Save, and all your images should be resized.

After scaling all the images down, it’s time to create a project in Custom Vision.

You can call it anything you like.

The Project Type should be Classification and the Classification Type should be Multiclass.

Finally, the Domain should be General (compact).

This will output a small model which we can then import into the app.

Once the project is created, we’re ready to begin uploading images.

Hit the + button in the top center frame under Training Images, and select all the images in the hand_fist directory.

Hit “Open” and at the next prompt, add the label that we’re going to use for our model (also hand_fist).

Tagging fist images

Go ahead and hit upload and we’ll then see the images appear with the corresponding tag in the list.

Repeat this step for the hand_open and hand_none images, with one exception for the hand_none images; for this set of images, we will use the Negative tag that is default in the label dropdown.

Once all our images are uploaded, we are ready to train the model.

Hit the green gears button and wait for the process to finish.

Navigate to the Performance tab to see what the results look like.

This is what mine looked like after training.

Prediction scores for the trained model

To test the model, take a few more photos and upload them by clicking the checkmark button next to the green gears.

If everything looks good and you’re getting a good prediction, then we can move forward.

Otherwise, take a dozen or more images of each pose and upload those using the tags we used previously.

Keep testing with new images to refine your model if necessary.

We’re now ready to export our learned model.

Click on the little down arrow icon in the top of the center pane and choose iOS (Core ML).

Then Export and Download the model.

Save it somewhere handy, as we’ll be using it shortly.

Now we’re ready to begin our AR project.

ARKit Project Setup

We’ll begin by using the ARKit Xcode template.

If you’ve followed along in our first tutorial, Getting started with ARKit and SceneKit, then you can skip this setup and go straight to the Scene creation section.

Open Xcode, then begin a new project by selecting File->New->Project or using ⇧⌘N, then choosing the Augmented Reality App template.

New Augmented Reality App

Enter a name for the project, make sure the code signing and bundle name are correct, click Next, then choose a directory to create the project.

With our project created, we can run the app and see the default ship that the sample project displays in AR.

One thing to note is that the ship is placed in a position that is relative to where the camera was facing when the project was started.

Kill the app and run it a couple more times, each time facing a different direction.

The ship always shows in front of the camera, at the same distance.

This is because the ship is positioned -0.8 meters in the scene.

The scene is created when the camera opens and the session begins.

If you look at the ship.scn file, and drill down to the shipMesh object, you’ll see that it is placed at (X: 0.0, Y: 0.1, Z: -0.8).


One thing to note is that the Z-coordinate is negative.

This is due to the fact that negative-Z is pointing forward in the camera axis.

Now that we understand how the sample app works, the next thing we’re going to do is delete the scene assets (the ship) and use our own models.

Let’s make a new group in our project and call it Models.

In here we will put the sphere.dae file that we modified earlier.

Drag and drop our model file and its texture into this group, or Control-click the group name and choose Add Files to Project then choose our files to be added.

SceneKit import

We’re going to create a custom scene in our code, load a Collada file with our model, and put it into our scene.

This allows us to use more than one file if we want to load multiple models into our scene.

It will also give us control over what we add, since many times there are things in the file that we don’t want to import, such as additional cameras, lights or empty objects.

Our goal is to minimize the amount of cleanup we do with the source files in the SceneKit editor.

Let’s check out our model file and make sure the rotations and scale are correct, and all the textures are properly connected.

Select the model file, then open up the Scene Graph View by clicking on the tab button on the bottom left of the scene view.

Select the model and the Node Inspector in the properties panel.

Our oft-used sphere

Scene creation

Now that we have our model file ready to go, we’re going to create a new scene.

This will set us up later on to customize the scene and not have everything in one file.

Create a new Swift file called MainScene.


In the struct we’ll create, we’ll be putting all our scene management code, including some simple lighting and object creation.

Think of it as the controller for the 3D world.
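
As a rough idea of what such a scene controller could look like, here is a minimal sketch. Only the MainScene name comes from this tutorial; the scene property, the addAmbientLight helper, the "ambientLight" node name and the intensity value are assumptions made for illustration.

```swift
import SceneKit

// Sketch of a scene controller. Only the name MainScene comes from the article;
// the property and method names below are assumptions.
struct MainScene {
    let scene: SCNScene

    init() {
        scene = SCNScene()
        addAmbientLight()
    }

    // Adds a soft ambient light so that models are lit from every side.
    private func addAmbientLight() {
        let light = SCNLight()
        light.type = .ambient
        light.intensity = 500  // assumed value; SceneKit's default is 1000

        let lightNode = SCNNode()
        lightNode.name = "ambientLight"
        lightNode.light = light
        scene.rootNode.addChildNode(lightNode)
    }
}
```

With something like this in place, the view controller only has to hand the controller's scene to the ARSCNView's scene property.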

Next, we need to change our ViewController code to use our new scene instead of the default loading from the DAE file.

In our ViewController class, add the following line under the sceneView declaration at the top of the class:

    var sceneController = MainScene()

Next, in viewDidLoad, remove the line that loads the template scene. It begins with:

    let scene = SCNScene(named:

Replace it with our own scene creation code, which assigns the scene built by sceneController to the sceneView.

In our viewWillAppear, make sure we’re setting the sceneView session delegate to self:

    sceneView.session.delegate = self

Model Loading

We’re going to need to load up some models, so we’ll create a convenience function to load up a scene from a model and place its contents into a container that we can then place into our scene.

Since we’re loading nodes, it makes sense to create an extension to SCNNode.
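
One possible shape for such an extension is sketched below. The function name container(fromSceneNamed:inDirectory:) is hypothetical, and skipping nodes that carry a camera or a light is an assumption based on the cleanup goal described earlier, not necessarily what the original utility did.

```swift
import SceneKit

extension SCNNode {
    /// Loads the top-level nodes of a scene file into a single container node.
    /// Returns nil when the file cannot be found or parsed.
    /// (Hypothetical helper; the name and the filtering rule are assumptions.)
    static func container(fromSceneNamed name: String,
                          inDirectory directory: String? = nil) -> SCNNode? {
        guard let scene = SCNScene(named: name, inDirectory: directory, options: nil) else {
            return nil
        }
        let container = SCNNode()
        // childNodes is a copy, so re-parenting while iterating is safe.
        for child in scene.rootNode.childNodes {
            // Skip cameras and lights that came along with the exported file.
            if child.camera != nil || child.light != nil { continue }
            container.addChildNode(child)
        }
        return container
    }
}
```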

Begin by creating a new group in our project called Utilities.

In this group we’ll add a new Swift file called Node+Extensions.swift to hold the extension.

Creating our world

Now that we’ve got the scene creation and object loading utilities in place, let’s load up a model and put it on the screen.

First, we’re going to create a convenience class to wrap SCNNodes.

This will encapsulate all the code to load objects from files.

We’ll base other scene objects on this class, and have the ability to directly manipulate its properties or add new abilities at a base level.

Create a SceneObject.swift file (I added it to a new group called Scene Objects) for this base class.

Next, we’ll create a class for our Sphere model. Here we make use of our handy base class loading routine, which keeps our object class nice and clean.
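
The two classes could be sketched roughly as follows. The class names SceneObject and Sphere and the sphere.dae file come from this tutorial; the initializer label fromSceneNamed: and the choice to fall back to an empty node when the file is missing are assumptions.

```swift
import SceneKit

// Base class: an SCNNode that fills itself with the contents of a model file.
class SceneObject: SCNNode {
    init(fromSceneNamed name: String) {
        super.init()
        // If the file is missing we stay an empty node instead of crashing.
        guard let scene = SCNScene(named: name) else { return }
        for child in scene.rootNode.childNodes {
            addChildNode(child)
        }
    }

    required init?(coder: NSCoder) {
        super.init(coder: coder)
    }
}

// The Sphere only has to say which file it is loaded from.
final class Sphere: SceneObject {
    init() {
        super.init(fromSceneNamed: "sphere.dae")
    }

    required init?(coder: NSCoder) {
        super.init(coder: coder)
    }
}
```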

We’re going to need a method for placing our Sphere into our scene, and at a specific position.

Let’s create a function in our MainScene that does exactly this.

If you’ve followed the other tutorials, you’ll know that I like to jazz things up a bit.

One of my favorite things to do is to have a fun pop-in animation as a reveal.

We use this same reveal at Nike in our SNKRS Cam experience and it’s a great effect.

In our MainScene, we’re going to create a timing function that we’ll pass to a scale-up animation.

This will let us control the timing curve when the scaling reveal happens.

Next, we’ll create our addSphere function, passing in a node that will act as a parent, and the position (relative to that parent node).

Note that we create our Sphere and immediately scale it to 0.01 on all axes.

We then apply our reveal by creating a scale action, apply our newly created timing function, and run it.
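
The reveal described above can be sketched like this. The elastic ease-out curve, its 0.3 period, the 1.5 second duration and the "reveal" key are illustrative assumptions, and a plain SCNSphere stands in for the Sphere model class so the snippet is self-contained.

```swift
import SceneKit

// Elastic ease-out: starts at 0, overshoots, and settles very close to 1.
func easeOutElastic(_ t: Float) -> Float {
    let period: Float = 0.3
    return pow(2.0, -10.0 * t) * sin((t - period / 4.0) * (2.0 * Float.pi) / period) + 1.0
}

// Adds a node to `parent` at `position` and scales it up with the curve above.
@discardableResult
func addSphere(to parent: SCNNode, at position: SCNVector3) -> SCNNode {
    // Stand-in geometry; the tutorial uses its own Sphere class here.
    let sphere = SCNNode(geometry: SCNSphere(radius: 0.1))
    sphere.name = "Sphere"
    sphere.position = position

    // Remember the final scale, then start almost invisible.
    let finalScale = CGFloat(sphere.scale.x)
    sphere.scale = SCNVector3(0.01, 0.01, 0.01)

    let reveal = SCNAction.scale(to: finalScale, duration: 1.5)
    reveal.timingMode = .linear
    reveal.timingFunction = easeOutElastic

    parent.addChildNode(sphere)
    sphere.runAction(reveal, forKey: "reveal")
    return sphere
}
```

Keeping the timing mode linear lets the custom timing function define the whole curve on its own.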

Now that we have our model sorted out, let’s make a couple of changes in our ViewController to be able to place the model on a tap of the screen instead of having the model appear whenever the camera is initiated like it does in the template.

In our ViewController’s viewDidLoad method, after setting the sceneView’s scene, add a tap gesture recognizer.

Next, we’ll add the didTapScreen method.

This simply takes the identity transform, offsets it by 5 meters (ARKit’s unit of measure) along the z-axis, and uses a matrix product (matrix multiplication) with the camera transform to combine the two.

This has the effect of getting a transform that is 5 meters away from the camera.

We pass that position to the addSphere method and when we next run the app, we should be able to place our robot spheres at the tap of the screen.
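
The transform math on its own is small enough to sketch with simd. The function names below are hypothetical; in the app the camera transform would come from the current ARFrame's camera.

```swift
import simd

// Returns a transform `distance` meters in front of the camera.
// Negative z points forward, so we translate by -distance.
func transformInFront(of cameraTransform: simd_float4x4, distance: Float) -> simd_float4x4 {
    var translation = matrix_identity_float4x4
    translation.columns.3.z = -distance
    // Camera transform first, then the local offset.
    return cameraTransform * translation
}

// Extracts the position (the last column) from a transform.
func position(of transform: simd_float4x4) -> SIMD3<Float> {
    let column = transform.columns.3
    return SIMD3<Float>(column.x, column.y, column.z)
}
```

Because the offset is applied in the camera's local space, the result follows wherever the device is pointing at the moment of the tap.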

Place our spheres wherever we want

Now that we have the AR piece sorted out, let’s begin integrating CoreML into our project.

We have a Vision

First, we’ll create a new group in our project and call it CoreML.

Into that group we’ll import our previously created CoreML model.

To be able to pass images into the model for classification, we’ll need to use the Vision framework.

Go ahead and add an import Vision statement in our ViewController.

Next, we’ll add a few definitions to our ViewController.

We’ll need to do a few things: Create an instance of our hand gestures model, create a queue where we’ll asynchronously run our Vision requests, and set up a repeating loop for CoreML to grab images from the camera to process through Vision.

Let’s create a function to initialize our CoreML model and set up the callback function for our classification requests.
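
A minimal sketch of that setup is shown below. The queue label and the function name makeClassificationRequest are assumptions; the MLModel passed in would be the model property of the class Xcode generates for the imported .mlmodel file, and the center-crop option is one reasonable choice rather than a requirement.

```swift
import CoreML
import Vision

// Serial queue on which the Vision requests are performed (label is an assumption).
let visionQueue = DispatchQueue(label: "com.example.vision-queue")

// Wraps a Core ML model in a Vision request and registers the callback
// that receives the classification results.
func makeClassificationRequest(for model: MLModel,
                               completionHandler: @escaping VNRequestCompletionHandler) throws -> VNCoreMLRequest {
    let visionModel = try VNCoreMLModel(for: model)
    let request = VNCoreMLRequest(model: visionModel, completionHandler: completionHandler)
    // Let Vision crop and scale the camera image to the model's input size.
    request.imageCropAndScaleOption = .centerCrop
    return request
}
```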

We need a function to take in the current camera frame and pass it off to Vision to perform the CoreML classification.

If we weren’t using ARKit in our project, we would need to do the usual creation of a camera loop using AVFoundation, but since ARKit is giving us a live feed of the world, we can make use of this handy feature.

A couple of things to note here.

I get the current device orientation, and I use that information to pass it along to the vision request as part of the handler call.

This gives Vision a hint as to how the incoming image is oriented to be able to make a fair try at classifying it.

I ran into some trouble here, as I originally tried to rotate, scale and crop the incoming ARKit camera image into the 227×227 image that the CoreML model is expecting.

I believe it has something to do with the capturedImage frame buffer: trying to modify it by creating a CoreImage instance does not seem to create a proper copy.

In any case, setting the classificationRequest cropAndScale option above, and passing the orientation hint to Vision here seems to work well enough, but this is something to keep in mind.

The getImagePropertyOrientation method is an extension on UIDeviceOrientation.
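
Here is a sketch of what that extension and the request call could look like. The orientation mapping assumes the back camera, and the classify function name is hypothetical; in the app the pixel buffer would be the capturedImage of the current ARKit frame.

```swift
import UIKit
import ImageIO
import Vision

extension UIDeviceOrientation {
    // Maps the device orientation to the orientation of the camera image
    // (assumes the back camera, whose sensor is mounted in landscape).
    func getImagePropertyOrientation() -> CGImagePropertyOrientation {
        switch self {
        case .portrait, .faceUp: return .right
        case .portraitUpsideDown, .faceDown: return .left
        case .landscapeLeft: return .up
        case .landscapeRight: return .down
        default: return .right
        }
    }
}

// Runs the given requests against one camera frame.
func classify(pixelBuffer: CVPixelBuffer,
              deviceOrientation: UIDeviceOrientation,
              requests: [VNRequest]) {
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: deviceOrientation.getImagePropertyOrientation(),
                                        options: [:])
    do {
        try handler.perform(requests)
    } catch {
        print("Vision request failed: \(error)")
    }
}
```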

We want to keep making classification requests, so we’ll create a repeating timer to call our loopCoreML method.

After the visionRequests variable definition, add a new one for our timer:

    private var timer = Timer()

Next, we’ll create a function to update our CoreML classifier. We’ll call this method via the timer we just created.

In our ViewController’s viewWillAppear, we’ll call our setupCoreML method, and following that, set up the repeating timer.

This is what the function looks like.

We’ll call our loop every 0.1 seconds.

That should be enough to get quick results and not bog down the system.
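
The repeating timer can be sketched independently of the view controller. The ClassificationLoop type is hypothetical; in the tutorial the timer lives directly in the ViewController and its tick would call the loopCoreML method.

```swift
import Foundation

// Small wrapper around a repeating timer (hypothetical helper type).
final class ClassificationLoop {
    private var timer: Timer?

    // Starts calling `tick` every `interval` seconds on the current run loop.
    func start(interval: TimeInterval = 0.1, tick: @escaping () -> Void) {
        timer?.invalidate()
        timer = Timer.scheduledTimer(withTimeInterval: interval, repeats: true) { _ in
            tick()
        }
    }

    // Stops the loop, for example when the view disappears.
    func stop() {
        timer?.invalidate()
        timer = nil
    }
}
```

Invalidating the timer when the view goes away keeps classification requests from piling up in the background.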

We’re almost there.

We have initialized CoreML, put our CoreML loop into place, and started making Vision requests. All that’s left to do is to write the callback function that gets called when a classification is made.

Still in our ViewController, let’s create our callback.

Note that for this first pass, we’ll print some debug info.

This helps us understand what is being classified (or not), and what its prediction confidence is.

We’ll take the first three items that are observed in the model (we only have three), format them into a readable string, then base our logic from the top prediction.

Run the code and begin making fists and open hand poses in front of the camera.

Check your debug output to see how the model is functioning with your data set.

In my own testing, how accurately Vision classified a pose depended largely on what background contents were in view.

That is why I set the score threshold to be extremely high at 0.95 (95% confidence).

You may need to lower this value greatly to get a valid prediction that you can use in this next part.
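
The decision logic of the callback can be sketched without Vision at all. The Prediction type and both function names are hypothetical; in the app, the label and confidence would come from the identifier and confidence properties of each VNClassificationObservation in the request's results.

```swift
import Foundation

// One classification result: a tag name and its confidence between 0 and 1.
struct Prediction {
    let label: String
    let confidence: Float
}

// Formats the first three predictions into a readable debug string.
func debugSummary(of predictions: [Prediction]) -> String {
    return predictions.prefix(3).map { prediction -> String in
        let score = String(format: "%.2f", prediction.confidence)
        return prediction.label + " " + score
    }.joined(separator: "\n")
}

// Returns the most confident label, or nil if it does not reach the threshold.
func topLabel(in predictions: [Prediction], threshold: Float) -> String? {
    guard let top = predictions.max(by: { $0.confidence < $1.confidence }),
          top.confidence >= threshold else {
        return nil
    }
    return top.label
}
```

Returning nil below the threshold makes it easy to ignore uncertain frames instead of acting on a guess.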

Once we are satisfied with our results, we need to use the captured information from our classification to drive some action in our scene.

Let’s make our gestures start or stop our Sphere spinning.

In our Sphere file, add the following declaration and methods.

This will make the sphere rotate about the Y axis.
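
A sketch of those additions follows, written as a standalone SCNNode subclass so it compiles on its own. The class name SpinningNode, the "spin" key and the 5 second duration are assumptions; in the tutorial the property and methods go into the Sphere class.

```swift
import SceneKit

final class SpinningNode: SCNNode {
    // Tracks whether the rotation is currently running.
    private(set) var animating = false

    // Starts an endless rotation about the Y axis (one full turn per 5 seconds).
    func animate() {
        guard !animating else { return }
        animating = true
        let fullTurn = SCNAction.rotateBy(x: 0, y: CGFloat.pi * 2, z: 0, duration: 5.0)
        runAction(SCNAction.repeatForever(fullTurn), forKey: "spin")
    }

    // Stops the rotation and leaves the node at its current angle.
    func stopAnimating() {
        removeAction(forKey: "spin")
        animating = false
    }
}
```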

Next, let’s modify our classification callback in our ViewController.

You can remove or comment out the debug prints at this point.

After the topPredictionScore check add the following.

Don’t forget to change the minimum threshold to a value that makes sense for your data set.

Also make note of the prediction names used.

You may need to change these to whatever tags you used when training your model.
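
The mapping from tags to actions, and the node lookup, could be sketched like this. Which gesture starts or stops the spin is an assumption here, as are the GestureCommand type and the function names; the hand_fist and hand_open tags match the folder names used during training.

```swift
import SceneKit

enum GestureCommand {
    case startSpinning
    case stopSpinning
}

// Translates a prediction tag into a command; unknown tags are ignored.
func command(forTag tag: String) -> GestureCommand? {
    switch tag {
    case "hand_fist": return .startSpinning  // assumed mapping
    case "hand_open": return .stopSpinning   // assumed mapping
    default: return nil
    }
}

// Finds the node to act on. Only the first node with this name is returned.
func firstNode(named name: String, under root: SCNNode) -> SCNNode? {
    return root.childNode(withName: name, recursively: true)
}
```

Since the lookup is by name, several nodes sharing one name all resolve to the same first match.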

Give that a run.

Tap on the screen to get a sphere placed in the world, and then put your hand in front of the camera and begin alternating with fists or open hands.

Your sphere should animate and stop at your command.

You’re a wizard now, Harry!

Magic hands

Extra Credit

You may have noticed one thing in our app.

If we have multiple spheres in our scene, only the first one responds to our gestures.

We can do better than that.

Let’s figure out which one we may be looking at, and then act upon that particular sphere with our gestures.

In our ViewController, we’re going to make some modifications to the func renderer(_ renderer: SCNSceneRenderer, updateAtTime time: TimeInterval) method.

First, let’s create a variable to hold a reference to any object which may be centered in our screen.

At the top of the class, add:

    var centeredNode: SCNNode?

Next, we’ll change the updateAtTime method.

Add the following code at the end of the function.

What we do here is find the screen coordinates of any object visible within the camera view (view frustum in 3D terms), check whether or not they are in a center area defined by our screen width and height and a square of a certain size, and then set a centeredNode variable with that sphere.
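
A sketch of that check is below. Both function names, the default node name "Sphere" and the 100 point box size are assumptions; the view size is passed in because the renderer callback does not run on the main thread.

```swift
import SceneKit

// True when a screen point falls inside a square centered in the view.
func isCentered(_ point: CGPoint, in viewSize: CGSize, boxSize: CGFloat) -> Bool {
    let box = CGRect(x: (viewSize.width - boxSize) / 2,
                     y: (viewSize.height - boxSize) / 2,
                     width: boxSize,
                     height: boxSize)
    return box.contains(point)
}

// Returns the first candidate that is visible and projects into the center box.
func centeredNode(among candidates: [SCNNode],
                  renderer: SCNSceneRenderer,
                  viewSize: CGSize,
                  boxSize: CGFloat = 100,
                  named name: String = "Sphere") -> SCNNode? {
    guard let pointOfView = renderer.pointOfView else { return nil }
    for node in candidates where node.name == name {
        // Skip anything outside the camera's view frustum.
        guard renderer.isNode(node, insideFrustumOf: pointOfView) else { continue }
        let projected = renderer.projectPoint(node.worldPosition)
        let screenPoint = CGPoint(x: CGFloat(projected.x), y: CGFloat(projected.y))
        if isCentered(screenPoint, in: viewSize, boxSize: boxSize) {
            return node
        }
    }
    return nil
}
```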

Next, we’ll make use of the centered node instead of looking for the sphere.

In our classificationCompleteHandler, we’re going to change the line that begins with guard let childNode = so that it takes its node from centeredNode instead of looking the sphere up in the scene.

Now, run the app again, place a few spheres around you, and you’ll be able to control each one individually with the hand gestures.

Now you’re truly in control of the world!

Extra Credit, Part Deux

Starting and stopping the rotation is great, but what if you could expand on that and do something a bit more?

Let’s take some photos of our hand doing left and right pointing gestures.

Make sure you take the same amount of pictures as you did for the fist and hand open gestures.

Upload them to your Custom Vision project and re-export your model.

In our Sphere class, we’ll rename the current animate function to animateRight and create a new method called animateLeft.

We’ll also change the direction of the rotation for the animateLeft call.

This is what our animation functions look like now in our Sphere class.

In our ViewController, where we get the prediction, change the code to the following.

Be sure to use the same tag names that you used when uploading your new photos.
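
The two directions differ only in the sign of the Y rotation, which the sketch below makes explicit. The SpinDirection type, the function names and the hand_left and hand_right tags are hypothetical; substitute the tags you used for your new photos.

```swift
import SceneKit

enum SpinDirection {
    case left
    case right

    // A positive turn about Y moves the front of the model toward the right.
    var fullTurn: CGFloat {
        switch self {
        case .right: return CGFloat.pi * 2
        case .left: return -CGFloat.pi * 2
        }
    }
}

// Builds an endless rotation in the requested direction.
func spinAction(_ direction: SpinDirection, duration: TimeInterval = 5.0) -> SCNAction {
    let turn = SCNAction.rotateBy(x: 0, y: direction.fullTurn, z: 0, duration: duration)
    return SCNAction.repeatForever(turn)
}

// Maps the (assumed) pointing tags to a direction.
func direction(forTag tag: String) -> SpinDirection? {
    switch tag {
    case "hand_left": return .left
    case "hand_right": return .right
    default: return nil
    }
}
```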

Run the code again and suddenly you are a Jedi Master.

Use the force…

I hope this tutorial inspires everyone to dip their toes into the world of machine learning and beyond.

If you have any questions, don’t hesitate to reach out or write in the comments below.

Stay tuned for more fun coming soon!

Don’t forget to check out the s23NYC: Engineering blog, where a lot of great content by Nike’s Digital Innovation team gets posted.

