Depth Estimation on Camera Images using DenseNets

Depth can be stored as the distance from the camera in meters for each pixel in the image frame.

The figure below shows the depth map for a single RGB image.

The depth map is on the right, where the actual depth has been converted to relative depth using the maximum depth of the room.
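For illustration, this conversion is just a normalization by the maximum scene depth; here is a minimal sketch (the function name and the 10 m default are my own placeholders, not values from the data set):

```python
import numpy as np

def to_relative_depth(depth_m: np.ndarray, max_depth_m: float = 10.0) -> np.ndarray:
    """Convert per-pixel depth in meters to relative depth in [0, 1].

    The 10 m default is only a placeholder; use the maximum depth of the
    scene (e.g. the far wall of the room) or of the sensor's range.
    """
    return np.clip(depth_m / max_depth_m, 0.0, 1.0)
```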

RGB image and its corresponding depth map

Data set

To build a depth estimation model, we need RGB images and corresponding depth information.

Depth information can be collected with low-cost sensors like the Kinect.

For this exercise, I have used the popular NYU v2 depth data set to build a model.

This data set consists of over 400,000 images and their corresponding depth maps.

I used a subset of 50,000 images from the overall data set for the training task.

Model Overview

I read through several papers on depth estimation, and many of them use an encoder-decoder type neural network.

For the depth estimation task, the input to the model is an RGB image and the output is a depth image that either has the same dimensions as the input image or is a scaled-down version with the same aspect ratio.

A standard loss function for this task measures the difference between the actual and predicted depth maps.

This can be an L1 or L2 loss.
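As a rough sketch, the two losses look like this in Keras/TensorFlow (the function names are mine, not from the Dense Depth code, which uses a richer custom loss discussed below):

```python
import tensorflow as tf

def l1_depth_loss(y_true, y_pred):
    # Mean absolute difference between actual and predicted depth maps.
    return tf.reduce_mean(tf.abs(y_true - y_pred))

def l2_depth_loss(y_true, y_pred):
    # Mean squared difference between actual and predicted depth maps.
    return tf.reduce_mean(tf.square(y_true - y_pred))
```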

I decided to work with the Dense Depth model from Alhashim and Wonka.

This model is impressive in its simplicity and its accuracy.

It is easy to understand and relatively fast to train.

It uses image augmentation and a custom loss function to get better results than more complex architectures.

This model uses the powerful DenseNet model with pretrained weights as the encoder.

Dense Depth Model

The encoder presented in the paper is a pretrained DenseNet 169.

The encoder consists of the 4 dense blocks that occur before the fully connected layers in the DenseNet 169 model.
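In Keras this backbone can be pulled in with a couple of lines; the snippet below is a minimal sketch of the idea, not the exact code from the authors' repo:

```python
import tensorflow as tf

# Pretrained DenseNet 169 without its fully connected classification head.
# The 480x640 input shape matches the NYU v2 RGB images used here.
base = tf.keras.applications.DenseNet169(
    include_top=False, weights="imagenet", input_shape=(480, 640, 3)
)

# The output of the final dense block is the encoder bottleneck; outputs of
# earlier dense blocks can be tapped later as skip connections for the decoder.
encoder = tf.keras.Model(inputs=base.input, outputs=base.output)
```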

It is different from other depth models in that it uses a very simple decoder.

Each decoder block consists of a single bilinear upsampling layer followed by two convolution layers.

Following another standard practice in encoder-decoder architectures, the upsampled feature maps are concatenated with the corresponding feature maps from the encoder.
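A decoder block of this shape can be sketched in Keras roughly as follows (the helper name and filter counts are illustrative, not taken from the paper's code):

```python
from tensorflow.keras import layers

def decoder_block(x, skip, filters):
    """Bilinear upsampling, concatenation with an encoder feature map, two convs."""
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    x = layers.Concatenate()([x, skip])  # skip connection from the encoder
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```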

The figure below explains the architecture in more detail.

For more details on layers please read the original paper.

It is very well written!

Encoder-decoder model from the Dense Depth paper

Training and Testing the Depth Estimation Model

Dense Depth was trained on a 50K sample of the NYU v2 data set.

The input was an RGB image of 640×480 resolution and the output was a depth map of 320×240 resolution.

The model was trained in Keras using Adam optimizer.
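The training setup itself is standard Keras; here is a hedged sketch (the learning rate and the plain L1 loss are placeholders, while the actual run used the custom loss from the authors' repo):

```python
import tensorflow as tf

def train_depth_model(model: tf.keras.Model,
                      train_ds: tf.data.Dataset,
                      epochs: int = 8):
    """Compile and fit a depth model with the Adam optimizer.

    `model` should map 640x480 RGB inputs to 320x240 depth outputs and
    `train_ds` should yield (rgb, depth) batches.
    """
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # illustrative value
        loss="mean_absolute_error",
    )
    return model.fit(train_ds, epochs=epochs)
```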

I leveraged the repo from the Dense Depth authors to get started.

Three different model architectures were trained:

1. The original code provides an implementation of a DenseNet 169 encoder. This model was trained for 8 epochs (9 hours on an NVIDIA 1080 Ti).

2. The original code was modified to implement a DenseNet 121 encoder, which has fewer parameters than DenseNet 169. This model was trained for 6 epochs (5 hours on GPU), as the validation loss had stabilized by this point.

3. The original code was modified to implement a ResNet 50 encoder, which has more parameters than DenseNet 169. I experimented with different ways of concatenating the feature maps of the encoder and decoder. This model was trained for 5 epochs (8 hours on GPU), and training was discontinued as the model had started overfitting.

All these code modifications have been pushed to the GitHub repo, with explanations on how to do training, evaluation, and testing.

Evaluating the different models

We compare model performance using three metrics: average relative error (rel) between the predicted and actual depth, root mean square error (rms) between the actual and predicted depth, and average log error (log) between the two depths.

Lower values for all these metrics indicate a stronger model.
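For reference, the three metrics can be computed roughly as follows (a numpy sketch; the original evaluation code should be consulted for the exact log base and for masking of invalid pixels):

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    """rel, rms and log errors between ground-truth and predicted depth maps.

    Both arrays are assumed to contain positive depth values of the same shape.
    """
    rel = np.mean(np.abs(gt - pred) / gt)                      # average relative error
    rms = np.sqrt(np.mean((gt - pred) ** 2))                   # root mean square error
    log_err = np.mean(np.abs(np.log10(gt) - np.log10(pred)))   # average log10 error
    return {"rel": rel, "rms": rms, "log": log_err}
```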

Model comparison

As shown in the table above, the DenseNet 169 model has better performance than DenseNet 121 and ResNet 50.

Also my trained DenseNet 169 is very close in performance to the one from the original authors (Alhashim and Wonka).

Conclusion

We hope this blog proves to be a good starting point to understand how depth estimation works.

We have provided a pipeline for using a powerful, simple, and easy-to-train depth estimation model.

We have also shared code that can be used to get depth images from indoor images or videos you collect.

Hope you try the code out for yourself.

I have my own deep learning consultancy and love to work on interesting problems.

I have helped many startups deploy innovative AI based solutions.

Check us out at http://deeplearninganalytics.org/.

If you have a project that we can collaborate on, then please contact me through my website or at info@deeplearninganalytics.org.

You can also see my other writings at https://medium.com/@priya.dwivedi

References:

Dense Depth original GitHub
Dense Depth paper
NYU v2 data set
