Uncovering Computer Vision with Samyak Datta

Editorial, 1st August 2017

With Computer Vision, Machine Learning and Neural Nets becoming some of the most discussed and highly researched topics in Computer Science, we were lucky to have a conversation with Samyak Datta, the author of the book “Learning OpenCV 3 Application Development”. Samyak completed his Bachelor’s and Master’s degrees in CS at IIT Roorkee and will be joining the Computer Science Department at Georgia Tech to pursue his Ph.D. this fall. Here are the questions we asked him, which together provide a bird’s-eye view of the entire field.

At a very abstract level, Computer Vision (CV) deals with teaching machines how to “see” the way humans do (in fact, this is the ultimate goal of Computer Vision as a field). It’s a specialization area within Machine Learning (ML) where we teach computers to analyse the contents of images and videos and make meaningful inferences out of them. For example, with the help of CV algorithms, computers can be taught to identify objects in images, classify images into categories (“indoors”, “scenery”, “buildings”, “cars” etc.), recognize the faces of people appearing in photographs and so on.

As I mentioned, CV is a specialization within the broader field of ML. If you look within ML, you will find several “classes” of learning algorithms such as Bayesian Learning, Variational Inference, Statistical Learning Theory (good ol’ SVMs) etc. In recent years (post-2012), one particular class of algorithms has witnessed a resurgence in popularity and ubiquity: Neural Networks. Deep Learning is neural networks on steroids! The networks became “deeper” (i.e. they had more hidden layers), the number of parameters exploded into the millions and the machines that they were trained on became faster (GPUs). Interestingly, if you trace the history of this field, you’ll find that all the pieces of the puzzle were developed long before 2012 by groups led by researchers such as Yann LeCun, Geoff Hinton and Yoshua Bengio. From 2012 onwards, deep learning started gaining traction within the research community due to the availability of large datasets, faster and cheaper computing resources and some other factors. The motto of the Greyjoys aptly captures the history of Deep Learning in a single sentence: “What is dead may never die, but rises again harder and stronger”!

I think having some preliminary exposure to Machine Learning helps. I did a course on ML (EE Dept.) during my final year that had some topics on Neural Networks, so that helped me get familiar with some neural network jargon. I was fortunate enough to attend a week-long summer school (https://cvit.iiit.ac.in/summerschool/) at CVIT, IIIT-H right at the beginning of my deep learning journey, which gave me a very good bird’s-eye view of the entire research space in and around deep learning.

Apart from that, Stanford’s CS231n course by Andrej Karpathy is an excellent starting point for Deep Learning basics. Almost all the course content (assignments, video lectures, notes etc.) is available online. After doing that course, I dived directly into a research project. So, most of the hands-on, practical knowledge came as a result of working on the project.

About a year and a half back, there was a multitude of libraries floating around (Caffe, MatConvNet, Torch7, Keras, Theano, TensorFlow) and selecting/sticking to one was a real challenge. Fortunately, things are converging now and I can see the DL-library space being dominated by two major players at the moment: TensorFlow (by Google) and PyTorch (by Facebook). I have mostly been working with Torch7, which is the Lua-based predecessor of PyTorch. I have started migrating to PyTorch recently (yet to recover from the very strong Torch7 hangover though!).

There are a couple of key processes that underpin the functioning of any DL library: the forward pass and back-propagation (the flow of gradients). If you have a decent understanding of how these two happen within the library of your choice, then things become straightforward.
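To make the two processes concrete, here is a minimal sketch in plain Python (no DL library involved). It assumes a single linear neuron trained with a squared-error loss on made-up data; the function names `forward`, `backward` and `train` are hypothetical and chosen only for illustration. Real libraries automate exactly these steps for much larger networks.

```python
# Toy model: y_hat = w * x + b, loss = 0.5 * (y_hat - y) ** 2.
# All names and data here are illustrative assumptions.

def forward(w, b, x):
    """Forward pass: compute the prediction from the input and parameters."""
    return w * x + b

def backward(w, b, x, y):
    """Backward pass: gradients of the loss with respect to w and b."""
    y_hat = forward(w, b, x)
    error = y_hat - y      # dL/dy_hat
    grad_w = error * x     # chain rule: dL/dw = dL/dy_hat * dy_hat/dw
    grad_b = error         # chain rule: dL/db = dL/dy_hat * 1
    return grad_w, grad_b

def train(data, lr=0.05, epochs=500):
    """Plain stochastic gradient descent over the dataset."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            grad_w, grad_b = backward(w, b, x, y)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Made-up data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(data)
print(w, b)  # should end up close to 2 and 1
```

The forward pass produces a prediction; the backward pass applies the chain rule to push the error back to each parameter. A deep network repeats the same pattern layer by layer.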

I think there is some logical progression of steps that you have to go through when you start tinkering with a new library. You start off by loading and working with pre-trained networks, then you would probably move on to assemble small, toy networks for “Hello World” problems such as classifying MNIST before moving on to train bigger networks from scratch using perhaps your own datasets. As a final step, you can probably try to build arbitrary network architectures and loss functions and implement some recent, state-of-the-art papers — that is when you know you are comfortable with nuances of the library.

OpenCV is a very popular, open source library for image processing and computer vision applications. The library has been implemented in C++, and you can use OpenCV’s API with popular programming languages such as C, C++, Python, and Java. It is incredibly fast and has a lot of hardware-level optimizations built in as well. I have used OpenCV extensively for my projects in both industry and academia: deploying OpenCV-based applications on production servers as well as writing OpenCV scripts for my research projects.

I think OpenCV is an excellent resource for beginners as it provides you with nice, clean and efficient implementations of most of the popular image processing, computer vision and machine learning routines. For example, face detection (using the Viola-Jones algorithm) is a couple of lines of code and runs in real-time. It has a very active community of developers and hosts well-written documentation for beginners. You can also take a look at my book, Learning OpenCV 3 Application Development, which has been specifically written for beginners in OpenCV/C++.
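As a rough illustration of the face detection claim, the sketch below uses OpenCV's Python bindings rather than the C++ API covered in the book. It assumes the `opencv-python` package is installed and that its bundled `haarcascade_frontalface_default.xml` file is available under `cv2.data.haarcascades`; the function name `detect_faces` and its default parameter values are hypothetical choices, not recommendations from the interview.

```python
def detect_faces(image_path, scale_factor=1.1, min_neighbors=5):
    """Return a list of (x, y, w, h) boxes for faces found in an image."""
    import cv2  # imported lazily; assumes the opencv-python package

    # Pre-trained Haar cascade (Viola-Jones style detector), assumed to ship
    # with the installed OpenCV package.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # The cascade operates on single-channel (grayscale) images.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(
        gray, scaleFactor=scale_factor, minNeighbors=min_neighbors
    )
    return [tuple(int(v) for v in box) for box in faces]
```

Calling `detect_faces("photo.jpg")` (with `photo.jpg` standing in for any image on disk) would return one bounding box per detected face. The core of it really is the two calls to `CascadeClassifier` and `detectMultiScale`.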

Excellent question! Around 2012, following the success of the seminal AlexNet paper, the Computer Vision community was quick to jump on the bandwagon and apply supervised, deep networks to all possible problem spaces within Vision: image classification, object detection, semantic segmentation, face recognition and many more. All the top Vision conferences/journals were flooded with Deep Learning papers as benchmarks were shattered and accuracies soared higher.

Over the last couple of years, within the space of DL+Vision, the focus has been shifting towards unsupervised learning approaches for a variety of reasons. First, supervised learning for training deep networks needs a lot of (manually) annotated data, which is hard to obtain and harder to scale. Second, unsupervised learning is much closer to how humans develop their visual learning/reasoning abilities (think about how a human baby learns about visual concepts such as object permanence and develops a model of plausibility of the world around him/her). Third, it hits the sweet spot between large datasets and weak labels, which are nice characteristics to have in DL. So, unsupervised learning is touted to play a key role in the development of Artificial General Intelligence (AGI). The north star here is to have a learning system watch thousands of hours of video (readily available, courtesy of YouTube) and automatically learn a model of the world around us.

Note: This is not to say that the community is forsaking supervised approaches altogether. There was a paper uploaded on arXiv recently (link) where they explored the impact of large datasets (300-freaking-million images) on vision tasks.

Another thread of development is moving towards generative models, which try to approximate the probability distribution that the data is drawn from. While discriminative learning techniques are equated to “intelligence”, generative models can be thought of as being equivalent to “creativity” for machines. So, we are also seeing deep net architectures such as GANs and VAEs hallucinating images of everything from cats and human faces to living rooms.
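The idea of approximating the data distribution and then sampling from it can be shown with a deliberately tiny stand-in. The sketch below is not a GAN or a VAE: it assumes one-dimensional data and fits a single Gaussian to it, which is about the simplest generative model there is. The names `fit_gaussian` and `generate`, and the synthetic data, are hypothetical.

```python
import random
import statistics

def fit_gaussian(samples):
    """'Training': estimate the distribution the data was drawn from."""
    return statistics.fmean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng):
    """'Generation': draw brand-new samples from the learned distribution."""
    return [rng.gauss(mean, std) for _ in range(n)]

rng = random.Random(0)

# Synthetic "observed" data; the true mean (5.0) and standard deviation (2.0)
# are hidden from the model and only used to create the dataset.
observed = [rng.gauss(5.0, 2.0) for _ in range(5000)]

mean, std = fit_gaussian(observed)
new_samples = generate(mean, std, 5, rng)
print(mean, std, new_samples)
```

Deep generative models follow the same two-step pattern, except that the distribution lives over images with millions of dimensions and is represented by a neural network instead of two numbers.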

Some other fascinating topics that I am seeing at the moment are related to learning the architecture of the deep network itself for a given task (Convolutional Neural Fabrics) and whether a single architecture can be trained for learning multiple tasks (One Model To Learn Them All).

If only I had this foresight and prescience, a PhD would be a walk in the park for me! I am in the process of learning to pick up the pieces and connect the dots in the hope that any predictions regarding the future that I make do not turn out to be completely wrong 😛

Having said that, I can dare to make some preliminary inferences from what I’ve observed and gathered by attending talks during my very brief gig as a researcher. Historically, the CV community has always progressed towards finer and finer forms of visual understanding. So, starting from classification (“Is there a dog?”), it moved to localization (“Where is the dog?”) and semantic segmentation (“Which pixels belong to the dog?”) and finally to more recent forms of fine-grained classification (“What breed of dog?”). In a similar spirit, the more recent advances in CV have been concerned with abstract/subjective forms of visual inference. For example, predicting the humour content in images (link) or trying to figure out which of the many persons in a picture are important by, say, social status (link).

I think such research problems will also find acceptance in industry, given the recent developments that I’m seeing in AR/VR-based consumer gadgets. I would love to read more about tangential, non-technical topics such as ethics in AI or how social/cultural factors and biases permeate the algorithms and systems that we build. For example, there were studies done recently that uncovered the impact of racial bias in face recognition algorithms (link). As AI starts to become a more intimate part of our lives, I feel that such issues will need to be addressed.

You mean apart from grants, funds, stipends, paper deadlines and the ever-increasing house rents? Jokes apart, there are several open research problems in computer vision (and ML, in general). We are still a very, very long way from reaching anywhere close to Artificial General Intelligence (AGI), which has been the dream goal of every CV/ML/AI researcher.

Just to throw in some examples, interpretability of deep learning models is one such big challenge. There have been some papers in this space which try to visualize what the networks that we train actually learn, or which region(s) of the image the network is looking at while making certain decisions. However, a lot remains to be understood.

At a philosophical level, it is interesting to note that centuries ago, the Newtons and Einsteins were able to condense the explanations of complex physical phenomena into a set of compact and elegant equations. Unfortunately, this is no longer the case even with the simplest deep networks that we train today. The models that beat the state of the art or win Kaggle competitions are ungodly ensembles of deep networks that have very limited practical utility. One of the major challenges would be to fit these models into memory so that they are able to run on, say, our smartphones.
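A back-of-the-envelope calculation shows why memory is a concern. The sketch below assumes 32-bit (4-byte) floating point weights, an illustrative network of 60 million parameters and a hypothetical ensemble of five such networks; none of these figures come from the interview, and the helper name `model_size_mb` is made up.

```python
def model_size_mb(num_parameters, bytes_per_parameter=4):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1e6

# Illustrative assumption: a single network with 60 million parameters.
single = model_size_mb(60_000_000)

# Illustrative assumption: an ensemble of five such networks.
ensemble = 5 * single

print(single, ensemble)  # 240.0 1200.0
```

This counts only the stored weights; the activations computed during the forward pass need additional memory on top of that.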

A correction: I worked in the industry (Media.net, Directi) for a year (2015-2016), and then I shifted to CVIT Lab, IIIT-H as an RA.

My first taste of research was during the course of working towards my M.Tech thesis at IITR. I did enjoy several aspects of the process and that was probably the first time that I started considering a PhD as a viable option. At the time, I already had a job offer from Directi (Media.net). I had interned with them the previous summer and liked the work there as well. I was at a professional crossroads of sorts. I ended up joining Media.net as a Software Engineer with the intention of seeing both sides of the coin before taking a final plunge.

Although the work at Media.net was good, it wasn’t very strongly aligned with the kind of problems that I was passionate about and hence I couldn’t see myself doing that long-term. I left my job at Media.net after a year and joined CVIT, IIIT-H as an RA for a year. The nature and style of academic work was something that really resonated with me (is there anything more exciting in life than seeking answers!). I loved the freedom to fail and to take full accountability for my projects, and I took inspiration from the manner in which research embraces uncertainty. In between Media.net and CVIT, my book happened and I realised that I like teaching, writing and exposition in general. These are things that are in abundance in academia.

One of the best things that happened as a result of this transition is that it shattered a commonly held myth regarding research that was festering inside me. There is this romantic idea of research being regular, poetic epiphanies from God. Nothing could be further from the truth! What you ultimately realise is that research is persistent, incremental and skilled intellectual labour sprinkled with the occasional “not-so-bad” insights over a period of several years.

A Master’s vs. a PhD is an important decision. I think if you are confident that you want to pursue research as a career (either in academia or in industrial research labs), then it makes sense to go directly for a PhD.

If you want to test the waters, or if you are changing specializations after your Bachelor’s, then a Master’s may be the right choice. However, do keep in mind that (most, if not all) MS programs have a significant course load. So, there is a limit to the time and energy that you can devote to research.

On a related note, doing an RA-ship in a research lab for about a year after your B.Tech is also an excellent way to improve your chances and gain valuable research experience. As CS grad school applications get more competitive, this is becoming a very common practice.

The only advice that I can offer is to do good projects in your area of interest. If you have made up your mind to pursue research, then industry internships/projects hold much less value than academic research projects (bonus points if the research has been done under the supervision of someone famous in your area).

Another practice that I am in favor of is to maintain a good GitHub profile and release code for the projects that you do. In general, making things accessible online, whether it is code, reports or other manuscripts related to your project, is a good thing to do for a couple of reasons — (1) it makes your research accessible to and hence reproducible for others, and (2) it compels you to not do mediocre and substandard work.

To consolidate all of this, you can also create a website that hosts your bio and other project-related information. It is super-easy to create professional looking academic websites using Jekyll. In today’s day and age, if you cannot set up a simple web-page, then you should probably reconsider your decision to do graduate studies in CS 😛

My research (to date) has mostly focussed on using deep learning in the very important domain of human faces. I have worked on both the fundamental aspects of the problem domain (learning discriminative face representations using deep networks) and some applied aspects as well (gender and emotion classification, large-scale face retrieval etc.).

I am really excited at the prospect of exploring other new problem domains during the tenure of my PhD. The lab that I’ll be joining at GaTech has been doing some really cutting-edge work in areas that are at the intersection of CV and NLP.

Coming to the challenges, there are always several research challenges that you have to navigate when you are working on a project. It can be a little demoralizing when you are not able to get favorable results for days, or when the answer to a problem eludes you for a long time. Apart from these, when you are dealing with large-scale datasets (million-scale), even as researchers, we have to deal with engineering challenges (in addition to research problems) on a day-to-day basis.

I got an email from Packt Publishers one day in which they enquired about my interest in authoring a book on OpenCV/C++. Since this was at the intersection of two of my passions, Computer Vision and writing, I was more than happy to accept. Also, writing a full-fledged book was something way out of my comfort zone at the time. I had prior experience in writing blog posts, and the closest that I had been to the scale of a full book was my Master’s thesis (which is less than one-third the size of my book). So, I drew a certain amount of inspiration from the challenge associated with the project. One of the major challenges while writing was to explain technical concepts while simultaneously not losing sight of the big picture. Moreover, if a book is targeted at beginners (as is the case with mine), it becomes an incredibly potent tool that can shape the perspectives of a complete novice in the subject matter. After making sure all the diagrams/flowcharts were proper, going through the PDFs to see how the chapters would look in the final version and uploading a bundle of code/software, the book was published and is available for order (both the e-book and the print version) from the publisher’s website.
