Prof. Subhasis Chaudhuri, Director, IIT Bombay.


The purpose of this discussion is not to give you a solution but to provide more of a thought process regarding where we are stuck and what is required in terms of the analysis that should be carried forward.

When it comes to pattern recognition, one common example that many people quote is whether a system can recognize a person, or tell humans apart from everything else. This was a classical problem even before computer science, because it is one of the most fundamental capabilities we have, inherent in human evolution and even in the process of growing up.

So, the overall building block is something we call features. It could be our height, the length of our nose, the color of our skin, and so on. On one hand there is the data that is given, and on the other hand we have a computing engine that can learn to compute based on this data. This is typically what we call a classification problem. It requires training: the given data is labelled as a human being, a monkey, and so on, and the computer learns from those labelled examples.

Focus on Features

In the earlier days, this kind of training was performed using explicitly designed algorithms. Now it is based on neural networks and a machine learning perspective. In this approach we don’t talk about the learning algorithm; we simply talk about a network architecture. This is the overall framework, and we would like to focus particularly on features. Features cannot be discussed in isolation; they only make sense with respect to a pattern recognition task.

There has been tremendous advancement in the last two decades in how to minimize a cost function. The optimization field has done very well: almost any kind of cost function can be minimized, and the choice of what should be learned and of the appropriate network architecture can be made at the time of learning. That choice plays the role of what, in the classical form, was the algorithm.

Since we have been working mostly on image processing, I will restrict the domain to image processing, although many of the ideas translate to other data streams. Also, when we train, we typically do not feed in the image directly. We extract certain parameters, which are some function of the image, and feed those in. The question is what that function should be, and this is the focus of this article.

To make this concrete, let me go back to the classical Mahabharata. As the story goes, when Dronacharya was training his royal disciples in archery, he asked them to describe what they saw, and only Arjuna is said to have uttered the now popular line, “I see only the eyes of the bird and nothing else.” This story is repeated time and again as proof of how one must focus on the target. So it is in the case of a machine: we focus on what we want to see, the target. When we look at an object visually, we try to focus; we make certain accommodations, such as moving our head, to see the object clearly. Now the question is what that process would be in automation. How does the machine know what is to be done? The target must be mapped into the machine’s internal system in a way that allows it to focus on the target.

Let’s go back more than 60 years. Look at the classic work of Hubel and Wiesel, which won them the Nobel Prize. They showed a lot of pictures to cats and monkeys and studied what was happening in the visual cortex, because that is where the brain maps what it is seeing. They found, surprisingly, that a picture fires a specific set of cells that are orientation-selective, and concluded that the cortex picks out the structures in the picture that are oriented along different directions. So it is like a matched filter in communication, but with filters oriented in specific directions, and depending on this selectivity the brain tries to recognize the object. That is the sensing mechanism saying: these are the features we want to sense.

So if you take a classic image such as Lena, the brain would be looking mostly at the edge map, since edges are what it automatically tries to pick out. Written mathematically, you take the directional derivative. The result can be noisy, because there may be other fine detail in the image even where some of the edges are sharp, so you apply some smoothing around it, which is essentially what a Canny edge detector does.
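The combination of smoothing and directional derivatives can be sketched in a few lines. This is only an illustration, not a full Canny detector: it omits non-maximum suppression and hysteresis thresholding, and the function names, the Gaussian width and the toy step-edge image are all assumptions made for the example.

```python
import math

def gaussian_kernel(sigma=1.0, radius=2):
    """1D Gaussian weights, normalized to sum to 1."""
    w = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def smooth_rows(img, kernel):
    """Filter every row with the kernel, replicating the border pixels."""
    r = len(kernel) // 2
    width = len(img[0])
    return [
        [sum(kernel[k + r] * row[min(max(x + k, 0), width - 1)] for k in range(-r, r + 1))
         for x in range(width)]
        for row in img
    ]

def transpose(img):
    return [list(col) for col in zip(*img)]

def edge_strength(img, sigma=1.0):
    """Gaussian smoothing (rows, then columns) followed by central-difference derivatives."""
    k = gaussian_kernel(sigma)
    sm = transpose(smooth_rows(transpose(smooth_rows(img, k)), k))
    h, w = len(sm), len(sm[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (sm[y][x + 1] - sm[y][x - 1]) / 2.0   # derivative along x
            gy = (sm[y + 1][x] - sm[y - 1][x]) / 2.0   # derivative along y
            mag[y][x] = math.hypot(gx, gy)             # gradient magnitude
    return mag

# Toy image (an assumption): dark left half, bright right half, i.e. a vertical step edge.
image = [[0.0] * 8 + [1.0] * 8 for _ in range(16)]
edges = edge_strength(image)
strongest_column = max(range(16), key=lambda x: edges[8][x])
```

The response is largest at the columns adjacent to the step and vanishes in the flat regions, which is the behavior an oriented edge feature is meant to have.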

In modern terms we call these hand-crafted features, but maybe they are actually brain-crafted features, because our brain wants to see a particular kind of structure that is easy for it to understand.

In the current age, the feature space is called C(X, Y). If we don’t know the features in a machine learning problem, we supply the image itself as training data and define a set of convolutional features. We set up a series of these, put them inside the optimizer, and let the machine learn them. The idea is that we should not prescribe what the machine sees; rather, as with a brain, we should let the machine figure it out.

This is called end-to-end training, and at the end of it you get the set of features that is optimal for the given cost function. So there are no more hand-crafted features. With convolution there is invariance to translation, but not to rotation. In recognition this aspect is neglected by most people, because shifting an object a little does not matter: it is the same object. But if you are doing regression, for example estimating an angle or a depth, then the shift does matter.
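A small one-dimensional sketch shows why a shift is harmless for recognition but not for regression. The kernel, the signal and the helper names are assumptions chosen for illustration. The filter response simply moves along with the input, so a pooled (maximum) response is unchanged, while any quantity that depends on where the response sits does change.

```python
def correlate1d(signal, kernel):
    """Valid-mode sliding dot product, the core operation of a convolution layer."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]

def shift_right(signal, s):
    """Shift by s samples, padding with zeros on the left."""
    return [0.0] * s + signal[:len(signal) - s]

kernel = [1.0, 2.0, 1.0]
signal = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = shift_right(signal, 3)

resp = correlate1d(signal, kernel)
resp_shifted = correlate1d(shifted, kernel)

# Recognition-style summary: global max pooling ignores position.
pooled = max(resp)
pooled_shifted = max(resp_shifted)

# Regression-style summary: the location of the peak follows the shift.
peak = resp.index(pooled)
peak_shifted = resp_shifted.index(pooled_shifted)
```

Here `pooled` and `pooled_shifted` are equal, whereas `peak_shifted` differs from `peak` by exactly the applied shift of 3 samples.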

Let’s take the example of a fingerprint in a biometric system such as Aadhaar. When this is linked to an attendance system, you expect the fingerprints to be matched with very good accuracy. But one can quite easily prepare a gelatin mold and use it like a mask to fool the system and mark attendance. Such spoofs have already happened in many places, and if you look at the two corresponding fingerprints, they look almost identical. Interestingly, even though we cannot see the difference ourselves, a machine trained in this way can find it.

So, if it detects a fraud, the machine does not even bother to do the authentication, because this is a fraud anyway. In other words, there is a certain set of features that we are not able to perceive but the machine can pick up. The opposite case also exists. Take a CAPTCHA challenge, which a human can solve but, the claim is, a machine cannot, so a robot fails to get into the system or automate it. Here the human is very good at extracting the features of the image but the machine is not.

Let’s go to the very classical features, which are hand-crafted. Take the histogram, which is basically just a distribution and is trivially permutation invariant: you can permute the pixels at random and the histogram will still be the same. Then there are different kinds of spatial moments, which used to be very common in the earlier bin-picking problems because they can be made invariant to rotation, translation, and so on. Perspective invariants also exist in 3D, so one can use the cross ratio and related quantities.
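The permutation invariance of the histogram is easy to check. In this minimal sketch the pixel values are random numbers standing in for an image, and the bin count is an arbitrary assumption.

```python
import random
from collections import Counter

def histogram(pixels, bins=4, max_value=256):
    """Count how many pixels fall into each of `bins` equal-width intensity bins."""
    width = max_value // bins
    counts = Counter(p // width for p in pixels)
    return [counts.get(b, 0) for b in range(bins)]

rng = random.Random(0)
pixels = [rng.randrange(256) for _ in range(64)]   # stand-in for an 8 x 8 image
scrambled = pixels[:]
rng.shuffle(scrambled)                             # random permutation of the pixels

same_histogram = histogram(pixels) == histogram(scrambled)
```

The flip side of this invariance is that the histogram discards all spatial layout: the scrambled image would look nothing like the original, yet the feature cannot tell them apart.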

How Many Dimensions to See?

Many people have used histograms of gradients, and they are very popular because one gets to build in some knowledge. The co-occurrence matrix, texture models, eigenshapes and so on are all classical methods, and all of them found use in the earlier days. If you use these methods, the next question is the dimensionality of the features, because whoever creates the features needs to know the appropriate dimension. The smaller the dimension, the easier it is to handle. The claim is that low-dimensional features are robust to noise, but they may end up being less discriminative because you may have lost certain information; a technique such as PCA can be used for the reduction. On the other hand, there is sometimes a tendency to enlarge the feature space, making the dimension as high as possible.

We used to talk about the curse of dimensionality and so on, but sometimes a higher dimension gives better discriminative ability. So you look for the representation in which the data can actually be classified best. The key problem is that when the dimension is too high, the classifier has to be learned in that space as well. Whether it is a classical supervised method or a modern one, how much training data do you need? You may not have that much. I will give an example where we have very few training samples; this corresponds to what is called the small sample size problem, and the question is how to handle it.

Take the example of the person re-identification problem. Say you have two or three gates at IIIT here; somebody comes in through gate 1 and maybe goes out through gate 2 or gate 3, and you want to find out whether the same person has gone out or not. In this case I don’t want to recognize the person, since that is a different task; I simply want to see whether these two people are the same.

So you have to decide whether the person in the query matches any of the people who have passed through in, say, the last one hour. That is the problem. If you look at the literature on solving it, you will see people doing a lot of things, such as taking horizontal strips and computing many different types of histograms within each strip. Even though the original image is only about 60 x 100, that is, around 6000 to 7000 pixels, you typically have hardly 4, 5 or 6 training samples for each person. Moreover, you have to train on one group of people and then match another: one cannot train as people are coming in, so training is done with the rest of the data, and when new people come in one has to do the corresponding matching.

Then, from another group of researchers, there is something called the LOMO feature. It does a lot of processing, and its dimension is around 27000, whereas the actual number of pixels is only a few thousand, so it is a huge dimension. Another popular one is the GOG feature, which has almost 30000 dimensions. If you use it, you have to learn the classifier in this high-dimensional space, but the number of training samples per person is hardly 5 or 6.

For the small sample size problem, one of the things people try is a null-space transform. This is a very classical linear method, so there is no non-linearity here. What you do is basically the classical Fisher discriminant: you have the between-class variance on the top and the within-class variance at the bottom, and you want to maximize the ratio. If I can make the denominator zero, then the ratio is maximized, provided the top one is positive. And since it is a variance, it can be zero but it cannot be negative, so you are guaranteed that.

If you solve it, you are finding a linear subspace onto which you project these points. Suppose it is a C-class problem, where C is the number of people coming in, in our case. That number is much smaller than the feature dimension, maybe 500 or 1000, but the training samples for each person may be very few. Interestingly, all the training points of a class map to a single point in this C-dimensional space, and once a class has collapsed to a single point, the corresponding within-class covariance in the denominator becomes zero. That is the key idea: during learning, each class in the training data maps to a single point, and you can do this for any number of classes.
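A toy numerical sketch may help with the idea of driving the within-class term to zero. The three-dimensional points and the two projection directions below are made up for illustration; this is not the procedure used in the re-identification work, only the criterion evaluated on a projection direction w, with the between-class scatter as numerator and the within-class scatter as denominator.

```python
def project(points, w):
    """Scalar projection of each point onto the direction w."""
    return [sum(p_i * w_i for p_i, w_i in zip(p, w)) for p in points]

def mean(values):
    return sum(values) / len(values)

def scatter_terms(classes, w):
    """Return (between, within) scatter of the 1D projections onto w."""
    projected = [project(pts, w) for pts in classes]
    class_means = [mean(v) for v in projected]
    grand = mean([x for v in projected for x in v])
    between = sum(len(v) * (m - grand) ** 2 for v, m in zip(projected, class_means))
    within = sum((x - m) ** 2 for v, m in zip(projected, class_means) for x in v)
    return between, within

# Two classes with two training samples each (hypothetical data).
class_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
class_b = [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0)]

w_null = (0.0, 1.0, 0.0)    # orthogonal to every within-class difference
w_other = (1.0, 1.0, 0.0)   # an arbitrary direction, for comparison

between_null, within_null = scatter_terms([class_a, class_b], w_null)
between_other, within_other = scatter_terms([class_a, class_b], w_other)
```

Along `w_null` both samples of each class land on the same value, so the within-class scatter is exactly zero while the between-class scatter stays positive; along `w_other` the within-class scatter is non-zero. With very few samples in a very high-dimensional space, many such null directions exist, which is what makes this construction possible.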

The problem is that when you actually test, what matters is the numerator: test points get shifted away from those class points, and you may end up with a lot of errors. There are ways to try to rectify this, but it remains a small sample size problem. The interesting thing is that, since the map is linear, each point defines a hyperplane with a left and a right side, and if I do this for all the points of a class I am carving out a kind of convex hull of them; all the points within this convex hull map to that same point.

Now imagine that the classes overlap, so that some points of one class lie within the convex hull of another. Then both classes merge into one point, which means the two classes cannot be distinguished: they map to the same point, and that is a limitation. In the machine learning framework it is sometimes good that we can have non-linear boundaries, but whether the data is actually classifiable or not is not something that can easily be established there. Here, these two classes cannot be discriminated, period, and this we can tell you with certainty.

That is the advantage of this kind of analysis: it shows what constraints the features must satisfy before you try to solve the problem. On the other hand, look at the reverse of the problem. So far we tried to reduce the data to a C-dimensional space by projection. The reverse is the kernel method, another popular method that has been around for the last 15-20 years. In this method you blow up your feature space, so that a classification that is linear in the larger space can be non-linear in the original one.
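The effect of blowing up the feature space can be seen on the smallest possible example. The sketch below uses an explicit feature map rather than an implicit kernel, and the map, the weights and the four data points are assumptions chosen for illustration. XOR-style labels cannot be separated by a line in the original two dimensions, but a single hyperplane suffices once the product term is added.

```python
def lift(x1, x2):
    """Explicit feature map: append the product term to the original features."""
    return (x1, x2, x1 * x2)

def linear_score(features, weights, bias):
    return sum(f * w for f, w in zip(features, weights)) + bias

# XOR-style labels: +1 when the two signs differ, -1 when they agree.
data = [((1, 1), -1), ((-1, -1), -1), ((1, -1), 1), ((-1, 1), 1)]

# In the lifted space one hyperplane, depending only on x1 * x2, separates the classes.
weights, bias = (0.0, 0.0, -1.0), 0.0
predictions = [1 if linear_score(lift(*x), weights, bias) > 0 else -1 for x, _ in data]
labels = [y for _, y in data]
```

The price is the one noted in the text: the lifted representation is larger, and with realistic feature maps the extra computation grows quickly.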

I project the data into some higher-dimensional space, then I use some kind of classifier there and hopefully I am able to separate the classes. Many people do this, but of course it needs computational resources and becomes a lot more expensive, which is a problem particularly if you have many classes.

All of this assumes that we have some idea of what a good feature is. Now I will give you an example where I have no idea what the feature is, but we are still able to solve the problem. Take a large volume of video footage, or hyperspectral data, where you have some 250 bands. This is a data volume. You take, let’s say, 8 or 10 such frames (or the corresponding bands), randomly select a set of points from them, and collapse them into a single snapshot, which is called a coded snapshot. Then you take the next 8 or 16 and form another snapshot. If a value is greater than some threshold you keep it, otherwise you set it to zero, which is basically thresholding. You can use an error diffusion process, as in halftoning, to improve it. The result is a one-bit plane, and since each plane is one bit but a pixel can hold 8 bits, I can again squeeze 8 of them into one frame. So now this single image holds the data of all these video frames. Written mathematically, you take the video volume or the hyperspectral volume, apply some functional mapping, keep just the sign bit after thresholding, and that is your observation.
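The acquisition side of this scheme can be sketched as follows. The details are assumptions made for the example: each pixel of a snapshot is taken from one randomly chosen frame of its group, the threshold is a fixed level, and random numbers stand in for the video. The learned decoder that recovers the frames is not shown.

```python
import random

def coded_snapshot(frames, rng):
    """Collapse T frames into one: each pixel is sampled from one randomly chosen frame."""
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    code = [[rng.randrange(t) for _ in range(w)] for _ in range(h)]
    snap = [[frames[code[y][x]][y][x] for x in range(w)] for y in range(h)]
    return snap, code

def threshold(snap, level):
    """One-bit quantization: 1 where the value exceeds `level`, else 0."""
    return [[1 if v > level else 0 for v in row] for row in snap]

def pack_bitplanes(planes):
    """Pack up to 8 one-bit planes into a single 8-bit image, plane b in bit b."""
    h, w = len(planes[0]), len(planes[0][0])
    return [[sum(planes[b][y][x] << b for b in range(len(planes))) for x in range(w)]
            for y in range(h)]

rng = random.Random(0)
T, H, W = 8, 4, 6
# 8 groups of 8 synthetic frames each, standing in for a 64-frame video.
video = [[[[rng.randrange(256) for _ in range(W)] for _ in range(H)] for _ in range(T)]
         for _ in range(8)]
planes = [threshold(coded_snapshot(group, rng)[0], 127) for group in video]
packed = pack_bitplanes(planes)
```

Note how little arithmetic the sensing side needs: a random selection, a comparison and a bit shift per pixel. All the heavy computation is pushed to the decoder.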

If you look at this observation, there is nothing you can make of it: it looks like a random pattern. But you can use a machine learning technique to decode it and get the video sequence back. This matters because, if you are flying a drone connected to a satellite, battery life is very important, and any encoding you do on board costs battery for computation as well as for transmission. Here there is effectively no encoding, because a conventional encoder has a decoder in its loop, and this scheme needs almost zero computation on board.

Look at these three cases. This is HEVC, one of the state-of-the-art video encoders, which is commercially available, and this is the one-bit scheme. Some redundancy can still be removed by entropy coding, but if you play the result you get a very similar thing. I could reconstruct it from one frame, and if you plot the performance, it is not too different, while the saving in compression time is significant. And that includes the entropy coder; if you remove the entropy coder we have almost no computation.

So in this case we have no idea what these features look like because manually we cannot even figure out what is there.

Let’s look at another problem. In certain cases I am not interested in the entire picture. There is a class of problems called co-segmentation: I give you a set of pictures and ask whether there is anything common to them, and if so, to extract it as the foreground. Within the set there could be outliers that are completely different, and nothing should be extracted from those. It should still be possible to find the boundaries of the common object. One of the simple ways people do this is with superpixel segmentation: you have a segmentation map whose segments are fairly uniform, and for each segment you compute some hand-crafted feature (256-dimensional data) and find out, for the entire set of images, where the segments map in the feature space.

So it is like a distribution function, and in the distribution you need to find a mode. If that mode exists, it is the commonality you are looking for. The problem turns out to be a very classical statistical one, finding the mode of a set of data, except that the data is very high-dimensional, which makes mode estimation very difficult. Still, you can find the mode, this is unsupervised, and the mode translates back to the common object. So in this case I have an idea of the feature, namely that the features of the common object will lie close together, and I can get the result. But then I look at the other side and ask: can I train it?
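Mode finding can be illustrated with mean shift, a standard mode-seeking procedure; the text does not say which estimator was actually used, so this choice, the one-dimensional data and the bandwidth are all assumptions. The real problem lives in a 256-dimensional space, where the same iteration applies but the density is far sparser.

```python
import math

def mean_shift_mode(points, start, bandwidth=1.0, iters=100, tol=1e-9):
    """Climb to a mode of a Gaussian kernel density estimate by repeated weighted averaging."""
    x = start
    for _ in range(iters):
        weights = [math.exp(-((p - x) ** 2) / (2 * bandwidth ** 2)) for p in points]
        new_x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        if abs(new_x - x) < tol:
            break
        x = new_x
    return x

# Hypothetical 1D feature values: a tight cluster near 5 (the "common object")
# plus a few scattered values (background segments and outlier images).
features = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 0.5, 9.7, 2.0, 12.3]
mode = mean_shift_mode(features, start=4.0)
```

The iterate settles very close to the center of the dense cluster and is barely affected by the scattered values, which is the sense in which the mode captures what the images have in common.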

Finding the features through mode estimation is computationally expensive, since it is a very high-dimensional and very sparse space. What you can do instead is take a simple network, show it the images, and train it with a defined mask. Whenever the object is common you let it learn: if it is the same object you update, and if it is not the same object you do not. There is a decision process that I will not go into, but for an outlier the predicted mask should be null. If you do this, on the left you have the unsupervised approach and on the right the supervised approach, which is of course much better and much faster. Training may take time, but the actual processing is much faster.

Here is another example: you give it all these pictures and it figures out the part they have in common. The advantage of mode estimation is that we have some idea of what we are looking at, whereas in the trained case we do not know what is happening. That brings us to the next kind of problem, the use of semantic information in training. If you have an image of a sofa or a cat that you want to classify, you also supply some semantic information saying that this is a sofa and this is a cat. It is seen that the accuracy of your classifier improves. To do that, the semantic label has to be represented somewhere as a vector.

Now if I don’t call it a cat and instead call it billi (the Hindi word for cat), the vector would be different, but ideally it should not matter. So why not flip the labels: I call the sofa a cat, re-train, and what you find is that it again improves. So this is the question. You may say that as human beings we relate the names to the objects, but you have to understand why the improvement is happening.

Let’s look at what I call non-sensifying the semantics. If I have a set of data that is quite cluttered and I want to classify it, the idea is to have a neural network map the data so that the classes get separated at least a little. What it uses can be called a guide set or prototype. The semantic information I am giving is just a vector here; there is another vector here, and another here. So a better way of giving this information is to make the vectors as far apart as possible, because then the classes can be mapped much further apart, and then the name does not matter. So let’s choose a set of codes in which the Hamming distance between all these vectors is large, and just randomly assign them to the classes. You put in five of them, see that they are well separated, and interestingly your accuracy goes up by around 5% or 10%, depending on the case.
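Generating such meaningless but well-separated targets is straightforward. In the sketch below the code length, the minimum distance, the class names and the rejection-sampling strategy are all assumptions made for illustration; the point is only that the targets carry no semantics, just mutual separation.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def random_codes(num_classes, length, min_distance, rng, max_tries=10000):
    """Draw random binary codes, keeping only those far from every code kept so far."""
    codes = []
    for _ in range(max_tries):
        candidate = tuple(rng.randrange(2) for _ in range(length))
        if all(hamming(candidate, c) >= min_distance for c in codes):
            codes.append(candidate)
            if len(codes) == num_classes:
                return codes
    raise RuntimeError("could not find enough well-separated codes")

rng = random.Random(0)
class_names = ["cat", "sofa", "dog", "car", "tree"]   # hypothetical labels
codes = random_codes(len(class_names), length=32, min_distance=10, rng=rng)
rng.shuffle(codes)                                    # the assignment itself is arbitrary
targets = dict(zip(class_names, codes))
min_pairwise = min(hamming(a, b) for a, b in combinations(codes, 2))
```

A network would then be trained to regress each image onto the code of its class. Swapping which code goes with which name changes nothing about the separation, which is the argument being made here.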

Now suppose you have good features and I want to borrow them. What does that mean? You have solved some problem where you did a good job and classified your data very well. If my data is bad, then instead of mapping to a vector I simply use your data for a particular class as a guide and map my class to it; since your classes are all nicely separated, I just let the network learn, and mine will separate out as well. If you try it, this also works quite well and improves accuracy. So what matters is not the exact value of the prototype but the separation and how well you can train. Remember, the first assumption is that any kind of mapping can be trained properly; as long as that holds, you should be able to map your data to any set of features that is well separated. That’s why you call it like booknet, because there is guided clustering.

These are the different classes of pictures you want to separate, and what you use as a guide are classes from a well-trained problem (MNIST data). These are basically handwritten digits, which today you can classify with almost 100% accuracy. You simply say that the class on the left maps to 4 and the one on the right maps to 7, and that works better.

Given all this, remember that in the machine learning space, features such as those of a CNN look like the right thing, because they are trained end to end. They are the optimal set of features, so it looks like a panacea. I told you that a CNN is shift invariant, but there is a penalty associated with that, and that is what we wanted to see. Suppose the features you are using have a relative shift that depends on the edge. Forget about the non-linearity in a CNN for a moment; what remains is basically convolution. Convolution changes the function: the zero crossings, meaning the points where the derivatives are zero, get shifted. Does this shift have an effect? The shift will depend on the data. So rather than look at it from the recognition side, we wanted to study it as a regression problem.

One possible setting is this: you are looking at a 3D object and next you want to see it from somewhere else. You want to select the next view. If you are seeing the object from here, you do not want the next view to be very close, because that does not add much information; you want it a little further off. If it is too far off, then, for a human or a machine alike, the two views will be very different and you will not be able to match them. It cannot simply be the view from 180 degrees away, so that is not the solution either.

So this is called the next view problem, and I need to find out where to go for the next view. It is a kind of regression; don’t worry about the network architecture used to solve it. But look at the accuracy in terms of IoU, which is the area common to the prediction and the ground truth divided by the area of their union. It is quite low: even the best result is around 0.6, and we are not able to go beyond that. This means an inaccuracy is creeping in. So you have an encoder-decoder problem where the decoding works fine but the regression does not, because of this problem of shifting.

Sometimes you do not look at an object as an entity in itself; rather, the entities at different places in the picture together constitute a notion of what the scene is. For example, in the picture on the left you are really looking at an airport, not just a plane, and similarly there could be cases where something is next to a beach but not a desert, so the label relates to a combination of entities.
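For reference, the IoU measure mentioned above can be computed as follows. The sketch uses axis-aligned boxes for simplicity, which is an assumption; the same ratio of intersection to union applies to masks or volumes. The example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

truth = (0.0, 0.0, 10.0, 10.0)
shifted = (2.0, 0.0, 12.0, 10.0)   # the same box displaced by 2 units along x
score = iou(truth, shifted)
```

In this example the intersection is 80 and the union is 120, so a displacement of only one fifth of the box width already brings the IoU down to two thirds. This is why small, data-dependent shifts are costly in regression even when they are harmless in recognition.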

To capture this, you need to bring in the relationships between all the adjacent areas.

One of the very common ways of doing this is to use what is called a GCN (graph convolutional network). CNN methods are okay and seem to work fine, but the problem is that a CNN expects a regular grid, whereas here the areas in the picture are uneven; in a graph, however, these features can be defined at every node. You then have to learn the relationships between the labels: you can find that certain things are related, while something very far off may not be. There is a difficulty here, because every graph has something called the Laplacian, which defines the type of relationship for a class of graphs. But here the Laplacian is different for every picture, so certain things cannot be done exactly and you have to find some approximation. You can also say that the relationship, the edge itself, can be estimated, and that is something people try as well.
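To make the Laplacian concrete, here is a minimal sketch that builds the combinatorial Laplacian L = D - A (degree matrix minus adjacency matrix) for two small region-adjacency graphs. The graphs are hypothetical, and practical GCNs usually work with a normalized variant of this matrix.

```python
def laplacian(num_nodes, edges):
    """Combinatorial Laplacian L = D - A of an undirected, unweighted graph."""
    L = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        L[i][j] -= 1      # off-diagonal entries: minus the adjacency
        L[j][i] -= 1
        L[i][i] += 1      # diagonal entries: node degrees
        L[j][j] += 1
    return L

# Two hypothetical pictures, each segmented into four regions (nodes).
picture_1 = laplacian(4, [(0, 1), (1, 2), (2, 3)])          # a chain of regions
picture_2 = laplacian(4, [(0, 1), (0, 2), (0, 3), (1, 2)])  # one region touches all others
```

Even with the same number of regions, the two pictures give different Laplacians, so a filter defined through the spectrum of one graph does not carry over directly to the other. This is the difficulty described above.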

Now here are some problems that have been reported. In the first one, the person shown is a very well known researcher from MIT. You give a speech sample and the network comes up with how the person looks. Suppose I have never seen Professor Sadagopan but I talk to him; this system can say that this is how Professor Sadagopan looks. They claim to have done a good job, but this is quite strange in some sense, since by tomorrow somebody may say: here is the DNA of a person, and this is how they look.

In the next one, there is a Maharaja Thali, and the claim is that the network can tell you its calorie value so that you can choose your diet better. In the third one, a person is carrying some weight, and they have trained the network so that by looking at the picture it can say how much weight the person is carrying. They claim they are able to do it, and I think there is some justification from the human side, because it is probably the bending of the back that tells us how heavy the load is. But of course that depends on the individual: something that is very difficult for me to carry may not be for a bodybuilder. These are the examples I have, and personally I am really puzzled by them. My question is: what are the underlying features that let you take a decision or draw an inference about what we are looking at?

I have no answer to that. And since we are talking about this, consider different classes of objects. In the earlier days we used to define boundaries that were nice, maybe linear or, let us say, elliptical; we did make errors sometimes, but catastrophic errors were fewer. But now suppose you have these kinds of classes and you keep tweaking your boundary. Sometimes there could be overfitting, and in the feature space you do not know who the neighbor is. With a small error in the data, or a small perturbation, a person may end up being recognized as a monkey. The two look very different, and this becomes an ethical problem; it has happened several times that somebody with a darker complexion was labelled as a monkey.

People then raise this as an ethics issue, or question whether we should be doing this at all, but I don’t think that is the real problem. Rather, the problem is that in learning the mapping we have no idea what the manifold of the feature space looks like. Two points may be very close in the feature space, and suddenly you traverse into something that is very different. Look at the two shoes, for instance: there is a different class coming in between them. You are happy with your classifier, and you say this is the state of the art because the data was recognized, but since you have no idea what is happening and the dimension of the feature space is very high, you may make this kind of error, which is socially wrong in some sense, though I don’t think there is anything wrong algorithmically.

To conclude, we discussed how to learn the feature space, touched upon the question of dimensionality, and looked at what many people call semantics. In my opinion it is not really the semantics; you just need some kind of guide to get the classes separated. All of this goes back to what an oriented edge filter is to the visual cortex, the classical result of Hubel and Wiesel. People now claim that CNN features are to a neural network what those filters are to the cortex. Are they the same, or equivalent? We don’t know, and maybe the future can tell us.

           Prof. Subhasis Chaudhuri receiving the
           ACCS CDAC Foundation Award 
           from Prof.  Chatterjee.
           Jailendra Kumar, President ACCS looks on.