Tiptoeing into an empirical formalization of the observer

This piece is a direct continuation of my prior writing on second-order observer formalization. You can read it here: Formalizing observer behind the observer

Since writing that first piece, I've made some progress toward the experimental direction and discovered some fun intuitions and ideas.

The misguided practice of modeling awareness:

As I looked more into how humans perceive the world, I couldn't stop noticing that we do not view any object as a whole at a single instant in time. In fact, our awareness shifts through various parts of the object with no defined trajectory. Awareness of any object larger than a 1-dimensional point needs to traverse it. For example, even a simple object like a short straight line needs to be traversed from one end to the other. As such, I couldn't stop wondering why we segment objects completely while training any form of image or video model. That simply does not reflect how we view the world and our inner selves.

Fig: Shifting awareness

Sounds great and all, but how do we even model this behavior?

To think about this, let's imagine ourselves through the lens of a newborn child who opens their eyes to see a clear blue sky above them. Nothing changes; it's the same blue sky for a few minutes. The vision/representation has low entropy. Then suddenly, a black raven flies across the blue sky. The raven, viewed as an object by itself, is low entropy, but its motion across the low-entropy background increases randomness in the visual frame and thus increases entropy. Let's call this change in entropy caused by a new low-entropy object like the raven "relative entropy". Each attention frame in the visual representation has its own relative entropy, which is mostly 0, and high only in places where the low-entropy object is present.
Fig: Disturbance on a low entropy frame
                    
A nice way to visualize this is to imagine the entire frame as water in a pond; the object causes a disturbance wave as it moves through the pond. The relative entropy can be measured as the strength of this disturbance wave at any point in the visual frame, while how the wave's strength falls off across the spatial axes can be a tunable hyperparameter. So, naturally, it's a good idea to start training on videos and then transfer the learning to similar images.
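The pond analogy above can be sketched numerically. This is a minimal sketch, not the author's implementation: I assume "relative entropy" can be approximated as the magnitude of change between consecutive frames, spread spatially with a Gaussian kernel playing the role of the disturbance wave, with its width as the tunable hyperparameter.

```python
import numpy as np

def relative_entropy_map(prev_frame, frame, sigma=2.0):
    """Sketch: relative entropy as the per-pixel change between two
    consecutive frames, spread spatially like a disturbance wave.
    `sigma` is the assumed tunable hyperparameter controlling how far
    the "wave" spreads from the moving object."""
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    # Build a 1D Gaussian kernel and apply it separably (rows, then
    # columns) to spread the disturbance outward.
    size = int(3 * sigma)
    xs = np.arange(-size, size + 1)
    kernel = np.exp(-xs**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, diff)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)
    return blurred
```

On a static sky the map is zero everywhere; a single moving raven-like pixel produces a localized bump peaking at the object, matching the "mostly 0, high only near the object" intuition.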

What would this training process look like?

To start, we need to sample training videos and manually produce, for each one, a video of the same length and aspect ratio in which the motion of an awareness blob represents human awareness. The neural net would take the training video as input and predict the motion of the awareness blob. The predicted motion would then be compared to the video of correct awareness motion: the error gradient would be the difference between the predicted awareness vector and the actual one, propagated as usual. This way, we train the model to match the direction of awareness just like humans do.
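The training step above can be sketched as a toy supervised loop. Everything here is a stand-in of my own choosing, not the author's setup: the "videos" are random feature sequences, the annotated awareness motion is a sequence of (x, y) blob positions, and the "neural net" is a single linear layer trained by plain gradient descent on the predicted-minus-actual error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: 32 videos, 10 frames each, 8 features per frame.
# The target is the annotated awareness-blob trajectory: one (x, y)
# position per frame.
n_videos, n_frames, feat_dim = 32, 10, 8
videos = rng.normal(size=(n_videos, n_frames, feat_dim))
true_W = rng.normal(size=(feat_dim, 2))
targets = videos @ true_W  # "correct awareness motion", shape (32, 10, 2)

W = np.zeros((feat_dim, 2))  # hypothetical linear "awareness predictor"
lr = 0.1
for _ in range(300):
    pred = videos @ W            # predicted blob motion
    err = pred - targets         # predicted minus actual awareness vector
    # Mean gradient of the squared error, propagated as usual.
    grad = np.einsum("vfd,vfk->dk", videos, err) / (n_videos * n_frames)
    W -= lr * grad

mse = np.mean((videos @ W - targets) ** 2)
```

A real version would replace the linear map with a video model and the random tensors with annotated footage, but the loss structure (match the blob trajectory frame by frame) is the same.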

Would this transfer easily to images?

I doubt simple inference would be enough. Since an image is just a single video frame, we would be inferencing on a much less rich representation. Simple transfer would run into various problems because on an image the "relative entropy is constant everywhere".

Okay, how do we resolve this issue? 

The good thing is that even humans face the same problem. If you had no priors about the world and were shown an image, your awareness blob would be all over the place. Luckily, the bias induced by a repetitive dataset during training acts as a "Bayesian prior". What I mean is: if you train the model on videos of drawing digits (0-9) on thin white paper and then make predictions on an image containing digits, the awareness blob would likely point toward the digit in the image. Of course, this also means you need a lot of diverse training before the model gets a general idea of what to focus on. Curating a dataset for this might be a nightmare.

I beg to differ. There's a simple trick we can use to guide its awareness in real time. That trick is, ahem, drum roll: "reinforcement learning", aka RL. But hold on, this is not your usual RL. The task is not verifiable, so we use human feedback, though not in the usual way. An elegant approach would be to first train a voice model that classifies sounds into "Yes", "No", and "Other". Then, while training on a real-time video stream, if the model positions its awareness blob correctly we simply say "Yes" for positive feedback, "No" for negative feedback, and anything classified as "Other" is ignored. Depending on the feedback it receives, the model updates its weights.
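The feedback loop above can be sketched with a REINFORCE-style update, which is my choice of algorithm, not the author's. I assume the blob can land in one of a few discrete regions, the voice classifier is mocked by a simple function, and "Yes"/"No"/"Other" map to rewards of +1/-1/0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: the blob lands in one of 4 regions; the human
# says "Yes" only when it lands on region 2 (where the digit is).
FEEDBACK_REWARD = {"Yes": 1.0, "No": -1.0, "Other": 0.0}

def human_feedback(region):
    # Stand-in for the trained voice classifier.
    return "Yes" if region == 2 else "No"

logits = np.zeros(4)  # policy over blob positions
lr = 0.5
for _ in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    region = rng.choice(4, p=probs)     # place the awareness blob
    r = FEEDBACK_REWARD[human_feedback(region)]
    # REINFORCE: grad of log pi(a) is onehot(a) - probs; "Other" (r=0)
    # leaves the weights untouched, i.e. the feedback is ignored.
    logits += lr * r * (np.eye(4)[region] - probs)
```

After a couple hundred noisy verbal rewards, the policy concentrates its blob on the region the human kept approving, which is exactly the real-time guidance the paragraph describes.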

"I smell continual learning" 

Everything looks great, are we done? 

To be honest, not quite. There's another thing that seems to be missing in current frontier models, and that is the sense of time. All of the training we do, whether in the form of text, audio, images, or video, encodes some form of position in space. But we live in spacetime, not just space; at least, our most accurate model of the universe is spacetime. This is why the notion of time plays a very important role in creativity. For example, for a caveman to draw art in a cave, he must have a strong notion of past, present, and future. The art is meant to preserve the present or the past, which is possible only if the caveman makes sense of all three. This kind of time-based abstraction allows more creative exploration of, say, the idea space. The more precise the notion of time, the more complex and creative our ideas become. So a wavy conjecture would be: "the period when human creativity suddenly exploded correlates strongly with when humans developed a finer precision of time".

A Bayesian prior is completely focused on the repetition of information, not on when the information was acquired. We tend to remember recently learned information quite well, whereas we recall long-term memories only if we repeated them often or the brain classified those moments as essential. My point is that "time acts as a filter for Bayesian priors". So models should also somehow account for when a piece of information was learned: if some information was repeated a few years ago but not recently, the model should give it lower weight. Again, most of this is empirical with a good chance of failure.
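The "time as a filter" idea admits a one-line sketch. The multiplicative form and the exponential decay are my assumptions for illustration; the author only specifies that recency should down-weight old, un-repeated information.

```python
import math

def memory_weight(repetitions, last_seen_days_ago, decay=0.01):
    """Sketch: the repetition count is the Bayesian-prior part; the
    exponential term is the time filter. `decay` is an assumed
    hyperparameter controlling how fast old memories fade."""
    return repetitions * math.exp(-decay * last_seen_days_ago)
```

Under this scheme, information repeated 100 times but last seen ~3 years ago ends up weighted below information seen only 3 times yesterday, which is the behavior the paragraph asks for.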
