"in the wild" here just means without any special studio or setup for calculating objects' positions, just casual everyday photos.
If we know our light source, the brightness at a point in the image depends on the surface normal, so we can work backwards from the brightness in the image to reconstruct the normals and build the 3D mesh.
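To make the idea concrete, here's a tiny sketch of shading-based normal recovery under a Lambertian assumption. The light directions and brightness values below are made up for illustration; with three or more known lights (photometric-stereo style), the normal at a pixel falls out of a least-squares solve.

```python
import numpy as np

# Hypothetical setup: Lambertian shading i = albedo * (n . l) with
# albedo = 1 and three known, normalized light directions.
L = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
L /= np.linalg.norm(L, axis=1, keepdims=True)

true_n = np.array([0.0, 0.0, 1.0])   # the normal we pretend to observe
i = L @ true_n                        # brightness measured under each light

# Work backwards: solve L n = i for the normal, then renormalize.
g, *_ = np.linalg.lstsq(L, i, rcond=None)
n = g / np.linalg.norm(g)
```

With a single image and a single light there are many normals consistent with one brightness value, which is where priors have to break the ambiguity.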
This problem is hard because technically there could be infinitely many different answers; it's simply a matter of which one seems the most realistic.
This seems to demonstrate the point from a few slides ago that one of the challenges of this kind of prediction is how data hungry it is. The training data might need a lot of data points with occluded legs and labels that tell the model what the legs should look like when they're obscured.
Does this mean that the model has to have a lot of context about all of the different objects that exist in the world? For example, how would it know what the back of a bench or bike looks like?
Were the backs of these people automatically generated? If so, it's so cool that the computer was able to infer what the backs might look like based on the front!
Having taken the VR and UCBUGG decals, I'd say both are very hands-on and are related to many of the topics explored in this class. If I had more time at Cal, I'd definitely be interested in taking the game design decal as well!
I was wondering the same thing! Did some quick research and it turns out that our ability to perceive depth is highly reliant on our "binocular vision", or the ability of our brains to put images from each of our eyes together. Because each eye is located at a slightly different position, the eyes relay images from slightly different angles and the brain is able to synthesize them to more accurately judge an object's distance. There are other factors that play a part in our depth perception, including monocular (one eye) cues like relative object size and motion that do not rely on binocular vision - but binocular vision plays a huge role, which is why people who are only able to see through one eye have a substantially decreased ability to gauge depth. (Try playing a sport with one eye closed to see how it affects you!)
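To put a number on the binocular cue: under a simple pinhole stereo model, depth is inversely proportional to the disparity between the two eyes' images. A tiny sketch (the focal length is a made-up illustrative value; the baseline is a typical interpupillary distance):

```python
# Simple stereo depth sketch: Z = f * B / d, where f is the focal length
# (in pixels), B the baseline between the two eyes, and d the disparity
# (pixel shift of the same point between the left and right images).
f = 800.0    # hypothetical focal length in pixels
B = 0.064    # ~64 mm interpupillary distance, in meters

def depth_from_disparity(d_pixels):
    return f * B / d_pixels

near = depth_from_disparity(51.2)  # large disparity -> close object (1 m)
far = depth_from_disparity(5.12)   # small disparity -> far object (10 m)
```

The formula also hints at why one-eyed depth judgment suffers: without a second view there is no disparity d to measure, leaving only the monocular cues.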
Another interesting question would be not only whether we are able to automate models, but whether AI can place those models in reasonable locations to resemble a city. I feel like scene composition might still be pretty far off for AI.
Our depth perception is finely tuned to the two slightly different images that our eyes see. Especially with XR headsets that try to recreate a 3D environment with 2D images, we can learn some interesting facts about how our eyes work, like the fact that our depth perception is likely highly dependent on environmental cues. There's an interesting paper here: https://ieeexplore.ieee.org/document/8985328
It's interesting to think about what exactly knowledge of "what the human body is like" entails. In addition to visual similarities to images we have seen before, there's also intuition about the physical world, ideas about how we would recreate the pose ourselves if we wanted to, and even external context (e.g. "skateboarding"). Things like these let us guess at the positioning of the hidden arm in the picture. How much is learned from our experience after birth, and how much is in-born?
I find it pretty wild that this one function can describe every point in the universe at any time it exists.
About the blind spot: something very interesting you can do is cover up one of your eyes, then hold out your thumb at arm's length. Stare at one spot and move your thumb around your field of view until you can't see it anymore. Now it's in your blind spot.
It's definitely interesting to see all these possible explanations for this image, especially the sculptor's explanation. It shows how complex this problem is, and how amazing that computers are even able to guess which explanations are more likely than others.
How close are we to being able to automate creating these models rather than paying people to draw them? Asking because I always thought this sort of tech was very far away from where we are right now.
The website associated with this is pretty cool -- you can render an image from all sorts of viewing angles in real time.
@philippe-eecs I'm not sure about the specifics on the model being used here, but generally the state-of-the-art neural nets are somewhat able to generate 3d models of the human body without any prior knowledge of the specific human. Prof. Kanazawa has a more recent paper here that proposes an algorithm to do this: https://akanazawa.github.io/hmr/
I'm going to take CS 194 this fall, which is something that I've been waiting to take since I saw someone doing an assignment for it during my freshman year.
Does this mean that early VR involved scanning what is essentially a QR code for every angle of the room, every time you'd turn your head?
You can invert these Neural Radiance Fields in order to figure out the orientation of an object in some target image.
I wonder then, if this is the case, what gives us the perception of depth?
Does the model need to be trained specifically on videos of athletes playing ping pong? If not, does training the model on ping pong videos give a more accurate mesh recovery for ping pong videos?
It seems to me that distance-based NNs don't necessarily capture the actual 3D shape of the object, since they only care about the closest point to the viewer. If the network takes the viewing direction as a parameter, though, it does need to know the full 3D shape of the object. Occupancy-based networks, on the other hand, do seem to capture the full 3D shape.
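A toy, non-neural illustration of the two representations for a unit sphere, just to pin down the definitions (a real network would learn these functions from data):

```python
import numpy as np

# Occupancy: 1 inside the shape, 0 outside.
def occupancy(p):
    return 1.0 if np.linalg.norm(p) <= 1.0 else 0.0

# Signed distance: negative inside, zero on the surface, positive outside.
# Either function pins down the full 3D surface (as its 0.5 level set or
# zero level set), independent of any viewing direction.
def signed_distance(p):
    return np.linalg.norm(p) - 1.0
```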
Below, when we talked about coordinate-based neural networks, we optimized using gradient descent. Is the optimization problem here similar to that one, or is it more like some sort of convex optimization problem?
@wcyjames, that's totally possible, and in general what the predictions look like under occlusion largely depends on the training data. What I'm more curious about is: assuming the training data contains roughly equal amounts of standing/sitting/lying positions, will the prediction be one of them with some probability, or could it be a monstrous mixture of all of them?
Would it be that hard to put depth cameras (e.g., Xbox Kinects) in places with lots of foot traffic to capture people? Is privacy the main concern?
I learned in CS 294-137 last semester that humans deal with this with a lot of different mechanisms. For example, our two eyes capture different images, and the brain can use the differences between them to infer what the 3D object looks like. The brain also uses the changes in captured images when we look at an object from slightly different angles to infer its 3D properties.
I took CS 294-137 last semester and it was very interesting. It touched on lots of eye-opening concepts (like how senses work) and we got to implement really cool AR/VR applications in Unity which run on our phones. Highly recommended!
I would assume you would infer; I think this is where a prior would be super important. A 2D image would not give you enough information to reconstruct a 3D object, so we'd need hefty priors on the human shape to know that there should be a right arm present behind the person. Or we'd need multiple pictures from different views.
^For the above, yeah, it's definitely possible. GANs can reconstruct faces super well, so I'd imagine constructing a human wouldn't be too different.
This is very similar to learning key points from images, you essentially force the key points into a bottleneck and take a reconstructive loss on the output.
Is there a well-formed latent space from which you can draw human models and bodies? PCA would give you one interesting way to break up the problem, but perhaps a latent space learned by an unsupervised deep learning algorithm would work very well. What information would that latent space hold?
Do you need priors on the 3D model to construct this? Obviously we know a ton about the human shape, but this seems pretty much impossible to do without prior info on the subject you are reconstructing. You would usually need multiple images to properly construct a 3D model.
How do we predict the right arm if it doesn't appear in the image? Would the 3D reconstruction deviate from reality in that case?
It's easy to see that when we observe a 2D projection of a 3D object, it's hard to know exactly what the object looks like. Reconstructing the 3D object then requires other sources of data.
Would those graduate-level courses require additional background besides what we learn in CS184?
I wonder whether some of these classes depend on one another, or whether we can take any of them directly. I'm also very interested in the computational color class, but I can't find much detailed information about it. What kind of project is done in that class?
I think processing videos would be much slower than processing single photos, since it needs much more computation. How do we deal with the speed problem? Do we just use more advanced devices, or are there different algorithms for video processing versus image processing?
I want to confirm my understanding of this topic: can we recover the motion/different perspectives from a single picture? I think it's very cool, and 3D reconstruction can be applied in many fields, like VR.
The reduction of 3D objects into 2D reminds me of the similar reduction into the visible color space by human eyes. However, it is obviously impossible to trick the brain by restricting human perception to 2D the way color cognition can be tricked. :(
It seems that the introduction of a discriminator network here can be easily connected to GANs. I wonder if we can generate natural body poses by training a generator network against the discriminator.
Presumably yes, but feeding 3D objects to neural networks may not be as efficient or effective as inputting the original 2D images. 3D convolutions can be costly, and we're essentially forcing the structure of the neural network into two separate stages. However, adding a discriminator network for classification could potentially improve the reconstruction network.
Can it also be used for video/motion/object classification?
It seems like people in the background are not detected, although they are quite distinguishable from the background lawn. Is it because they are too small?
I guess prior assumptions about ping pong would definitely help with the accuracy of the model, although this could also be extrapolated from general human movement data.
I am wondering how many ground truth images of different views are generally required to produce a high-quality scene, and if there are possible improvements to reduce it.
For predicting occluded body parts, would it largely depend on the training data? I am wondering: if the training data contains more scenes of people working while standing, will the model be more likely to predict standing instead of sitting?
I am wondering how much biased locations of the annotated joints matter for training, and whether some models have reached sufficient fidelity to help researchers annotate the joints.
If you're curious about deep learning specifics, one of the contributions in the paper which the authors found to be important for getting high-quality results was to apply a positional encoding to each of the 5 inputs to the MLP, very similar to the positional encoding used to provide location information to Transformer networks. I wonder if this trend of using such encodings to better fit high frequency features in the output can be replicated across a variety of domains.
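The encoding itself is tiny; something like this sketch (the exact number of frequencies per input is a detail of the paper, so treat the value below as illustrative):

```python
import numpy as np

# NeRF-style positional encoding of a scalar input x:
# (sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^(L-1) pi x), cos(2^(L-1) pi x))
def positional_encoding(x, num_freqs=10):
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs)])

enc = positional_encoding(np.array([0.5]), num_freqs=4)  # 8 features
```

The exponentially spaced frequencies are what let the MLP fit high-frequency variation that a raw coordinate input struggles to express.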
Is it possible to include depth data somehow? Even a movement lab that doesn't reflect the real world is a good starting point, and the results can be readjusted by humans afterwards.
In fact, the same class also has project instructions on doing this! I highly recommend doing it even if you don't take the class if this sounds interesting. One thing that's really interesting is the use of dynamic programming to tile two "good patches" together.
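For anyone curious, the DP step is short. A hedged sketch of the minimum-error vertical seam through the overlap between two patches (in the spirit of Efros-Freeman image quilting; here `err` stands for the per-pixel squared difference in the overlap region):

```python
import numpy as np

def min_error_seam(err):
    """Return, for each row, the column of the cheapest top-to-bottom
    path through the error surface, moving at most one column per row."""
    h, w = err.shape
    cost = err.astype(float).copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(0, j - 1), min(w, j + 2)
            cost[i, j] += cost[i - 1, lo:hi].min()
    seam = [int(np.argmin(cost[-1]))]          # cheapest end column
    for i in range(h - 2, -1, -1):             # backtrack upwards
        j = seam[-1]
        lo, hi = max(0, j - 1), min(w, j + 2)
        seam.append(lo + int(np.argmin(cost[i, lo:hi])))
    return seam[::-1]

# Overlap where the middle column matches perfectly: the seam follows it.
seam = min_error_seam(np.array([[5.0, 0.0, 5.0],
                                [5.0, 0.0, 5.0],
                                [5.0, 0.0, 5.0]]))
```

Cutting the two patches along this seam hides the boundary much better than a straight-line cut.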
The way additional hardware improves results almost makes me wonder if the current research focus on 2D photographs is a mistake. We've seen and talked about some advanced cameras with infrared tracking that can sense depth information, and I'm sure there are more advanced ones available. Could those provide better features in, say, machine learning?
This is pretty crazy considering modern display pipelines already have ~15ms of lag from signal to render. I wonder what are the current bottlenecks?
Is the distortion/warping a constant "filter"-like transformation that can simply be placed over the scene, or will it differ depending on perspective, position, and viewpoint?
When simulating an environment in VR, how is the scene displayed to the user in a way that makes it seem realistic? Are effects like focus and background blurring applied depending on where the user looks, or is everything in focus?
What are some of the differences in implementation between VR and AR, or are they more or less the same? I'd imagine AR might be a little more difficult since it has to take in real-time input of the world and match its display to the user's head movements, but VR is likely more computationally intensive since it has to simulate everything.
I'm still not super sure -- does increasing the ISO increase or decrease the SNR? (i.e. do we count the magnification from the ISO as more photons?)
I wonder what is usually considered an acceptable amount of latency for VR / what degrees or how many pixels off can the display be before it becomes noticeable or a big problem for users
This article https://www.frontiersin.org/articles/10.3389/frobt.2020.00004/full suggests that the main reason why women are more likely to feel discomfort / specifically motion sickness than men for VR headsets is due to a mismatch between the headset and the wearer's IPD (Inter Pupillary distance). The results say that most VR headsets actually don't have a large enough IPD adjustable range to accommodate the majority of women VR users. Really shows the importance of having a diverse range of testers for a product!!
I love this idea where we can "go into" our paintings. I remember struggling to visualize and draw 3D shapes, but with this, one will be able to walk into the object they draw and view it from different viewpoints!
This makes me wonder if we will be able to do a VR version of an actual live concert in real time. Will it be something similar to YouTube 360° Live Streaming?
Is this also why some of us experience motion sickness while wearing a VR headset? I also wonder if a similar thing happens with AR. It also seems like women are more likely to feel discomfort from this than men :/
So I wonder: is hardware technology one of the major difficulties in producing VR glasses? I understand that theory needs to back up the production, but does it also have very high hardware requirements?
So if people of different ages have different eye focus, do VR glasses need adjustment when used by different people?
I've seen a conceptual advertisement for VR glasses. When you're wearing the glasses and walking down the street, for example, you can see a navigation map in front of you when you say you need the map. And when you want some entertainment, the glasses can play movies right in front of your eyes. I think it's very cool.
It's really interesting to see the history of VR as presented in these slides. I guess we can think of this contraption as 1968's Oculus.
The google cardboard presents the minimum qualities of VR. We can see that there are two different images to emulate stereo viewing and the light from these images is refocused into the eye with the lenses. We also track the direction using the phone's gyro (usually) but don't get to "move" in the scene. We are fixed at one point and can see a 360 degree view from that point only because we don't track the position of the head.
In this function, theta and phi parameterize the direction of the light ray being observed at a point in the scene. Lambda parameterizes the wavelength/color, and t allows for time variation (how the other parameters change over time). Finally, the V variables allow our viewpoint to change, indicating location in the scene.
As noted in an earlier slide, humans are less sensitive to chromaticity than to luminance. We can see this in the downsampled comparison of the luma vs. CbCr channels: there is a stark difference in quality when we downsample the luma channel, but hardly any noticeable difference when downsampling the CbCr channels. It therefore makes sense to compress the CbCr channels to decrease memory usage while preserving perceived quality.
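As a rough sketch of the idea (4:2:0-style subsampling: luma kept at full resolution, chroma at half resolution in each dimension):

```python
import numpy as np

def subsample_420(y, cb, cr):
    # Keep Y at full resolution; keep every other row/column of Cb and Cr.
    return y, cb[::2, ::2], cr[::2, ::2]

y = np.zeros((4, 4)); cb = np.zeros((4, 4)); cr = np.zeros((4, 4))
y2, cb2, cr2 = subsample_420(y, cb, cr)
# 16 + 4 + 4 = 24 samples instead of the original 48: half the data,
# with the loss hidden in the channels we are least sensitive to.
```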
This moire pattern/aliasing occurs because we are only collecting light from one quadrant of each pixel
We use microlenses as seen in the diagram to refocus light so that it hits the photosensitive portion of the pixel. Without the microlens, light rays arriving at the edge of the color filter and passing straight through would hit the circuitry/non-photosensitive portion of the pixel and not contribute to the integration of incoming light.
I believe it's age, and it's saying that as you get older the focal range changes
How would you design VR to not give people eye strain or induce dizziness? I know that a lot of people can't handle looking at screens for a long time, and excessive motion could be sickening
Personally, I've found those 360 youtube videos where you can move your phone around and see the whole scene to be cool, but also kind of annoying, since I always feel like there's something behind me that I'm missing
Adding on to the previous comment, detecting any facial expressions would be important in a realistic simulation for VR conferences like these, to be able to know who is currently speaking and to get a sense of participant reactions around the room.
Could there be a way to form the cardboard lenses so that there's a 3D effect in addition to the display that tracks head position?
This approach reminds me a lot of the JPEG compression lecture since we are taking advantage of human perception. However, in this case, since there are very few cones in our peripheral vision, I would think that we would want to compress even more in terms of color.
I think one potential opportunity to get lower latency is to predict potential movements of the user and then precompute the potential scenes in advance. I think the major challenge with this approach would be how precisely you need to predict the movements such that the precomputed scene is approximately what the user would actually see based on their exact movements.
For this approach of VR imaging, we are trying to capture a real world scene using many cameras to get sufficient resolution at every viewpoint. Would an alternative for scenes including known objects be to create models for each object and place the objects in a virtual scene?
It is interesting how far this is from the plenoptic function (missing all possible locations, not even all angles in 2 axes) and yet we are able to simulate virtual reality.
Google Cardboard is the cheapest VR headset, but it offers an amazing VR experience.
Google Glass is also an AR device, but it only offers a floating screen in the top right corner.
I think there are really fast non-local means methods that let you search based on vector distances/similarities in the features you are querying for?
What happens if edges are purely diagonal? Will the Gx and Gy still capture the right responses?
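A quick numerical check suggests yes: on a purely diagonal step edge, both kernels fire, and the gradient magnitude sqrt(Gx^2 + Gy^2) still peaks on the edge; the response is just split between the two components.

```python
import numpy as np

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
sobel_y = sobel_x.T

# 5x5 image with a diagonal step edge: ones on and above the diagonal.
img = np.triu(np.ones((5, 5)))

def response_at(image, kernel, r, c):
    # Correlation of a 3x3 kernel with the patch centered at (r, c).
    return float((image[r - 1:r + 2, c - 1:c + 2] * kernel).sum())

gx = response_at(img, sobel_x, 2, 2)   # 3.0: horizontal change detected
gy = response_at(img, sobel_y, 2, 2)   # -3.0: vertical change detected
mag = (gx ** 2 + gy ** 2) ** 0.5       # sqrt(18), a strong edge response
```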
https://mitpress.mit.edu/books/complexity-robot-motion-planning this idea won the 1987 ACM dissertation award!
I am wondering what the "years" mean in this context.
VR cameras capture images or video files; some cameras can automatically stitch them together in-camera, while others offer software with which one can stitch the files together.
Just wondering whether stereo vergence will cause eye discomfort in some circumstances?
Oculus has moved to the camera+gyro setup for their Quest line of standalone headsets and stopped selling the camera+marker Rift products. The depth sensing technologies (mostly structured light) mentioned here are also widely deployed into consumer devices including Xbox Kinect, laptops with Windows Hello, and iOS devices with Face ID.
Stereo vergence helps expand the view that reaches our eyes: the overlapping region is the main image that the eyes see, and each individual eye also adds more information about the surrounding world.
This was an interesting slide. The way I thought about it is that this process combines the previous two topics we talked about: bone maneuvering and facial expression. Not only do we have to combine bone movements, but now the flesh around each bone has to be connected with realistic features of the object. I really liked this new model of representation.
Can we apply the absolute transform before the relative transform?
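Generally not without changing the result: transforms compose by matrix multiplication, which doesn't commute. A tiny 2D homogeneous-coordinate check (90° rotation R and translation T by (1, 0), both made up for illustration):

```python
import numpy as np

R = np.array([[0.0, -1.0, 0.0],   # rotate 90 degrees counterclockwise
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([[1.0, 0.0, 1.0],    # translate by (1, 0)
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
p = np.array([1.0, 0.0, 1.0])     # the point (1, 0)

p_rot_then_trans = T @ R @ p      # rotate first, then translate
p_trans_then_rot = R @ T @ p      # translate first, then rotate
```

The two orders land the point in different places, so the order encoded in the transform hierarchy matters.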
A particle is like the simplest, most basic atom of an object.