Introduction

  • Formal definition: Inducing targeted behavior in an organism by using artificial sensory stimulation, while the organism has little or no awareness of the interference
    • Targeted behavior: A man-made experience
    • Organism: Any organism, not just human
    • Artificial sensory stimulation: One or more senses are “taken over” (at least partially) by the virtual world
    • No awareness: “Fooled” to feel like the real world; sense of presence
    • Music, movies, and paintings can be thought of as “virtual reality” through this definition
  • Defined by Immanuel Kant as the reality in someone’s mind
  • Jaron Lanier also defined a real world (the physical world) and a virtual world (the perceived world)
  • Different terms for VR: Augmented reality (AR), mixed reality (MR), XR, telepresence, teleoperation
  • Open Loop vs. Closed Loop: Open loop systems don’t allow for the user to interact, while closed loop systems do
  • Components
    • Tracking: Input from user, looks at hand, head, body, etc. movements
    • Software: Renders and controls the virtual world
      • Maintains consistency between real world and virtual world
      • Matched zone: walking in the real world maps to walking in the virtual one
    • Display: Outputs the virtual world to the user
    • The computer links all these things together
  • A VR headset uses two different images for your two eyes in order to create the illusion of depth
    • Instead of obscuring all other vision, AR uses see-through or pass-through displays to project virtual objects onto the real world
    • SAR (spatial augmented reality) aims to get rid of wearables (i.e. headsets) and allow seamless merging of the virtual and real worlds
  • Some challenges with VR headsets
    • Vergence: Headsets cannot emulate all aspects of depth; the eyes try to focus on something far away, but the screen stays at a fixed distance, causing discomfort to the user
    • Weber’s law and Stevens’ power law: Users can physically feel a difference depending on the stimulus (a small worked example follows this list)
      • $P=KS^n$, where $K = \frac{\text{Difference Threshold}}{\text{Standard Weight}}$ is the Weber fraction, $P$ is the perceived intensity, and $S$ is the stimulus strength
      • If $n>1$, then we have expansion; if $n<1$, we have compression
        • Electric shocks follow expansion (double the shock is more than double the pain), whereas brightness follows compression (double the light is less than double the brightness)
    • McGurk Effect: If the lip sync and audio are different, you hear something different
  • Early VR displays included stereoscopes, early head-mounted displays (HMDs), and Nintendo’s Virtual Boy
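
A minimal sketch of Stevens’ power law from above, in numpy; the exponents below (3.5 for electric shock, 0.33 for brightness) are commonly cited illustrative values, not ones given in these notes:

```python
import numpy as np

def perceived_intensity(S, K, n):
    """Stevens' power law: P = K * S**n."""
    return K * S**n

S = np.array([1.0, 2.0])                       # a stimulus, then the stimulus doubled
shock = perceived_intensity(S, K=1.0, n=3.5)   # n > 1: expansion
light = perceived_intensity(S, K=1.0, n=0.33)  # n < 1: compression

print(shock[1] / shock[0])   # > 2: double the shock feels more than twice as painful
print(light[1] / light[0])   # < 2: double the light feels less than twice as bright
```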

Rendering

  • Has multiple inputs
    • 3D world: objects, lights, materials, textures
    • Camera location, orientation, FoV
  • Output: 2D image of the world from the camera
  • Graphics Pipeline
    • Modeling: Coordinate system and objects
    • Viewing: Camera/eye, gets rid of objects not being seen
    • Illumination and shading
    • Rasterization: Creating a 2D image from the 3D world
    • Texture mapping
  • Triangle Soup Model
    • Vertices have a number of attributes, such as coordinates, colors, normals
      • Normals define the direction a surface is oriented; a vertex normal can be computed by averaging the normals of the adjacent faces
    • Triangles are defined as objects that connect vertices
  • Techniques
    • Rasterization: Project vertices from 3D onto 2D space and draw triangles between them to represent the polygons; done by the GPU
    • Interpolation: Automatically generating transitions between colors, frames, polygons, etc.
      • Interpolation weights (e.g. barycentric coordinates) blend the colors/normals of the triangle’s vertices
  • Transformations
    • Scaling: Apply a scaling matrix, defined as $S(s_x, s_y, s_z)$ onto a point to transform it
      • Matrix has parameters on the diagonal; can be reversed using the inverse matrix, which is equivalent to $S(1/s_x, 1/s_y, 1/s_z)$
    • Rotation: Apply a rotation matrix which rotates a point about one of the three axes using sine and cosine
    • Translation: Must use a 4D matrix and convert the 3-vector into a 4-vector
      • Homogeneous coordinates: A 3-vector and a 4-vector representing the same point in 3D; append an extra 1
      • All of the previous transformations can be converted into 4D matrices in order to work with the homogenous representation
    • Shearing: Translating an object about two out of the three axes by a value proportional to the third axis; affects shape of the object
  • Can concatenate different transformations onto each other to perform complex operations
    • Most notably, rotating/scaling about a fixed point: translate the point to the origin, perform the transformation, then invert the first translation (see the sketch after this list)
  • An affine transformation is any transformation using a 4x4 matrix where the last row is 0 0 0 1
    • Degree of curve can’t be changed, and parallel lines cannot become intersecting lines
  • In projective transformations, parallel lines can intersect and vice versa; used when rendering using a pinhole camera
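
A minimal numpy sketch of the homogeneous (4x4) transforms described above, including the translate-transform-untranslate trick for operating about a fixed point; the specific matrices and the example point are made up for illustration:

```python
import numpy as np

def scale(sx, sy, sz):
    """4x4 scaling matrix S(sx, sy, sz); its inverse is S(1/sx, 1/sy, 1/sz)."""
    return np.diag([sx, sy, sz, 1.0])

def translate(tx, ty, tz):
    """4x4 translation matrix; translation needs the homogeneous representation."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def rotate_z(theta):
    """4x4 rotation about the z-axis by theta radians (uses sine and cosine)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

# Rotate a point 90 degrees about the z-axis around the fixed point (1, 1, 0):
# translate the fixed point to the origin, rotate, then invert the first translation.
p = np.array([2.0, 1.0, 0.0, 1.0])   # homogeneous point: append an extra 1
M = translate(1, 1, 0) @ rotate_z(np.pi / 2) @ translate(-1, -1, 0)
print(M @ p)                          # -> approximately [1, 2, 0, 1]
```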

Graphics Pipeline

  • Input: Soup of triangles
  • Output: Image from a particular viewpoint, produced in real time
    • The output is put into the frame buffer, which is updated according to the FPS and sent to the monitor
  • Steps (done in the GPU)
    • Vertex Processing: Process the vertices and normals
      • Performs transformations on points and per vertex lighting
      • Transformations include model, view, and projection transforms
    • Rasterization: Convert triangles (defined by the processed vertices) into a set of fragments (pixel-sized pieces)
    • Fragment Processing: Process individual fragments
      • Performs texturing and per fragment lighting
    • Output Merging: Combine fragments into the final 2D image (frame buffer) for the display, resolving depth/occlusion

Vertex Processing

  • Model Transform: Begins by arranging the objects in the world using a model transform
    • Involves scaling, rotation, translation, and shear transformations to populate the world space
  • View Transform: Positions and orients the camera using a view transform
    • Translate the camera to the origin ($T$) and then rotate it appropriately ($R$); final transformation is $M = RT$
  • Projection Transform: Defines properties of the camera (FOV, lens) and projects the 3D space onto the camera using a projection transform
    • Uses gaze direction (shear), FOV, aspect ratio, near plane (image plane), and far plane (cuts off rest of scene) to create the 2D image
    • Displays all objects inside of the view frustum which is a 3D object connecting the near and far planes
      • Must be normalized (to a cube) so conversion to window coordinates is easy
      • Requires shear, scale, and projective transformation to convert
    • Final transformation matrix: ${v_{clip} = M_{proj} \cdot M_{view} \cdot M_{model} \cdot v}$
    • Must clip objects to fit on screen by transforming coordinates again
    • z-coordinate is retained for occlusion
    • To make it look like the camera sees the scene from the right perspective, a perspective projection is applied, which uses translation (to move the camera), shear (to align the look-at/gaze direction), and scaling (to convert the frustum to a cuboid)
  • A Viewport Transform is performed
    • Uses translation to move the near plane to the center of the window and scaling to scale it to the right size
  • TLDR: Takes 3D vertices and puts them on a 2D screen, ensuring that only vertices that are “on-screen” are rendered
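
A minimal numpy sketch of the chain $v_{clip} = M_{proj} \cdot M_{view} \cdot M_{model} \cdot v$; the camera pose, FOV, and near/far planes are arbitrary illustrative values, and the view transform’s rotation is omitted (identity) for brevity:

```python
import numpy as np

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style projection matrix built from FOV, aspect ratio, and near/far planes."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0,  0.0,                          0.0],
        [0.0,        f,    0.0,                          0.0],
        [0.0,        0.0,  (far + near) / (near - far),  2 * far * near / (near - far)],
        [0.0,        0.0, -1.0,                          0.0],
    ])

def view_translation(eye):
    """View transform with the camera already axis-aligned: just translate the eye to the origin."""
    T = np.eye(4)
    T[:3, 3] = -np.asarray(eye)
    return T

M_model = np.eye(4)                              # object already placed in world space
M_view  = view_translation([0.0, 0.0, 5.0])      # camera at z = +5 looking down -z
M_proj  = perspective(90.0, 16 / 9, 0.1, 100.0)

v = np.array([1.0, 1.0, 0.0, 1.0])               # homogeneous vertex
v_clip = M_proj @ M_view @ M_model @ v           # clip-space position
v_ndc  = v_clip[:3] / v_clip[3]                  # perspective divide -> normalized cube
print(v_ndc)                                     # all coordinates in [-1, 1]: vertex is on screen
```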

Rasterizer

  • After clipping and retriangulating, the rasterizer “fills in” the interiors of triangles
    • Main issue: Edges of triangles are non-integer, but pixels on a screen are integer; must interpolate vertex attributes onto the pixels
  • One strategy is to use scanline interpolation which involves sweeping across the 2D plane to determine which pixels fall inside of a triangle and interpolating the pixel accordingly
    • The z-coordinate allows the rasterizer to figure out which shapes take precedence over others via the depth buffer; known as occlusion resolution
  • The rasterizer also performs lighting and shading, which is extremely difficult to model exactly due to the vast number of light sources (direct + indirect illumination)
  • Lighting
    • A simple model would be to remove indirect lighting and to replace with one lighting term
    • Phong illumination/lighting calculates three channels (ambient, diffuse, specular) and combines them to light objects (see the sketch after this list)
      • Requires a material color and a light color for each channel
      • Ambient: viewer-independent, acts as a “background color”, approximates indirect illumination
        • Formula: $m\cdot l$; does not depend on viewer, normals, or light direction
      • Diffuse: light coming off of a surface, relies on the angle of lighting and the normal of the surface, approximates some aspects of direct illumination
        • Formula: $m\cdot l \cdot \max(L\cdot N, 0)$, where $L\cdot N$ is the dot product of the light direction and the surface normal
      • Specular: Reflected light depending on where the viewer is standing, models the shininess of objects, approximates some aspects of direct illumination
        • Formula: $m\cdot l \cdot \max(R\cdot V, 0)^{shininess}$, where $R\cdot V$ is the dot product of the reflection direction and the view direction
        • $shininess$ is an additional parameter required for the specular channel
    • Attenuation is used to model the falloff of light intensity w.r.t. distance
      • Formula for the attenuation coefficient: $\frac{1}{k_c + k_l d + k_q d^2}$, where $d$ is the distance to the light
  • Shading
    • Shading is the actual computation of the color for each pixel/fragment/vertex, whereas lighting gives the model to do so
    • Flat shading: Compute color once per triangle using some model (like Phong lighting)
      • Fast to compute, but looks very unrealistic
    • Gouraud shading: Compute color once per vertex and interpolate these colors across the triangles
      • A little slower than flat shading, but looks a little more realistic
    • Phong shading: Compute color once per fragment which requires interpolation of per-vertex normals
      • Most realistic, but slowest strategy
    • Vertex shading will be done before the rasterizer while fragment shading will be done afterwards
  • Texture mapping is also done in the rasterizer, and the main operation is to map coordinates from the 3D surface (x, y, z) into 2D texture coordinates (u, v)
    • These coordinates are interpolated for each fragment
    • Since texture coordinates are not guaranteed to land on integer texel positions, texture filtering (e.g. bilinear interpolation) is used
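
A minimal numpy sketch of the Phong lighting model and attenuation term above, evaluated at a single point; the material, light, and attenuation values are arbitrary illustrative choices:

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def phong(m_amb, m_dif, m_spec, l_amb, l_dif, l_spec,
          N, L, V, shininess, kc=1.0, kl=0.0, kq=0.0, d=1.0):
    """Phong illumination: ambient + diffuse + specular, with optional attenuation.

    N = surface normal, L = direction to the light, V = direction to the viewer;
    m_* and l_* are per-channel material and light colors."""
    N, L, V = normalize(N), normalize(L), normalize(V)
    R = 2 * np.dot(L, N) * N - L                       # reflection of L about N
    ambient  = m_amb * l_amb
    diffuse  = m_dif * l_dif * max(np.dot(L, N), 0.0)
    specular = m_spec * l_spec * max(np.dot(R, V), 0.0) ** shininess
    attenuation = 1.0 / (kc + kl * d + kq * d * d)     # falloff with distance d
    return ambient + attenuation * (diffuse + specular)

# Example: white light, reddish material, viewer and light both above the surface.
color = phong(m_amb=np.array([0.1, 0.0, 0.0]), m_dif=np.array([0.8, 0.2, 0.2]),
              m_spec=np.array([1.0, 1.0, 1.0]), l_amb=np.ones(3), l_dif=np.ones(3),
              l_spec=np.ones(3), N=[0, 0, 1], L=[0, 1, 1], V=[0, -1, 1],
              shininess=32, kl=0.1, d=2.0)
print(color)
```

Flat, Gouraud, and Phong shading then differ only in where this computation runs: once per triangle, once per vertex, or once per fragment.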

VR Display

  • Various issues with having a high quality display
    • Visual acuity: VR world must be precise, like reality - should be able to see 15 pixels per inch from 20 feet away
      • Varies over different eyes
    • Visual field: Seeing with one eye vs. two eyes means that the FOV should be different
    • Temporal resolution: Video should be smooth in order to not be offputting
    • Depth: Depth is hard to work with; many issues exist like disparity, vergence, accommodation, blur
  • Biology of vision
    • Eye performs low-level processing, brain does high-level
    • Most of the eye’s resolving power is concentrated within a small part of the eye (the fovea)
    • Eye uses cones and rods to process color/light; short, medium, and long wave cones process RGB colors
    • Each eye has its own monocular visual field that it can see
      • The binocular (stereo) visual field is the overlap of the two eyes; allows you to see depth
    • Monocular = periphery, binocular = fovea
      • Can only see color in fovea
    • The total visual field is about 200 degrees; the binocular visual field is about 120 degrees
  • Headsets are limited because there is a minimum distance that people are able to focus at
    • Strategy: Place a lens in front of each eye so the nearby screen can be brought into focus; this increases the FOV but also the weight
  • Visual acuity is difficult to achieve because the photoreceptors capture 1 arcmin of visual angle
    • Leads to requiring a massive amount of pixels; high compute and data requirements
    • The eye will see a certain number of “cycles” in one degree, where there are two pixels per cycle (high and low)
      • This is known as the resolution, more cycles per degree means better resolution
  • Minimum Angle of Resolution (MAR): $\omega = me + \omega_0$ (see the sketch after this list)
    • $\omega$: resolution in degrees per cycle
    • $e$: eccentricity (angular distance from the fovea) in degrees; $m$: slope at which MAR degrades with eccentricity
    • $\omega_0$: smallest resolvable angle in the fovea in degrees per cycle
  • Visual acuity can be calculated as the reciprocal of MAR ($\omega$)
    • Use MAR to accomplish foveated rendering: Split the image into multiple layers (inner/foveal, middle, outer) based on MAR and render them as more/less blurry
      • Less compute since fewer pixels have to be rendered; great speedup
  • Depth can be perceived in both binocular and monocular vision
    • Monocular: Accommodation, retinal blur, motion parallax
    • Binocular: (Con)vergence, disparity
    • Pictorial cues: Shading, perspective, texture
  • Vergence: Convergence of the eyes (driven by the eye muscles) to fixate on a single object
  • Accommodation: Ability of the lens to focus on the fixated object
  • Vergence-Accommodation Conflict: Eyes must focus (accommodate) on the screen a fixed distance in front of them, but the screen “fools” them into verging at a different perceived depth, creating a mismatch between vergence and accommodation; causes much fatigue
  • Motion parallax: As the viewer moves, nearer objects appear to shift faster than farther ones, giving a depth cue
  • Retinal blur: Blurring of objects when eyes focus on something different
    • Blur can be calculated using $c_z = ar\left(\frac{1}{f} - \frac{1}{z}\right)$
      • $r$: Distance to sensor
      • $a$: Aperture, controlled by pupil (such as by squinting)
      • $f$: Focal length, controlled by accommodation
      • $z$: The depth, $d$, that is focused on
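
A minimal sketch plugging illustrative numbers into the two formulas above (MAR as a linear function of eccentricity, and the retinal-blur expression); the slope, foveal MAR, aperture, sensor distance, and focal length are all assumed values, not ones from these notes:

```python
import numpy as np

def mar(e, m=0.02, w0=1.0 / 60.0):
    """Minimum angle of resolution: w = m*e + w0 (w0 = 1 arcmin at the fovea, e in degrees)."""
    return m * e + w0

def acuity(e, **kw):
    """Visual acuity as the reciprocal of MAR."""
    return 1.0 / mar(e, **kw)

def retinal_blur(z, a=4e-3, r=17e-3, f=17e-3):
    """Blur c_z = a*r*(1/f - 1/z) for a point at depth z, aperture a, sensor distance r,
    and focal length f (illustrative values, in meters)."""
    return a * r * (1.0 / f - 1.0 / z)

print(acuity(0.0), acuity(20.0))              # acuity falls off quickly away from the fovea
print(retinal_blur(0.5), retinal_blur(5.0))   # nearer and farther points blur by different amounts
```

Foveated rendering exploits exactly this acuity falloff by rendering the middle and outer layers at progressively lower resolution.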

Stereo Display

  • One way to achieve stereo vision using 2D is through glasses
    • Can use anaglyph, polarization, etc.
    • General strategy: have each eye see a different color/polarization/etc., forcing them to work together to see a 3D image
    • Polarization: Each eye sees a different set of rows/columns (e.g. the right eye sees even rows while the left eye sees odd rows), each with a different polarization
      • Popular example is RealD
      • Inexpensive and makes gaze direction irrelevant
      • Screen must be polarization-preserving, and this strategy loses brightness + resolution
    • Shutter: Each lens opens and closes exactly when the frame changes
      • Somewhat expensive, requires fast display, screen must be synced with glasses
      • Active glasses
    • Chromatic Filters: Uses two projectors to project two different colored images (one for each eye)
      • Somewhat expensive, can’t use in theaters
    • Anaglyph: Render stereo images in different colors so that each eye sees a different image
      • Most inexpensive glasses
      • Has issues with colors
      • To create a full-color anaglyph image, render a left and a right image, then take the red channel from the left and the green + blue channels from the right (red-cyan anaglyph; see the sketch after this list)
  • Parallax: The relative distance of a 3D point projected into the 2 stereo images
    • Positive parallax means point is behind projection plane, zero parallax means point is on the plane, negative means point is in front
    • Must use horizontal parallax where both eyes have the same projection plane (screen), known as off-axis projection
      • Projection previously was only from one viewpoint, but now that there are two, we have to account for the horizontal parallax
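
A minimal numpy sketch of the red-cyan anaglyph construction above; the image sizes and random placeholder data are just for illustration (a real use would render the scene from the two eye positions):

```python
import numpy as np

def red_cyan_anaglyph(left, right):
    """Combine a stereo pair (H x W x 3 RGB arrays) into a red-cyan anaglyph:
    the left image supplies the red channel, the right image supplies green and blue."""
    out = right.copy()
    out[..., 0] = left[..., 0]   # red channel taken from the left eye's view
    return out

left  = np.random.rand(480, 640, 3)   # placeholder "left eye" render
right = np.random.rand(480, 640, 3)   # placeholder "right eye" render
anaglyph = red_cyan_anaglyph(left, right)
```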

HMD Display

  • Basic idea: Have a lens that is a short distance away from a micro display
  • Lens magnifies the virtual image to appear at a realistic distance away from the viewer
  • Must account for pincushion (inwards) or barrel (outwards) distortion created by the lens by applying the opposite distortion
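
A minimal sketch of the pre-distortion idea, assuming a simple radial model for the lens (radius scaled by $1 + k_1 r^2 + k_2 r^4$); the coefficients are made up, and a real headset calibrates the inverse mapping rather than just flipping the sign:

```python
import numpy as np

def radial_distort(x, y, k1, k2=0.0):
    """Apply a radial distortion to normalized, centered image coordinates.

    Positive k1 pushes points outward (pincushion-like); negative k1 pulls them
    inward (barrel-like). The coefficients here are purely illustrative."""
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale

# If the HMD lens adds pincushion distortion, the renderer pre-warps the image
# with a roughly opposite (barrel) distortion so the two approximately cancel.
x, y = 0.5, 0.5
xd, yd = radial_distort(x, y, k1=-0.2)   # software pre-distortion (barrel)
xl, yl = radial_distort(xd, yd, k1=0.2)  # lens distortion (pincushion)
print(xl, yl)                             # close to the original (0.5, 0.5)
```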

Varifocal Display

  • Use an actuator and an eye-tracking camera to move the screen away/towards the viewer based on where they are looking in order to remove the vergence-accomodation issue
    • Instead of an actuator that moves the screen, a tunable lens with varying focus can be employed to change the distance; more expensive but less technically complicated

Sound

  • Sound is the vibration of air particles, and the eardrum is able to hear sound by detecting the vibrations
    • Can be simulated using bone conduction
  • Stereophonic hearing: sound does not reach both ears at the same time; the difference depends on the location of the source, leading to a phase (time) difference and an amplitude difference between the ears
  • Head Related Impulse Response (HRIR): The amplitude of sound heard by each ear, can be modeled using the Dirac delta function (along with some noise due to dampening + scattering to be more realistic)
  • Two types of sound: point sources and ambient sound
    • Point source: Find the HRIR for each ear and play the appropriate sound according to the delta function
    • Ambient sound: Sample where the sound is coming from in multiple places, and play them from those places
      • Think of a living room setup: 6-8 speakers playing sounds to create surround sound
      • Can sample using spherical harmonics
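
A minimal numpy sketch of point-source spatialization using only a per-ear delay and 1/distance falloff; this is a drastically simplified stand-in for a real HRIR (no dampening or scattering), and the ear/source positions are made up:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def point_source_stereo(mono, fs, src, ear_l, ear_r):
    """Spatialize a mono signal for a point source: per-ear delay (distance / speed
    of sound) and 1/distance amplitude falloff. A real system would instead convolve
    the signal with measured HRIRs for each ear."""
    out = []
    for ear in (ear_l, ear_r):
        d = np.linalg.norm(np.asarray(src) - np.asarray(ear))
        delay = int(round(d / SPEED_OF_SOUND * fs))      # delay in samples
        gain = 1.0 / max(d, 1e-6)                        # amplitude falloff
        out.append(np.concatenate([np.zeros(delay), gain * mono]))
    n = max(len(c) for c in out)
    return np.stack([np.pad(c, (0, n - len(c))) for c in out], axis=1)

fs = 44100
mono = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)      # 1 s, 440 Hz tone
stereo = point_source_stereo(mono, fs, src=[2.0, 0.5, 0.0],
                             ear_l=[-0.1, 0.0, 0.0], ear_r=[0.1, 0.0, 0.0])
```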

Tracking

  • Need to track the position and orientation of the head and hands by either using Inertial Measurement Units (IMUs) or Optical Tracking with cameras
  • Tracking requires sensors and angular measurements of yaw, pitch, and roll
  • Vestibular System: Provides a sense of balance and gravity, senses acceleration
  • Head pose has 6 degrees of freedom: 3 for orientation (yaw, pitch, roll) and 3 for position

IMU Tracking

  • IMUs measure angular velocity ($\omega$ in deg/s) with a gyroscope, linear acceleration ($a$ in m/s²) with an accelerometer, and magnetic field strength ($m$ in µT, microtesla) with a magnetometer
    • Accel and gyro measurements will accumulate bias because the noise will compound over time
    • Can remove drift and noise using a filter (e.g. a Kalman filter, or the simpler complementary filter; see the sketch after this list)
      • Gyroscope is accurate in the short term, but will drift; requires high-pass filtering
      • Accelerometer is accurate in the long term, but has noise in the short term; requires low-pass filtering
    • Magnetometer is sensitive to distortions; alternatively, use low-latency hardware that can utilize mechanical, ultrasonic, or magnetic tracking
  • The Head and Neck Model provides more accurate position tracking by orienting everything relative to the base of the neck
    • Handheld controllers can also be tracked using the head neck model using translation with respect to the head
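
A minimal numpy sketch of gyro/accelerometer fusion using a complementary filter, a simpler cousin of the Kalman filter mentioned above: the gyro contribution is effectively high-passed and the accelerometer contribution low-passed. The bias, noise levels, and blending factor are illustrative assumptions:

```python
import numpy as np

def complementary_filter(gyro_rate, accel, dt, alpha=0.98, angle0=0.0):
    """Fuse gyroscope and accelerometer readings into a pitch-angle estimate.

    gyro_rate: angular velocity about one axis (deg/s), one sample per step
    accel:     (ax, ay, az) samples in m/s^2, used to estimate the gravity direction
    alpha near 1 trusts the (high-passed) gyro in the short term while the
    (low-passed) accelerometer corrects long-term drift."""
    angle = angle0
    angles = []
    for w, (ax, ay, az) in zip(gyro_rate, accel):
        gyro_angle = angle + w * dt                                 # integrate the gyro
        accel_angle = np.degrees(np.arctan2(ax, np.hypot(ay, az)))  # gravity-based angle
        angle = alpha * gyro_angle + (1 - alpha) * accel_angle      # blend the two
        angles.append(angle)
    return np.array(angles)

# Toy usage: a stationary IMU whose gyro has a small constant bias.
n, dt = 1000, 0.01
gyro = np.full(n, 0.5) + np.random.randn(n) * 0.1              # 0.5 deg/s bias + noise
accel = np.tile([0.0, 0.0, 9.81], (n, 1)) + np.random.randn(n, 3) * 0.2
pitch = complementary_filter(gyro, accel, dt)
print(pitch[-1])   # stays near 0 instead of drifting toward ~5 degrees
```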

Optical Tracking

  • Strategy: Use cameras to find the coordinates of the head/hands, extrapolating 3D coordinates from 2D camera images
    • Known as the Perspective-n-Point (PnP) problem (see the sketch after this list)
    • $P'=KCP$ projects a 3D point $P$ (in 4D homogeneous coordinates) down to a 2D image point $P'$ (in 3D homogeneous coordinates)
    • $K$ is the camera intrinsic matrix (focal lengths and principal point) and is known
    • $C$ is the extrinsic matrix (the camera’s rotation and translation) and can be found by calibrating the camera
    • (Pseudo) inverse matrices can then be used to recover the 3D point
  • Two types of optical tracking
    • Outside Looking In: Have lights on the headset that are distinguishable (via lights or indices) and track with outside cameras
    • Inside Looking Out: Have cameras on the headset and markers in the outside world; more sophisticated and expensive technique (Apple Vision Pro uses this)
      • Can also be implemented using base stations that sweep the room with IR (infrared) light and track the headset position
    • Apply both techniques to get the most accurate reading
  • Eye tracking techniques
    • Shine a light into the eye; simple, but causes red eye
    • Use camera and Purkinje images to track IR illumination
    • Place electrosensors on eye muscles; very intrusive
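
A minimal numpy sketch of the forward projection $P' = KCP$ that optical tracking has to invert; the intrinsic and extrinsic values are made up for illustration:

```python
import numpy as np

# Intrinsics K (focal lengths and principal point) -- known, e.g. from calibration.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsics C = [R | t]: here the camera is 2 m back along z with no rotation.
C = np.hstack([np.eye(3), np.array([[0.0], [0.0], [2.0]])])   # 3x4

P = np.array([0.5, 0.25, 0.0, 1.0])     # 3D world point, homogeneous (4-vector)
p_hom = K @ C @ P                        # P' = K C P  (3-vector, homogeneous)
u, v = p_hom[:2] / p_hom[2]              # divide by the last coordinate
print(u, v)                              # pixel coordinates of the projected point
```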

Human Factors

  • Illusory motion: The belief that one is moving based on visual cues
    • Also known as vection; used by VR to move the player, but internal human sense (proprioception) can override vection
  • Vector fields are used to plot the motion of points on the screen, and certain vector fields (which represent different angular velocities) can make one feel dizzy
    • e.g. a vector field of all arrows pointing to the right will make one feel like they’re spinning
  • The eye moves very fast; a saccade (rapid movement) lasts about 45 ms, and the eye can rotate up to 900 degrees per second
    • Smooth pursuit is when the eye moves slowly to track an object which reduces motion blur and stabilizes the image
    • Vestibulo-Ocular Reflex (VOR): Fixates on an object as the head rotates
    • Optokinetic Reflex: Watch a close feature (for reference) instead of trying to track something fast
      • Might watch a tree in front of the car instead of the car itself to judge how fast the car is moving
    • These eye movements can lead to fatigue and motion sickness, especially when combined with vection
  • Reducing latency between head movements and scene changes is key to reduce discomfort
  • Frames are discrete, but vision is continuous, so frame skipping leads to desyncs
    • To present frames more smoothly, two frame buffers are swapped (double buffering) so a completed frame can be shown while the next is drawn
    • Predictive tracking, advanced technology, OLED, etc. can also ease this issue
  • The Uncanny Valley provides issues for affinity to virtual environments

Interaction

  • Includes locomotion: the movement of a virtual avatar corresponding to movements in real life
  • Matched Zone: Since the virtual area is typically larger than the real-life area, a matched zone is employed to map the real-life (IRL) area to a small portion of the virtual one
  • Remapping: The mapping of an action IRL to an action in VR (i.e. WASD to move)
    • The greater the mismatch between VR and IRL, the more remapping required
  • Mismatched obstacles can either be dealt with by drawing a virtual boundary on IRL objects (breaks immersion) or freeze rendering when hitting a virtual object (can cause discomfort when moving IRL but not moving in VR)
  • Redirected walking lets people move IRL and feel like they’re moving in VR
    • Subtly change angles and distances (e.g. move 2 m/s IRL and 4 m/s in VR) during saccades/blinks to travel greater distances
    • Redirect users (teleportation, distractors) at the end of the matched zone
    • Take advantage of change blindness by changing things when user isn’t looking; most users won’t notice
  • Can think of a matched zone as a cart that moves through the virtual world using controllers and head rotation
  • Strafing: Allowing lateral motion while speeding up or slowing down in one direction
  • Collision detection (between hand and scene) is expensive since you must compute millions of intersections between triangles
    • Use a bounding volume to roughly see if objects collide; simpler structures let you reject intersections more easily
    • Bounding volumes can be axis-aligned or object oriented
      • Spheres and rectangles have easy computations but are worse fits; shells and hulls have better fits but are harder to compute
    • Axis-aligned boxes are the most common due to their fast computation and relatively good fit
  • To create tight-fitting axis-aligned bounding boxes, engines use an octree data structure (see the sketch after this list)
    • Generate one large bounding box (root node) and split it into 8 equally sized bounding boxes (children nodes); repeat this process for each node until the box is a tight fit
    • Done in preprocessing
    • To check for intersection, check root nodes of objects A and B and recurse down the tree until the intersection of two leaf nodes is found
  • Techniques for comfort
    • Avoid “gorilla arms” (don’t make users hold their arms up for long periods)
    • Select with flashlight as opposed to a laser pointer
    • Don’t force players to continuously press a button; use a state machine (holding vs. not holding)
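
A minimal Python sketch of the broad-phase collision test described above: AABB overlap checks plus a recursive octree descent. The node layout is a simplified assumption (real engines store triangles at the leaves and run exact triangle-triangle tests only on the surviving leaf pairs):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AABB:
    """Axis-aligned bounding box stored as min/max corners."""
    lo: Tuple[float, float, float]
    hi: Tuple[float, float, float]

def aabb_overlap(a: AABB, b: AABB) -> bool:
    """Boxes overlap iff their intervals overlap on every axis -- a handful of
    comparisons, which is why AABBs make cheap rejection tests."""
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(3))

@dataclass
class OctreeNode:
    box: AABB
    children: List["OctreeNode"] = field(default_factory=list)  # empty at the leaves

def octree_intersect(a: OctreeNode, b: OctreeNode) -> bool:
    """Broad-phase test: recurse only where the bounding boxes overlap."""
    if not aabb_overlap(a.box, b.box):
        return False                      # prune this whole subtree pair
    if not a.children and not b.children:
        return True                       # two overlapping leaves: potential collision
    if a.children:                        # descend into whichever node has children
        return any(octree_intersect(c, b) for c in a.children)
    return any(octree_intersect(a, c) for c in b.children)

# Toy usage: two single-node "octrees" whose boxes overlap.
tree_a = OctreeNode(AABB((0, 0, 0), (1, 1, 1)))
tree_b = OctreeNode(AABB((0.5, 0.5, 0.5), (2, 2, 2)))
print(octree_intersect(tree_a, tree_b))   # True
```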