diff --git a/README.md b/README.md
index cad1abd..cb178a7 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,87 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Daniel McCann
+* Tested on: Windows 10, i7-5700HQ CPU @ 2.70GHz, GeForce GTX 970M, 16 GB RAM
+
+### Overview
+
+![](./renders/raster.gif)
+
+This is a CUDA implementation of the real-time graphics pipeline that APIs like OpenGL, DirectX, and Vulkan normally provide.
+
+3D mesh objects are stored as collections of triangles. Triangle vertices store positions, normals (the surface direction at a point), and other data such as texture coordinates. First, the triangles are transformed to screen space (vertex shader) and their vertex data is transformed to view space for shading. Next, the triangles are rasterized in parallel: they are split into fragments where they overlap pixels on the screen, to be colored later. The fragments are then shaded (fragment shader). Finally, the rendered image is post-processed and color corrected, then uploaded to the screen.
+
+### Extra Features
+
+#### HDR Lighting and Tonemapping
+
+The rasterizer buffers store floats instead of bytes, allowing color values to exceed 1. This means any number of lights can add color to a surface; after the rendering steps are done, the range is compressed back to 0-1 with an exposure function, and the image is then gamma corrected. (This also requires textures to be gamma-uncorrected, which is done manually in this rasterizer.)
+
+The current material is a Blinn-Phong shader with three directional lights.
+
+![](./renders/comparisonGamma.png)
+
+Above: with and without gamma correction. Without gamma correction, diffuse falloff is too gradual and dimmer lights contribute very little to the model.
+
+![](./renders/comparisonExposure.png)
+
+Above: exposure 0.5 and exposure 1.5. The brightness increase is nonlinear; with higher exposure, the second and third light sources are more evident.
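+
+Below is a condensed version of this tonemap pass, as implemented by the `toneMap` kernel in `rasterize.cu` (`GAMMA` and `EXPOSURE` are compile-time defines):
+
+```cuda
+__global__ void toneMap(int w, int h, glm::vec3 *framebuffer, float gamma, float exposure) {
+    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    if (x < w && y < h) {
+        glm::vec3 col = framebuffer[x + y * w];
+        col = glm::vec3(1.0f) - glm::exp(-exposure * col); // exposure function: compress HDR values toward [0, 1)
+        col = glm::pow(col, glm::vec3(1.0f / gamma));      // gamma correction for display
+        framebuffer[x + y * w] = col;
+    }
+}
+```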
+
+#### Bloom
+
+![](./renders/comparisonBloom.png)
+
+Above: the model with and without bloom.
+
+Bloom is a glow effect applied around especially bright areas. Since it requires many per-pixel operations, it is computed at quarter resolution and upscaled with a bilinear filter. The steps are as follows:
+* Apply a high pass filter to the framebuffer. Colors with HDR intensity are written into a new quarter-resolution buffer.
+  * "HDR intensity" means dot(color, color) > 3; in non-HDR color space, the maximum value of that dot product is 3.
+  * Four pixels of the original image are averaged for each pixel of this buffer.
+  * A response curve is used: instead of a hard cutoff, pixels are weighted by a smooth curve based on how far their intensity exceeds 3. Currently a pixel with an intensity of 5 has full effect.
+* Apply a horizontal Gaussian blur to the new buffer.
+* Apply a vertical Gaussian blur to the new buffer. Together the two passes approximate a circular blur.
+* Smoothly upscale the new buffer to full resolution and add it to the original color. Because the buffer was already blurred, upscaling artifacts are minimal.
+
+![](./renders/comparisonBlur.png)
+
+Above: comparison without and with a second blur pass.
+
+#### Perspective Correct Interpolation
+
+Triangle attributes are normally interpolated with barycentric coordinates. But since the triangles have been projected onto the 2D screen plane, plain barycentric weights do not account for depth (the z coordinate) properly, so shading attributes such as view space position, surface orientation, and texture coordinates come out inaccurate. Correcting this only requires a few extra weights based on the z coordinate of each vertex.
+
+#### Bilinear Filtering
+
+Images are stored as arrays of bytes representing a color for each pixel. When sampling a texture at given coordinates, picking only the single nearest pixel can create aliasing and jagged edges, since there is no transition between texture pixels. With bilinear filtering, the renderer samples four neighboring pixels and blends them with distance-based weights (a sketch of the sampling math appears at the end of this README).
+
+![](./renders/comparisonBilinear.png)
+
+Above, you can see the eye texture without and with bilinear filtering. For a negligible performance hit, you get smooth texture transitions.
+
+### Feature Performance
+
+#### Pipeline Stages
+
+![](./renders/chart(1).png)
+
+Rasterization gets considerably slower as the mesh moves closer to the camera (more so than a tile-based approach would), but vertex transform and assembly still seemed to be the slowest stage here.
+
+Post processing is also quite slow. However, its cost does not change when the mesh fills less of the screen, whereas the fragment shader's cost does.
+
+#### Feature FPS
+On the default camera, duck model:
-### (TODO: Your README)
+* All features: 275 FPS
+* 2 Pass Bloom: 248 FPS
+* Disabled Bloom: 386 FPS
+* Disabled Bilinear Filter: 277 FPS
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+Takeaways:
+* The second blur pass costs 27 FPS (275 to 248). Assuming each blur pass costs about the same, the downscaling and upscaling kernels account for roughly 84 of the 111 FPS lost to bloom overall (386 to 275).
+* Disabling the bilinear filter gained only 2 FPS, almost nothing, and the smoother result is essential.
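+
+For reference, here is a condensed sketch of the bilinear texture fetch from `texture2D` in `rasterize.cu` (renamed `sampleBilinear` for clarity). `bytesToRGB` converts three bytes of texture data into a float RGB color; the real function also gamma-uncorrects the sampled color and has a point-sampling fallback when bilinear filtering is disabled, and, like the original, neighbors at the right/bottom texture edge are not clamped.
+
+```cuda
+__device__ glm::vec3 sampleBilinear(int w, int h, const TextureData* tex, glm::vec2 UV) {
+    glm::vec2 uv = glm::mod(UV, glm::vec2(1.0f));   // repeat wrap
+    float xf = floor(uv.x * w), yf = floor(uv.y * h);
+    int x = (int)xf, y = (int)yf;
+    float xw = uv.x * w - xf;                       // fractional offset toward the next texel
+    float yw = uv.y * h - yf;
+    glm::vec3 c00 = bytesToRGB(tex, 3 * (x     +  y      * w));
+    glm::vec3 c01 = bytesToRGB(tex, 3 * (x + 1 +  y      * w));
+    glm::vec3 c10 = bytesToRGB(tex, 3 * (x     + (y + 1) * w));
+    glm::vec3 c11 = bytesToRGB(tex, 3 * (x + 1 + (y + 1) * w));
+    // blend the four neighbors: horizontally first, then vertically
+    return (1.f - yw) * ((1.f - xw) * c00 + xw * c01)
+         +        yw  * ((1.f - xw) * c10 + xw * c11);
+}
+```
+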
### Credits diff --git a/renders/chart(1).png b/renders/chart(1).png new file mode 100644 index 0000000..75032a1 Binary files /dev/null and b/renders/chart(1).png differ diff --git a/renders/comparisonBilinear.png b/renders/comparisonBilinear.png new file mode 100644 index 0000000..e986762 Binary files /dev/null and b/renders/comparisonBilinear.png differ diff --git a/renders/comparisonBloom.png b/renders/comparisonBloom.png new file mode 100644 index 0000000..9ea1c27 Binary files /dev/null and b/renders/comparisonBloom.png differ diff --git a/renders/comparisonBlur.png b/renders/comparisonBlur.png new file mode 100644 index 0000000..ced5e4d Binary files /dev/null and b/renders/comparisonBlur.png differ diff --git a/renders/comparisonExposure.png b/renders/comparisonExposure.png new file mode 100644 index 0000000..bb8340f Binary files /dev/null and b/renders/comparisonExposure.png differ diff --git a/renders/comparisonGamma.png b/renders/comparisonGamma.png new file mode 100644 index 0000000..576920a Binary files /dev/null and b/renders/comparisonGamma.png differ diff --git a/renders/defaultview.PNG b/renders/defaultview.PNG new file mode 100644 index 0000000..00ee4ee Binary files /dev/null and b/renders/defaultview.PNG differ diff --git a/renders/defaultviewNoGamma.PNG b/renders/defaultviewNoGamma.PNG new file mode 100644 index 0000000..6f23f6b Binary files /dev/null and b/renders/defaultviewNoGamma.PNG differ diff --git a/renders/exposure05.PNG b/renders/exposure05.PNG new file mode 100644 index 0000000..ef8720a Binary files /dev/null and b/renders/exposure05.PNG differ diff --git a/renders/exposure15.PNG b/renders/exposure15.PNG new file mode 100644 index 0000000..1179bf2 Binary files /dev/null and b/renders/exposure15.PNG differ diff --git a/renders/extraBlurPass.PNG b/renders/extraBlurPass.PNG new file mode 100644 index 0000000..32ed3c1 Binary files /dev/null and b/renders/extraBlurPass.PNG differ diff --git a/renders/filterArtifacts.PNG b/renders/filterArtifacts.PNG new file mode 100644 index 0000000..41d2997 Binary files /dev/null and b/renders/filterArtifacts.PNG differ diff --git a/renders/filterBilinear.PNG b/renders/filterBilinear.PNG new file mode 100644 index 0000000..498f94e Binary files /dev/null and b/renders/filterBilinear.PNG differ diff --git a/renders/noBilinear.PNG b/renders/noBilinear.PNG new file mode 100644 index 0000000..b09e494 Binary files /dev/null and b/renders/noBilinear.PNG differ diff --git a/renders/noBlurPass.PNG b/renders/noBlurPass.PNG new file mode 100644 index 0000000..6f03bf4 Binary files /dev/null and b/renders/noBlurPass.PNG differ diff --git a/renders/raster.gif b/renders/raster.gif new file mode 100644 index 0000000..8f10926 Binary files /dev/null and b/renders/raster.gif differ diff --git a/src/main.cpp b/src/main.cpp index 7986959..75f5b08 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -107,6 +107,7 @@ void runCuda() { glm::mat4 P = glm::frustum(-scale * ((float)width) / ((float)height), scale * ((float)width / (float)height), -scale, scale, 1.0, 1000.0); + P = glm::perspective(glm::radians(45.0f), (float)width / (float)height, 0.1f, 1000.0f); glm::mat4 V = glm::mat4(1.0f); diff --git a/src/rasterize.cu b/src/rasterize.cu index 1262a09..feaf914 100644 --- a/src/rasterize.cu +++ b/src/rasterize.cu @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -18,6 +19,16 @@ #include #include +/// Features +#define BLOOM 1 +#define BLOOM2PASS 0 +#define BILINEAR 1 +#define USE_TEXTURES 1 + +/// Constant Settings +#define 
GAMMA 2.2f +#define EXPOSURE 1.5f + namespace { typedef unsigned short VertexIndex; @@ -36,18 +47,11 @@ namespace { struct VertexOut { glm::vec4 pos; - - // TODO: add new attributes to your VertexOut - // The attributes listed below might be useful, - // but always feel free to modify on your own - - glm::vec3 eyePos; // eye space position used for shading - glm::vec3 eyeNor; // eye space normal used for shading, cuz normal will go wrong after perspective transformation - // glm::vec3 col; - glm::vec2 texcoord0; - TextureData* dev_diffuseTex = NULL; - // int texWidth, texHeight; - // ... + glm::vec3 eyePos; + glm::vec3 eyeNor; + glm::vec2 texcoord0; + TextureData* dev_diffuseTex = NULL; + int texWidth, texHeight; }; struct Primitive { @@ -57,16 +61,12 @@ namespace { struct Fragment { glm::vec3 color; - - // TODO: add new attributes to your Fragment - // The attributes listed below might be useful, - // but always feel free to modify on your own - - // glm::vec3 eyePos; // eye space position used for shading - // glm::vec3 eyeNor; - // VertexAttributeTexcoord texcoord0; - // TextureData* dev_diffuseTex; - // ... + glm::vec3 eyePos; // eye space position used for shading + glm::vec3 eyeNor; + + glm::vec2 texcoord0; + TextureData* dev_diffuseTex; + int texWidth, texHeight; }; struct PrimitiveDevBufPointers { @@ -93,7 +93,6 @@ namespace { // Vertex Out, vertex used for rasterization, this is changing every frame VertexOut* dev_verticesOut; - // TODO: add more attributes when needed }; } @@ -109,7 +108,16 @@ static Primitive *dev_primitives = NULL; static Fragment *dev_fragmentBuffer = NULL; static glm::vec3 *dev_framebuffer = NULL; -static int * dev_depth = NULL; // you might need this buffer when doing depth test +#if BLOOM + +static glm::vec3 *dev_bloom1 = NULL; +static glm::vec3 *dev_bloom2 = NULL; + +#endif + +static int * dev_depth = NULL; +static int * dev_fragMutex = NULL; + /** * Kernel that writes the image to the OpenGL PBO directly. 
@@ -133,6 +141,182 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) { } } +__global__ +void toneMap(const int w, const int h, glm::vec3 *framebuffer, const float gamma, const float exposure) { + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int index = x + (y * w); + + if (x < w && y < h) { + glm::vec3 col = framebuffer[index]; + //col = glm::pow(col, glm::vec3(1.0f / gamma)); + col = glm::vec3(1.0f) - glm::exp(-exposure * col); + col = glm::pow(col, glm::vec3(1.0f / gamma)); + framebuffer[index] = col; + } +} + +__device__ __host__ +glm::vec3 bytesToRGB(const TextureData* textureData, const int idx) { + return glm::vec3(textureData[idx] / 255.f, textureData[idx + 1] / 255.f, textureData[idx + 2] / 255.f); +} + +// get a texture color, +__device__ __host__ +glm::vec3 texture2D(const int w, const int h, const TextureData* textureData, const glm::vec2 UV) { + glm::vec2 uv = glm::mod(UV, glm::vec2(1.0f)); // repeat UV + + float xf = floor(uv.x * w); + float yf = floor(uv.y * h); + + int x = (int)xf; + int y = (int)yf; + + glm::vec3 col; + + +#if BILINEAR + float xw = uv.x * w - xf; + float yw = uv.y * h - yf; + + glm::vec3 col00, col01, col10, col11; + col00 = bytesToRGB(textureData, 3 * (x + y * w)); + col01 = bytesToRGB(textureData, 3 * (x + 1 + y * w)); + col10 = bytesToRGB(textureData, 3 * (x + (y + 1) * w)); + col11 = bytesToRGB(textureData, 3 * (x + 1 + (y + 1) * w)); + + col = (1.f - yw) * ((1.f - xw) * col00 + xw * col01) + yw * ((1.f - xw) * col10 + xw * col11); +#else + int idx = 3 * (x + y * w); + col = bytesToRGB(textureData, idx); +#endif + + // apply gamma correction + col = glm::pow(col, glm::vec3(GAMMA)); + + return col; +} + +#if BLOOM + +// check for color components above 1, transfer to buffer with half res +__global__ +void bloomHighPass(int wHalf, int hHalf, const glm::vec3 *framebuffer, glm::vec3 *bloombuffer) { + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int bloomIdx = x + y * wHalf; + + if (x < wHalf && y < hHalf) { + glm::vec3 col = glm::vec3(0); + // get avg of 4 px from framebuffer + for (int yOff = 0; yOff <= 1; yOff++) { + for (int xOff = 0; xOff <= 1; xOff++) { + int x2 = 2 * x + xOff; + int y2 = 2 * y + yOff; + + int fbIdx = x2 + y2 * (2 * wHalf); + glm::vec3 fbCol = framebuffer[fbIdx]; + + float intensity = dot(fbCol, fbCol); + intensity -= 3.f; // threshold + intensity *= 0.5f; // stretch response curve + intensity = intensity < 0.f ? 0.f : intensity; // clamp + intensity = intensity > 1.f ? 1.f : intensity; + + intensity = intensity * intensity * (3.f - 2.f * intensity); // smoothstep + + col += 0.25f * intensity * fbCol; + } + } + bloombuffer[bloomIdx] = col; + } +} + +__global__ +void bloomHorizontalGather(int w, int h, const glm::vec3 *bufIn, glm::vec3 *bufOut) { + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int idx = x + y * w; + + if (x < w && y < h) { + float weight[5] = { 0.227027027f, 0.194594595f, 0.121621622f, 0.054054054f, 0.016216216f}; + glm::vec3 col = bufIn[idx] * weight[0]; + for (int i = 1; i < 5; i++) { + int prev = x - i; + int next = x + i; + prev = prev < 0 ? 0 : prev; + next = next >= w ? 
w - 1 : next; + + col += weight[i] * bufIn[prev + y * w]; + col += weight[i] * bufIn[next + y * w]; + } + + bufOut[idx] = col; + } +} + +__global__ +void bloomVerticalGather(int w, int h, const glm::vec3 *bufIn, glm::vec3 *bufOut) { + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int idx = x + y * w; + + if (x < w && y < h) { + float weight[5] = { 0.227027027f, 0.194594595f, 0.121621622f, 0.054054054f, 0.016216216f }; + glm::vec3 col = bufIn[idx] * weight[0]; + for (int i = 1; i < 5; i++) { + int prev = y - i; + int next = y + i; + prev = prev < 0 ? 0 : prev; + next = next >= h ? h - 1 : next; + + col += weight[i] * bufIn[x + prev * w]; + col += weight[i] * bufIn[x + next * w]; + } + + bufOut[idx] = col; + } +} + +__global__ +void bloomComposite(int w, int h, glm::vec3 *framebuffer, const glm::vec3 *bloombuffer) { + // going to bilinear upsample the bloomBuffer to get composite color + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int idx = x + y * w; + + if (x < w && y < h) { + // get 4 samples of bloom buffer and interpolate + // if the current px is odd, it's in latter half of x / y of pixel + float wx = x & 1 ? 0.75f : 0.25f; + float wy = y & 1 ? 0.75f : 0.25f; + + int wb = w / 2; + int hb = h / 2; + int xb = x / 2; + int yb = y / 2; + + // quadrant offset + int x0 = x & 1 ? (xb) : (xb > 0 ? xb - 1 : 0); + int x1 = x & 1 ? (xb >= (wb - 1) ? wb - 1 : xb + 1) : (xb); + + int y0 = y & 1 ? (yb) : (yb > 0 ? yb - 1 : 0); + int y1 = y & 1 ? (yb >= (hb - 1) ? hb - 1 : yb + 1) : (yb); + + glm::vec3 col00, col01, col10, col11; + + col00 = bloombuffer[x0 + y0 * wb]; + col01 = bloombuffer[x1 + y0 * wb]; + col10 = bloombuffer[x0 + y1 * wb]; + col11 = bloombuffer[x1 + y1 * wb]; + + // add the color, HDR is resolved by tone mapping + framebuffer[idx] += wy * (wx * col00 + (1.f - wx) * col01) + (1.f - wy) * (wx * col10 + (1.f - wx) * col11); + } +} + +#endif + /** * Writes fragment colors to the framebuffer */ @@ -143,10 +327,57 @@ void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) { int index = x + (y * w); if (x < w && y < h) { - framebuffer[index] = fragmentBuffer[index].color; + if (glm::length(fragmentBuffer[index].color) < 0.0001f) { + framebuffer[index] = glm::vec3(0); + return; + } + + glm::vec3 lightDir[3] = { + glm::normalize(glm::vec3(1)), + glm::normalize(glm::vec3(-1, -0.1, -0.8)), + glm::normalize(glm::vec3(0, -1, 0)) + }; + + float lightIntensity[3] = { + 1.5f, 0.3f, 0.2f + }; + + glm::vec3 lightCol[3] = { + glm::vec3(1.0f, 0.9f, 0.7f), + glm::vec3(0.8f, 0.9f, 1.0f), + glm::vec3(0.4f, 1.0f, 0.5f) + }; + + glm::vec3 matDiffuse; +#if USE_TEXTURES + if (fragmentBuffer[index].dev_diffuseTex != NULL) { + matDiffuse = texture2D(fragmentBuffer[index].texWidth, fragmentBuffer[index].texHeight, + fragmentBuffer[index].dev_diffuseTex, fragmentBuffer[index].texcoord0); + matDiffuse = glm::max(matDiffuse, glm::vec3(0.05f)); + } + else { + matDiffuse = glm::vec3(0.75f); + } +#else + matDiffuse = glm::vec3(0.75f); +#endif + + // simple blinn phong + glm::vec3 col = glm::vec3(0); + glm::vec3 nor = fragmentBuffer[index].eyeNor; + + for (int i = 0; i < 3; i++) { + glm::vec3 halfVec = glm::normalize(lightDir[i] - glm::normalize(fragmentBuffer[index].eyePos)); + + float lambert = glm::dot(nor, lightDir[i]); + lambert = lambert < 0 ? 0 : lambert; + float blinn = pow(glm::dot(halfVec, nor), 32.0f); + blinn = blinn < 0 ? 
0 : blinn; - // TODO: add your fragment shader code here + col += lightIntensity[i] * lightCol[i] * (glm::vec3(blinn) + matDiffuse * lambert); + } + framebuffer[index] = col; } } @@ -166,6 +397,18 @@ void rasterizeInit(int w, int h) { cudaFree(dev_depth); cudaMalloc(&dev_depth, width * height * sizeof(int)); + cudaFree(dev_fragMutex); + cudaMalloc(&dev_fragMutex, width * height * sizeof(int)); + +#if BLOOM + cudaFree(dev_bloom1); + cudaFree(dev_bloom2); + cudaMalloc(&dev_bloom1, width * height / 4 * sizeof(glm::vec3)); + cudaMalloc(&dev_bloom2, width * height / 4 * sizeof(glm::vec3)); + cudaMemset(dev_bloom1, 0, width * height / 4 * sizeof(glm::vec3)); + cudaMemset(dev_bloom2, 0, width * height / 4 * sizeof(glm::vec3)); +#endif + checkCUDAError("rasterizeInit"); } @@ -183,6 +426,18 @@ void initDepth(int w, int h, int * depth) } +__global__ +void initMutex(int w, int h, int * mutex) { + int x = (blockIdx.x * blockDim.x) + threadIdx.x; + int y = (blockIdx.y * blockDim.y) + threadIdx.y; + + if (x < w && y < h) + { + int index = x + (y * w); + mutex[index] = 0; + } +} + /** * kern function with support for stride to sometimes replace cudaMemcpy * One thread is responsible for copying one component @@ -617,8 +872,6 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) { checkCUDAError("Free BufferView Device Mem"); } - - } @@ -634,14 +887,31 @@ void _vertexTransformAndAssembly( int vid = (blockIdx.x * blockDim.x) + threadIdx.x; if (vid < numVertices) { - // TODO: Apply vertex transformation here + glm::vec4 posIn = glm::vec4(primitive.dev_position[vid], 1.0f); + // Multiply the MVP matrix for each vertex position, this will transform everything into clipping space - // Then divide the pos by its w element to transform into NDC space + glm::vec4 posTransformed = MVP * posIn; + // divide the pos by its w element to transform into NDC space + posTransformed /= posTransformed.w; // Finally transform x and y to viewport space - - // TODO: Apply vertex assembly here - // Assemble all attribute arraies into the primitive array - + posTransformed.x = 0.5f * (posTransformed.x + 1.0f) * width; + posTransformed.y = 0.5f * (-posTransformed.y + 1.0f) * height; + + primitive.dev_verticesOut[vid].pos = posTransformed; // screen position + primitive.dev_verticesOut[vid].eyeNor = glm::normalize(MV_normal * primitive.dev_normal[vid]); + primitive.dev_verticesOut[vid].eyePos = glm::vec3(MV * posIn); // view position for lighting + +#if USE_TEXTURES + if (primitive.dev_diffuseTex != NULL) { + primitive.dev_verticesOut[vid].dev_diffuseTex = primitive.dev_diffuseTex; + primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid]; + primitive.dev_verticesOut[vid].texHeight = primitive.diffuseTexHeight; + primitive.dev_verticesOut[vid].texWidth = primitive.diffuseTexWidth; + } + else { + primitive.dev_verticesOut[vid].dev_diffuseTex = NULL; + } +#endif } } @@ -656,24 +926,86 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_ int iid = (blockIdx.x * blockDim.x) + threadIdx.x; if (iid < numIndices) { + int pid; // id for cur primitives vector + if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) { + pid = iid / (int)primitive.primitiveType; + dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType] + = primitive.dev_verticesOut[primitive.dev_indices[iid]]; + } + } +} +// parallelize rasterization by triangle +__global__ void _rasterizeTriangle(const int numTris, const Primitive* primitives, + Fragment* frags, int* depthBuffer, const int width, 
const int height, int * mutex) { + + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= numTris) return; + + Primitive pri = primitives[idx]; + glm::vec3 tri[3] = { glm::vec3(pri.v[0].pos), glm::vec3(pri.v[1].pos), glm::vec3(pri.v[2].pos) }; + glm::vec3 triNor[3] = { glm::vec3(pri.v[0].eyeNor), glm::vec3(pri.v[1].eyeNor), glm::vec3(pri.v[2].eyeNor) }; + glm::vec3 triPos[3] = { glm::vec3(pri.v[0].eyePos), glm::vec3(pri.v[1].eyePos), glm::vec3(pri.v[2].eyePos) }; + AABB aabb = getAABBForTriangle(tri); + + for (int y = (int) aabb.min.y; y <= (int) aabb.max.y; y++) { + if (y < 0 || y > height) continue; + for (int x = (int) aabb.min.x; x <= (int) aabb.max.x; x++) { + if (x < 0 || x > width) continue; + glm::vec2 pt = glm::vec2(x, y); + int pxIdx = y * width + x; + + glm::vec3 bary = calculateBarycentricCoordinate(tri, pt); + if (!isBarycentricCoordInBounds(bary)) { + //frags[pxIdx].color = glm::vec3(0); + continue; + } - // TODO: uncomment the following code for a start - // This is primitive assembly for triangles - - //int pid; // id for cur primitives vector - //if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) { - // pid = iid / (int)primitive.primitiveType; - // dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType] - // = primitive.dev_verticesOut[primitive.dev_indices[iid]]; - //} - + float zPersp = getZAtCoordinatePersp(bary, tri); + glm::vec3 interNor = glm::normalize(getPerspectiveInterpolatedVector(bary, triNor, tri, zPersp)); + glm::vec3 interPos = getPerspectiveInterpolatedVector(bary, triPos, tri, zPersp); - // TODO: other primitive types (point, line) + int depth = (int)( getZAtCoordinate(bary, tri) * INT_MAX); + + bool isSet; + do { + isSet = (atomicCAS(&mutex[pxIdx], 0, 1) == 0); + if (isSet) { + if (depthBuffer[pxIdx] > depth) { + // replaced fragment with this triangle + frags[pxIdx].color = interNor; + frags[pxIdx].eyeNor = interNor; + frags[pxIdx].eyePos = interPos; + depthBuffer[pxIdx] = depth; + +#if USE_TEXTURES + if (pri.v[0].dev_diffuseTex != NULL) { + glm::vec3 triUV[3] = { + glm::vec3(pri.v[0].texcoord0, 0.f), + glm::vec3(pri.v[1].texcoord0, 0.f), + glm::vec3(pri.v[2].texcoord0, 0.f) + }; + glm::vec2 interUV = glm::vec2(getPerspectiveInterpolatedVector(bary, triUV, tri, zPersp)); + + frags[pxIdx].dev_diffuseTex = pri.v[0].dev_diffuseTex; + frags[pxIdx].texcoord0 = interUV; + frags[pxIdx].texHeight = pri.v[0].texHeight; + frags[pxIdx].texWidth = pri.v[0].texWidth; + } + else { + frags[pxIdx].dev_diffuseTex = NULL; + } + +#endif + } + mutex[pxIdx] = 0; + } + } while (!isSet); + + } } - } - +int ech = 0; /** * Perform rasterization. @@ -684,8 +1016,7 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g dim3 blockCount2d((width - 1) / blockSize2d.x + 1, (height - 1) / blockSize2d.y + 1); - // Execute your rasterization pipeline here - // (See README for rasterization pipeline outline.) 
+ // Vertex Process & primitive assembly { @@ -718,17 +1049,52 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g checkCUDAError("Vertex Processing and Primitive Assembly"); } + cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment)); initDepth << > >(width, height, dev_depth); - - // TODO: rasterize + checkCUDAError("init depth"); + initMutex << < blockCount2d, blockSize2d >> > (width, height, dev_fragMutex); + checkCUDAError("init mutex"); + const int numThreads = 128; + dim3 triBlockCount = (totalNumPrimitives + numThreads - 1) / numThreads; + _rasterizeTriangle << < triBlockCount, numThreads >> > (totalNumPrimitives, dev_primitives, dev_fragmentBuffer, + dev_depth, width, height, dev_fragMutex); + checkCUDAError("rasterize tris"); // Copy depthbuffer colors into framebuffer render << > >(width, height, dev_fragmentBuffer, dev_framebuffer); checkCUDAError("fragment shader"); + + + +#if BLOOM + // make downsampled high pass + dim3 blockDownsampleCount2d((width / 2 - 1) / blockSize2d.x + 1, + (height / 2 - 1) / blockSize2d.y + 1); + + bloomHighPass << < blockDownsampleCount2d, blockSize2d >> > (width / 2, height / 2, dev_framebuffer, dev_bloom1); + + // apply gaussian + bloomHorizontalGather << < blockDownsampleCount2d, blockSize2d >> >(width / 2, height / 2, dev_bloom1, dev_bloom2); + bloomVerticalGather << < blockDownsampleCount2d, blockSize2d >> >(width / 2, height / 2, dev_bloom2, dev_bloom1); + +#if BLOOM2PASS + bloomHorizontalGather << < blockDownsampleCount2d, blockSize2d >> >(width / 2, height / 2, dev_bloom1, dev_bloom2); + bloomVerticalGather << < blockDownsampleCount2d, blockSize2d >> >(width / 2, height / 2, dev_bloom2, dev_bloom1); +#endif + + // upsample and composite + bloomComposite << < blockCount2d, blockSize2d >> > (width, height, dev_framebuffer, dev_bloom1); + +#endif + + // HDR tonemap + toneMap << > >(width, height, dev_framebuffer, GAMMA, EXPOSURE); + checkCUDAError("fragment shader"); + // Copy framebuffer into OpenGL buffer for OpenGL previewing sendImageToPBO<<>>(pbo, width, height, dev_framebuffer); checkCUDAError("copy render result to pbo"); @@ -772,5 +1138,15 @@ void rasterizeFree() { cudaFree(dev_depth); dev_depth = NULL; + cudaFree(dev_fragMutex); + dev_fragMutex = NULL; + +#if BLOOM + cudaFree(dev_bloom1); + dev_bloom1 = NULL; + cudaFree(dev_bloom2); + dev_bloom2 = NULL; +#endif + checkCUDAError("rasterize Free"); } diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h index 46c701e..d9bdab0 100644 --- a/src/rasterizeTools.h +++ b/src/rasterizeTools.h @@ -95,7 +95,29 @@ bool isBarycentricCoordInBounds(const glm::vec3 barycentricCoord) { */ __host__ __device__ static float getZAtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 tri[3]) { - return -(barycentricCoord.x * tri[0].z + return (barycentricCoord.x * tri[0].z + barycentricCoord.y * tri[1].z + barycentricCoord.z * tri[2].z); } + +// Thanks to Jin Kim and Adam Mally for explaining perspective correction +__host__ __device__ static +float getZAtCoordinatePersp(const glm::vec3 bary, const glm::vec3 tri[3]) { + return 1.0f / (bary.x / tri[0].z + bary.y / tri[1].z + bary.z / tri[2].z); +} + +// get a barycentric interpolated vector with perspective correction +__host__ __device__ static +glm::vec3 getPerspectiveInterpolatedVector(const glm::vec3 bary, const glm::vec3 triAttr[3], const glm::vec3 tri[3], const float zCorrect) { + return zCorrect * glm::vec3(triAttr[0] * bary.x / tri[0].z + + triAttr[1] * bary.y / tri[1].z + + triAttr[2] * bary.z / 
tri[2].z); +} + +__host__ __device__ static +glm::vec3 getPerspectiveInterpolatedVector(const glm::vec3 bary, const glm::vec3 triAttr[3], const glm::vec3 tri[3]) { + float zCorrect = getZAtCoordinatePersp(bary, tri); + return zCorrect * glm::vec3(triAttr[0] * bary.x / tri[0].z + + triAttr[1] * bary.y / tri[1].z + + triAttr[2] * bary.z / tri[2].z); +} \ No newline at end of file