diff --git a/README.md b/README.md
index cad1abd..9980af8 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,129 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Aman Sachan
+* Tested on: Windows 10, i7-7700HQ @ 2.8GHz 32GB, GTX 1070(laptop GPU) 8074MB (Personal Machine: Customized MSI GT62VR 7RE)
+
+[![](readmeImages/CUDARasterizerVimeoLink.png)](https://vimeo.com/238849486)
+
+## Overview
+
+Rasterization (or rasterisation) as defined by wikipedia is the task of taking an image described in a vector graphics format (shapes) and converting it into a raster image (pixels or dots).
+
+In this project, I simulated the rasterization process of a GPU using CUDA kernels. The aim of this project was to learn the graphics pipeline more intimately while also gaining an appreciation for the GPU's existing rasterization capabilities. 
+
+The pipeline I implemented here is a fairly simple one. It consists of:
+	- Vertex input assembly
+	- Vertex shading 
+	- Primitive assembly
+	- Backface culling
+	- Rasterization
+	- Per fragment depth test
+	- Fragment shading
+
+![](readmeImages/pipeline.png)
+
+## Features
+
+### Tile Based Rasterizarion
+
+Tile Based Rasterization is a technique that is commonly seen on low-power devices like mobile phones; it however is gaining popularity and has been [adopted by Nvidia as the defacto rasterization technique](https://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/) since the maxwell architecture.
+
+Tiled rasterization simply cuts up the output image into a grid of 2D tiles that are then dealt with separately. As a preprocess step all the primitives in the scene are binned into different tiles using their bounding boxes. Then, during the actual rasterization stage a separate kernel is launched for each tile that deals with only those primitives that happened to be binned into that tile. And those are pretty much the only major differences that tiled rasterization introduces as compared to a regular scanline implementaion.
+
+![](readmeImages/tileOccupancy.png)
+
+![](readmeImages/TileBased_vs_ScanLine.png)
+
+Performance wise there is an almost 4X increase in the framerate when the window space triangles are distributed over most of the tiles. This is mostly because tile based is more stable in terms of performance whereas there is an exponential drop for regular scanline rasterization.
+If however, all the triangles exist inside a few tiles the technique is pretty useless. Fortunately, in real world applications triangles are pretty evenly distributed and binning them into tiles greatly increases the framerate. Performance for tile based rasterization can be simplified to the time complexity of the numberOfPixels x (numberOfPrimitives/numTiles) assuming we have a uniform distribution of triangles in window space.
+
+### ScanLine Rasterization
+
+ScanLine Rasterization is the most common rasterization technique. Scanline literally scans pixels row-wise to create a picture. This can be optimized by the use of Bounding boxes for each primitive and then only performing scanline inside the primitive. Other basic optimizations include using the line intersection testing to determine the start and end points of every row being evaluated (a triangle cant occupy the entirety of its bounding box, usually a triangle fills close to half of its bounding box). In my scanline implementation, a kernel parallelized over the number of primitives is launched, and then scanline is performed over the bounding box of each primitive.
+
+![](readmeImages/BasicPipelineFeatureComparison.png)
+
+Vertex Shading and Primitive Assembly are taking a lot of time. This might be because of the huge indirection and memory reassignment that is happening in global memory. You are taking a huge chunk of global memory and assigning it to some other giant chunk of global memory. This is not friendly data flow and could possibly be made better if the memory chunks were closer together and more coherent somehow.
+
+Anyway, because Vertex shading and primitive assembly are common to all the features, we can ignore them and just look at the rasterization and shading stages.
+
+![](readmeImages/scanlineStages_rasterizationAndShading.png)
+
+Usually Shading is incredibly expensive but because the most complex shader in use is a lambertian shader, which is simply a dot product, 2 normalizations, a multiplication, and an addition it is not complex at all. This is why the shading stage is about as expensive as the rasterization stage.
+
+Performance of scanline rasterization is pretty good although it usually cannot compete with tile based rendering. My implementation of scanline rasterization was used as a baseline to compare every feature against. However, it is about half as fast as tile based rasterization.
+Performance for scanline rasterization can be simplified to the time complexity of the numberOfPixels x numberOfPrimitives assuming we have a uniform distribution of triangles in window space.
+
+### Texture Mapping
+
+Texture Mapping is a technique that maps a 2D image onto a 3D object; it is also used to define high frequency detail on 3D models. As would be expected it texture mapping only hurts performance. However textures are still worth all the frames they use because the alternative is too computationally intensive and is a nightmare for artists. The overall impact reading textures can be reduced by streamlining data and make it cache coherent. Essentially anything that reduces global memory reads.
+
+### Perspective Correct Interpolation of values
+
+ ![](readmeImages/notPerspectiveCorrect.png)  |  ![](readmeImages/perspectiveCorrect.png) |
+|---|---|
+|  no perspective correct  |  perspective correct |
+
+Transforming positions from View space to NDC space involves perspective division. This can introduce artifacts in textures if it isnt taken into account during color interpolation for fragments. Perspective Correct Interpolation doesn't add much if anything to the computational cost of barycentric interpolation.
+
+### Bilinear Filtering
+
+ ![](readmeImages/inputbilinear.png)  |  ![](readmeImages/bilinear.png) |
+|---|---|
+|  random colors  |  bilinear filtering applied |
+
+Bilinear filtering is a texture filtering method used to smooth textures when displayed larger or smaller than they actually are, which is almost all the time. Bilinear filtering essentially bilinearly interpolates between the 4 pixels surrounding the look up point. It makes images look much much better and not pixelated. However, it does introduces slight artifacts that can be removed with better but more expensive filtering methods such as trilinear filtering.
+
+### Depth Test
+
+A depth test is used to ensure that only the fragments that can be seen by the camera are drawn (assuming there isnt transparency). In practice, this means if there are a thousand triangles in a line behind the camera then only the fragment from the first camera will be drawn or written into. In a GPU rasterizer, the depth test has to be performed atomically or with mutexes. This is because in a parallelised kernel, multiple threads can try and write to the fragment buffer at the same time at the same location. This is a "race condition." There is no guarantee as to which thread will finish writing to it last. To avoid this, I used a mutex that basically acts as a lock. A mutex array contatains a lock for every index of the fragment buffer. Atomics are operations that guarantee serial operation amongst parallel threads. Atomics however dont prevent race conditions between different blocks launched by a kernel. Thus they can produce rare race conditions that show up as a few blinking pixels.
+
+![](readmeImages/depthTest.png)
+
+Performance wise Depth Testing leads to a big hit in the framerate because we have serialized what was once a parallel section of our code.
+
+### Backface Culling
+
+In a 3D scene there are objects that will not be seen by camera because they aren't facing the camera. It is a good idea to simply ignore these triangles as they dont generally add to the final image.
+
+![](readmeImages/PipelineComparison__Breakdown.png)
+
+### Different Primitive Types
+
+Just like OpenGL, my implementation of a rasterizer allows for drawing meshes as lines and also as point clouds.
+
+![](readmeImages/RasterizationByPrimitives.png)
+
+#### Wireframe (Lines)
+
+![](readmeImages/WireframeRasterization.png)
+
+Equivalent to GL_LINES.
+
+#### Point Cloud (Points)
+
+![](readmeImages/PointRasterization.png)
+
+Equivalent to GL_POINTS.
+
+### Shading Models:
+
+Shading models are what sell the illusion that is rasterization. Great shading models can turn a scene into a believable landscape. I have implemented the standard lambertian shading model along with debug shading models.
+
+#### Lambertian
+
+![](readmeImages/Engine.png)
+
+#### Depth Shading
 
-### (TODO: Your README)
+![](readmeImages/depthTest.png)
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+#### Normals
 
+![](readmeImages/NormalsCow.png)
 
 ### Credits
 
diff --git a/gltfs/checkerboard/checkerboard.gltf b/gltfs/checkerboard/checkerboard.gltf
index a333738..28bac84 100644
--- a/gltfs/checkerboard/checkerboard.gltf
+++ b/gltfs/checkerboard/checkerboard.gltf
@@ -99,7 +99,7 @@
   },
   "images": {
     "Checkered": {
-      "uri": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAhwAAAIcCAIAAAAynOArAAAACXBIWXMAAAsTAAALEwEAmpwYAAALjUlEQVR42u3ZMU5jZxuGYY5tJtiWEGnRrGNSjKZmD6np2Uf2geQGiWWkQyILCQVgG5njvxhpVnAX/3GuawevnuLWp2/4559/lsvl2cTNZrO///77zz///P3336d+yx9//PHXX3+tViujGMUoRpmWi4uLxWq1OoFV5vP5zzT++++/U7/l7e1ttVqt12ujGMUoRpmW1Wo1m/oNAPz/EBUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAzGI2m83n86mfcQIn/LLb7YZhMIpRjGKUyXl/fx82m81yuTyBY8ZxfH19PR6PUz/kcDhcXl6en58bxShGMcq0fHx8DCewx0/39/c3NzfjOE79kO12e3d39/j4aBSjGMUo07Lf7xdXV1cvLy8nsMrxeBzH8fPz8wRuGYYTib1RjGKU/9Qo4zj6qAcgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAmcW3b9/e3t6mfsZutzscDrvd7ng8Tv2W7Xb79evX79+/G8UoRjHKtOz3++H5+Xm5XE79kmEYnp6eNpvNbDb5t9f19fXt7e3FxYVRjGIUo0zLly9fFuv1erVaTf2S+Xy+WCweHh6mfsjZ2dmPHz+Wy+V6vTaKUYxilGlZrVb+VADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJAZjGbzebz+dTPOIETftntdsMwGMUoRjHK5Ly/vw+bzWa5XJ7AMeM4vr6+Ho/HqR9yOBwuLy/Pz8+NYhSjGGVaPj4+hhPY46f7+/ubm5txHKd+yHa7vbu7e3x8NIpRjGKUadnv94urq6uXl5cTWOV4PI7j+Pn5eQK3DMOJxN4oRjHKf2qUcRx91AOQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAzOLbt29vb29TP2O32x0Oh91udzwep37Ldrv9+vXr9+/fjWIUoxhlWvb7/fD8/LxcLqd+yTAMT09Pm81mNpv82+v6+vr29vbi4sIoRjGKUably5cvi/V6vVqtpn7JfD5fLBYPDw9TP+Ts7OzHjx/L5XK9XhvFKEYxyrSsVit/KgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgs5jNZvP5fOpnnMAJv+x2u2EYjGIUoxhlct7f34fNZrNcLk/gmHEcX19fj8fj1A85HA6Xl5fn5+dGMYpRjDItHx8fwwns8dP9/f3Nzc04jlM/ZLvd3t3dPT4+GsUoRjHKtOz3+8XV1dXLy8sJrHI8Hsdx/Pz8PIFbhuFEYm8UoxjlPzXKOI4+6gHIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJAZvHt27e3t7epn7Hb7Q6Hw263Ox6PU79lu91+/fr1+/fvRjGKUYwyLfv9fnh+fl4ul1O/ZBiGp6enzWYzm03+7XV9fX17e3txcWEUoxjFKNPy5cuXxXq9Xq1WU79kPp8vFouHh4epH3J2dvbjx4/lcrler41iFKMYZVpWq5U/FQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQWcxms/l8PvUzTuCEX3a73TAMRjGKUYwyOe/v78Nms1kulydwzDiOr6+vx+Nx6occDofLy8vz83OjGMUoRpmWj4+P4QT2+On+/v7m5mYcx6kfst1u7+7uHh8fjWIUoxhlWvb7/eLq6url5eUEVjkej+M4fn5+nsAtw3AisTeKUYzynxplHEcf9QBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgIyoAZEQFgIyoAJARFQAyogJARlQAyIgKABlRASAjKgBkRAWAjKgAkBEVADKiAkBGVADIiAoAGVEBICMqAGREBYCMqACQERUAMqICQEZUAMiICgAZUQEgs7i4uFgul1M/Y7vdfnx8fHx8jOM49Vv2+/1vv/1mFKMYxShTHOV/51nv6/az2UUAAAAASUVORK5CYII="
+      "uri": "checkerboard.jpg"
     }
   },
   "materials": {
diff --git a/gltfs/checkerboard/checkerboard.jpg b/gltfs/checkerboard/checkerboard.jpg
new file mode 100644
index 0000000..64753c4
Binary files /dev/null and b/gltfs/checkerboard/checkerboard.jpg differ
diff --git a/readmeImages/BasicPipelineFeatureComparison.png b/readmeImages/BasicPipelineFeatureComparison.png
new file mode 100644
index 0000000..1eb24b6
Binary files /dev/null and b/readmeImages/BasicPipelineFeatureComparison.png differ
diff --git a/readmeImages/CUDARasterizerVimeoLink.png b/readmeImages/CUDARasterizerVimeoLink.png
new file mode 100644
index 0000000..8a47312
Binary files /dev/null and b/readmeImages/CUDARasterizerVimeoLink.png differ
diff --git a/readmeImages/Engine.png b/readmeImages/Engine.png
new file mode 100644
index 0000000..aa5207b
Binary files /dev/null and b/readmeImages/Engine.png differ
diff --git a/readmeImages/NormalsCow.png b/readmeImages/NormalsCow.png
new file mode 100644
index 0000000..e4f5dc0
Binary files /dev/null and b/readmeImages/NormalsCow.png differ
diff --git a/readmeImages/PipelineComparison.png b/readmeImages/PipelineComparison.png
new file mode 100644
index 0000000..deca42f
Binary files /dev/null and b/readmeImages/PipelineComparison.png differ
diff --git a/readmeImages/PipelineComparison__Breakdown.png b/readmeImages/PipelineComparison__Breakdown.png
new file mode 100644
index 0000000..2d6b56f
Binary files /dev/null and b/readmeImages/PipelineComparison__Breakdown.png differ
diff --git a/readmeImages/PipelineComparison__BreakdownData.png b/readmeImages/PipelineComparison__BreakdownData.png
new file mode 100644
index 0000000..747d039
Binary files /dev/null and b/readmeImages/PipelineComparison__BreakdownData.png differ
diff --git a/readmeImages/PointRasterization.png b/readmeImages/PointRasterization.png
new file mode 100644
index 0000000..e1aa07c
Binary files /dev/null and b/readmeImages/PointRasterization.png differ
diff --git a/readmeImages/RasterizationByPrimitives.png b/readmeImages/RasterizationByPrimitives.png
new file mode 100644
index 0000000..1fc7536
Binary files /dev/null and b/readmeImages/RasterizationByPrimitives.png differ
diff --git a/readmeImages/TileBased_vs_ScanLine.png b/readmeImages/TileBased_vs_ScanLine.png
new file mode 100644
index 0000000..f9cc420
Binary files /dev/null and b/readmeImages/TileBased_vs_ScanLine.png differ
diff --git a/readmeImages/WireframeRasterization.png b/readmeImages/WireframeRasterization.png
new file mode 100644
index 0000000..0853151
Binary files /dev/null and b/readmeImages/WireframeRasterization.png differ
diff --git a/readmeImages/bilinear.png b/readmeImages/bilinear.png
new file mode 100644
index 0000000..9fee092
Binary files /dev/null and b/readmeImages/bilinear.png differ
diff --git a/readmeImages/depthTest.png b/readmeImages/depthTest.png
new file mode 100644
index 0000000..d5c7304
Binary files /dev/null and b/readmeImages/depthTest.png differ
diff --git a/readmeImages/grid_interpolation2d.png b/readmeImages/grid_interpolation2d.png
new file mode 100644
index 0000000..9e70ff5
Binary files /dev/null and b/readmeImages/grid_interpolation2d.png differ
diff --git a/readmeImages/inputbilinear.png b/readmeImages/inputbilinear.png
new file mode 100644
index 0000000..075376d
Binary files /dev/null and b/readmeImages/inputbilinear.png differ
diff --git a/readmeImages/notPerspectiveCorrect.png b/readmeImages/notPerspectiveCorrect.png
new file mode 100644
index 0000000..816c4ad
Binary files /dev/null and b/readmeImages/notPerspectiveCorrect.png differ
diff --git a/readmeImages/perspectiveCorrect.png b/readmeImages/perspectiveCorrect.png
new file mode 100644
index 0000000..ec3d16d
Binary files /dev/null and b/readmeImages/perspectiveCorrect.png differ
diff --git a/readmeImages/pipeline.png b/readmeImages/pipeline.png
new file mode 100644
index 0000000..7cd7876
Binary files /dev/null and b/readmeImages/pipeline.png differ
diff --git a/readmeImages/scanlineStages_rasterizationAndShading.png b/readmeImages/scanlineStages_rasterizationAndShading.png
new file mode 100644
index 0000000..76b0127
Binary files /dev/null and b/readmeImages/scanlineStages_rasterizationAndShading.png differ
diff --git a/readmeImages/tileOccupancy.png b/readmeImages/tileOccupancy.png
new file mode 100644
index 0000000..91c1ed9
Binary files /dev/null and b/readmeImages/tileOccupancy.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..4ead402 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -6,8 +6,6 @@
  * @copyright University of Pennsylvania
  */
 
-
-
 #include "main.hpp"
 
 #define STB_IMAGE_IMPLEMENTATION
@@ -18,8 +16,10 @@
 //-------------MAIN--------------
 //-------------------------------
 
-int main(int argc, char **argv) {
-    if (argc != 2) {
+int main(int argc, char **argv) 
+{
+    if (argc != 2) 
+	{
         cout << "Usage: [gltf file]. Press Enter to exit" << endl;
 		getchar();
         return 0;
@@ -32,19 +32,24 @@ int main(int argc, char **argv) {
 	std::string ext = getFilePathExtension(input_filename);
 
 	bool ret = false;
-	if (ext.compare("glb") == 0) {
+	if (ext.compare("glb") == 0) 
+	{
 		// assume binary glTF.
 		ret = loader.LoadBinaryFromFile(&scene, &err, input_filename.c_str());
-	} else {
+	} 
+	else 
+	{
 		// assume ascii glTF.
 		ret = loader.LoadASCIIFromFile(&scene, &err, input_filename.c_str());
 	}
 
-	if (!err.empty()) {
+	if (!err.empty()) 
+	{
 		printf("Err: %s\n", err.c_str());
 	}
 
-	if (!ret) {
+	if (!ret) 
+	{
 		printf("Failed to parse glTF\n");
 		return -1;
 	}
@@ -55,7 +60,8 @@ int main(int argc, char **argv) {
     fpstracker = 0;
 
     // Launch CUDA/GL
-    if (init(scene)) {
+    if (init(scene)) 
+	{
         // GLFW main loop
         mainLoop();
     }
@@ -63,15 +69,17 @@ int main(int argc, char **argv) {
     return 0;
 }
 
-void mainLoop() {
-    while (!glfwWindowShouldClose(window)) {
+void mainLoop() 
+{
+    while (!glfwWindowShouldClose(window)) 
+	{
         glfwPollEvents();
         runCuda();
 
         time_t seconds2 = time (NULL);
 
-        if (seconds2 - seconds >= 1) {
-
+        if (seconds2 - seconds >= 1) 
+		{
             fps = fpstracker / (seconds2 - seconds);
             fpstracker = 0;
             seconds = seconds2;
@@ -97,23 +105,23 @@ void mainLoop() {
 //---------RUNTIME STUFF---------
 //-------------------------------
 float scale = 1.0f;
-float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
+float x_trans = 0.0f, y_trans = 0.0f, z_trans = -30.0f;
 float x_angle = 0.0f, y_angle = 0.0f;
-void runCuda() {
+void runCuda() 
+{
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
     dptr = NULL;
 
 	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
-		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+									  scale * ((float)width / (float)height),
+									  -scale, scale, 1.0, 1000.0);
 
 	glm::mat4 V = glm::mat4(1.0f);
 
-	glm::mat4 M =
-		glm::translate(glm::vec3(x_trans, y_trans, z_trans))
-		* glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
-		* glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
+	glm::mat4 M = glm::translate(glm::vec3(x_trans, y_trans, z_trans))
+				  * glm::rotate(x_angle, glm::vec3(1.0f, 0.0f, 0.0f))
+				  * glm::rotate(y_angle, glm::vec3(0.0f, 1.0f, 0.0f));
 
 	glm::mat3 MV_normal = glm::transpose(glm::inverse(glm::mat3(V) * glm::mat3(M)));
 	glm::mat4 MV = V * M;
@@ -131,17 +139,20 @@ void runCuda() {
 //----------SETUP STUFF----------
 //-------------------------------
 
-bool init(const tinygltf::Scene & scene) {
+bool init(const tinygltf::Scene & scene) 
+{
     glfwSetErrorCallback(errorCallback);
 
-    if (!glfwInit()) {
+    if (!glfwInit()) 
+	{
         return false;
     }
 
     width = 800;
     height = 800;
     window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
-    if (!window) {
+    if (!window) 
+	{
         glfwTerminate();
         return false;
     }
@@ -150,7 +161,8 @@ bool init(const tinygltf::Scene & scene) {
 
     // Set up GL context
     glewExperimental = GL_TRUE;
-    if (glewInit() != GLEW_OK) {
+    if (glewInit() != GLEW_OK) 
+	{
         return false;
     }
 
@@ -171,16 +183,16 @@ bool init(const tinygltf::Scene & scene) {
 		std::map<std::string, std::vector<std::string> >::const_iterator itEnd(
 			scene.scenes.end());
 
-		for (; it != itEnd; it++) {
-			for (size_t i = 0; i < it->second.size(); i++) {
-				std::cout << it->second[i]
-					<< ((i != (it->second.size() - 1)) ? ", " : "");
+		for (; it != itEnd; it++) 
+		{
+			for (size_t i = 0; i < it->second.size(); i++) 
+			{
+				std::cout << it->second[i] << ((i != (it->second.size() - 1)) ? ", " : "");
 			}
 			std::cout << " ] " << std::endl;
 		}
 	}
 
-
 	rasterizeSetBuffers(scene);
 
     GLuint passthroughProgram;
@@ -192,7 +204,8 @@ bool init(const tinygltf::Scene & scene) {
     return true;
 }
 
-void initPBO() {
+void initPBO() 
+{
     // set up vertex data parameter
     int num_texels = width * height;
     int num_values = num_texels * 4;
@@ -207,20 +220,20 @@ void initPBO() {
     // Allocate data for the buffer. 4-channel 8-bit image
     glBufferData(GL_PIXEL_UNPACK_BUFFER, size_tex_data, NULL, GL_DYNAMIC_COPY);
     cudaGLRegisterBufferObject(pbo);
-
 }
 
-void initCuda() {
+void initCuda() 
+{
     // Use device with highest Gflops/s
     cudaGLSetGLDevice(0);
-
     rasterizeInit(width, height);
 
     // Clean up on program exit
     atexit(cleanupCuda);
 }
 
-void initTextures() {
+void initTextures() 
+{
     glGenTextures(1, &displayImage);
     glBindTexture(GL_TEXTURE_2D, displayImage);
     glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
@@ -229,7 +242,8 @@ void initTextures() {
                   GL_UNSIGNED_BYTE, NULL);
 }
 
-void initVAO(void) {
+void initVAO(void) 
+{
     GLfloat vertices[] = {
         -1.0f, -1.0f,
         1.0f, -1.0f,
@@ -264,13 +278,15 @@ void initVAO(void) {
 }
 
 
-GLuint initShader() {
+GLuint initShader() 
+{
     const char *attribLocations[] = { "Position", "Tex" };
     GLuint program = glslUtility::createDefaultProgram(attribLocations, 2);
     GLint location;
 
     glUseProgram(program);
-    if ((location = glGetUniformLocation(program, "u_image")) != -1) {
+    if ((location = glGetUniformLocation(program, "u_image")) != -1) 
+	{
         glUniform1i(location, 0);
     }
 
@@ -280,18 +296,22 @@ GLuint initShader() {
 //-------------------------------
 //---------CLEANUP STUFF---------
 //-------------------------------
-
-void cleanupCuda() {
-    if (pbo) {
+void cleanupCuda() 
+{
+    if (pbo) 
+	{
         deletePBO(&pbo);
     }
-    if (displayImage) {
+    if (displayImage) 
+	{
         deleteTexture(&displayImage);
     }
 }
 
-void deletePBO(GLuint *pbo) {
-    if (pbo) {
+void deletePBO(GLuint *pbo) 
+{
+    if (pbo) 
+	{
         // unregister this buffer object with CUDA
         cudaGLUnregisterBufferObject(*pbo);
 
@@ -302,12 +322,14 @@ void deletePBO(GLuint *pbo) {
     }
 }
 
-void deleteTexture(GLuint *tex) {
+void deleteTexture(GLuint *tex) 
+{
     glDeleteTextures(1, tex);
     *tex = (GLuint)NULL;
 }
 
-void shut_down(int return_code) {
+void shut_down(int return_code) 
+{
     rasterizeFree();
     cudaDeviceReset();
 #ifdef __APPLE__
@@ -319,13 +341,15 @@ void shut_down(int return_code) {
 //------------------------------
 //-------GLFW CALLBACKS---------
 //------------------------------
-
-void errorCallback(int error, const char *description) {
+void errorCallback(int error, const char *description) 
+{
     fputs(description, stderr);
 }
 
-void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) {
-    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) {
+void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods) 
+{
+    if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) 
+	{
         glfwSetWindowShouldClose(window, GL_TRUE);
     }
 }
@@ -333,7 +357,8 @@ void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods
 //----------------------------
 //----- util -----------------
 //----------------------------
-static std::string getFilePathExtension(const std::string &FileName) {
+static std::string getFilePathExtension(const std::string &FileName) 
+{
 	if (FileName.find_last_of(".") != std::string::npos)
 		return FileName.substr(FileName.find_last_of(".") + 1);
 	return "";
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..5ea8f57 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -11,15 +11,50 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
 #include <thrust/random.h>
+#include <thrust/execution_policy.h>
 #include <util/checkCUDAError.h>
 #include <util/tiny_gltf_loader.h>
 #include "rasterizeTools.h"
 #include "rasterize.h"
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
-
-namespace {
-
+#include <chrono>
+
+static const int DEPTHSCALE = INT_MAX;
+static const int numTilesX = 32;
+static const int numTilesY = 32;
+static const int maxNumTiles = (numTilesX + 1)*(numTilesY + 1);
+
+//--------------------
+//Toggle-able OPTIONS
+//--------------------
+// only use tilebased or scanline not both
+//Tile Based Rasterization -- only does triangular rasterization
+#define TILEBASED 0
+	#define DISPLAY_TILES 0
+
+//Scanline Rasterization
+#define SCANLINE 1
+	#define RASTERIZE_TRIANGLES 1;
+	#define RASTERIZE_LINES 0;
+	#define RASTERIZE_POINTS 0;
+
+//Shading stuff handled in the render function
+#define DISPLAY_DEPTH 0
+#define DISPLAY_NORMAL 0
+#define DISPLAY_ABSNORMAL 0
+#define FRAG_SHADING_LAMBERT 1
+
+//texture stuff
+#define TEXTURE_MAPPING 1
+#define BILINEAR_FILTERING 1
+
+//Depth Testing and Culling
+#define DEPTH_TEST 1
+#define BACKFACE_CULLING 0
+
+namespace 
+{
 	typedef unsigned short VertexIndex;
 	typedef glm::vec3 VertexAttributePosition;
 	typedef glm::vec3 VertexAttributeNormal;
@@ -28,48 +63,49 @@ namespace {
 
 	typedef unsigned char BufferByte;
 
-	enum PrimitiveType{
+	enum PrimitiveType
+	{
 		Point = 1,
 		Line = 2,
 		Triangle = 3
 	};
 
-	struct VertexOut {
-		glm::vec4 pos;
-
-		// TODO: add new attributes to your VertexOut
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		 glm::vec3 eyePos;	// eye space position used for shading
-		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
-		 glm::vec2 texcoord0;
-		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
-		// ...
+	struct VertexOut 
+	{
+		glm::vec4 vPos;
+		glm::vec3 vEyePos;	// eye space position used for shading
+		glm::vec3 vNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
+		glm::vec3 vColor;
+		glm::vec2 texcoord0;
+		TextureData* dev_diffuseTex = NULL;
+		int texWidth, texHeight;
 	};
 
-	struct Primitive {
+	struct Tile {
+		int triIndices[1000]; //indices of the triangles that each pixel in the tile has to check
+							 //limit to 500 triangles in a tile
+	};
+	
+	struct Primitive 
+	{
 		PrimitiveType primitiveType = Triangle;	// C++ 11 init
 		VertexOut v[3];
+		bool tileBuckets[maxNumTiles];
+		bool cull;
 	};
 
-	struct Fragment {
-		glm::vec3 color;
-
-		// TODO: add new attributes to your Fragment
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
-		// ...
+	struct Fragment 
+	{
+		glm::vec3 fColor;
+		glm::vec3 fEyePos;	// eye space position used for shading
+		glm::vec3 fNor;
+		float depth;
+		VertexAttributeTexcoord texcoord0;
+		TextureData* dev_diffuseTex;
 	};
 
-	struct PrimitiveDevBufPointers {
+	struct PrimitiveDevBufPointers 
+	{
 		int primitiveMode;	//from tinygltfloader macro
 		PrimitiveType primitiveType;
 		int numPrimitives;
@@ -95,65 +131,121 @@ namespace {
 
 		// TODO: add more attributes when needed
 	};
-
 }
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
-
 static int width = 0;
 static int height = 0;
 
 static int totalNumPrimitives = 0;
+static int numActivePrimitives = 0;
+
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
-static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+static Tile *dev_tiles = NULL;
+static int* dev_tileTriCount = NULL; //how many triangles have actually filled the list
+static int* dev_tilemutex = NULL;
 
-/**
- * Kernel that writes the image to the OpenGL PBO directly.
- */
+static int * dev_depth = NULL; //depth buffer
+static int * dev_mutex = NULL; //mutex buffer for depth
+
+//------------------------------------------------
+//-------------------Timer------------------------
+using time_point_t = std::chrono::high_resolution_clock::time_point;
+time_point_t timeStartCpu;
+time_point_t timeEndCpu;
+float prevElapsedTime = 0.0f;
+//------------------------------------------------
+
+// Kernel that writes the image to the OpenGL PBO directly.
 __global__ 
-void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
+void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) 
+{
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
     int index = x + (y * w);
 
-    if (x < w && y < h) {
-        glm::vec3 color;
-        color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
-        color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
-        color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+    if (x < w && y < h) 
+	{
+        glm::vec3 fcolor;
+        fcolor.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
+        fcolor.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
+        fcolor.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
         // Each thread writes one pixel location in the texture (textel)
         pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
+        pbo[index].x = fcolor.x;
+        pbo[index].y = fcolor.y;
+        pbo[index].z = fcolor.z;
     }
 }
 
-/** 
-* Writes fragment colors to the framebuffer
-*/
-__global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
+__host__ __device__ 
+glm::vec3 LambertFragShader(glm::vec3 pos, glm::vec3 color, glm::vec3 normal)
+{
+	glm::vec3 lightPosition = glm::vec3(1.0f);
+	glm::vec3 finalColor = color*glm::dot(normal, glm::normalize(lightPosition - pos));
+	return finalColor;
+}
+
+//Writes fragment colors to the framebuffer
+__global__ 
+void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) 
+{
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
     int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
-
-		// TODO: add your fragment shader code here
-
+    if (x < w && y < h) 
+	{
+		#if SCANLINE
+			#if RASTERIZE_TRIANGLES
+				#if DISPLAY_DEPTH
+						framebuffer[index] = glm::vec3(fragmentBuffer[index].depth);
+				#elif DISPLAY_NORMAL
+						framebuffer[index] = fragmentBuffer[index].fNor;
+				#elif DISPLAY_ABSNORMAL
+						framebuffer[index] = glm::abs(fragmentBuffer[index].fNor);
+				#elif FRAG_SHADING_LAMBERT
+						framebuffer[index] = LambertFragShader(fragmentBuffer[index].fEyePos,
+															   fragmentBuffer[index].fColor,
+															   fragmentBuffer[index].fNor);
+				#else
+						framebuffer[index] = fragmentBuffer[index].fColor + 0.15f;
+				#endif
+			#endif
+			#if RASTERIZE_LINES
+					framebuffer[index] = fragmentBuffer[index].fColor + 0.15f;
+			#endif
+			#if RASTERIZE_POINTS
+					framebuffer[index] = fragmentBuffer[index].fColor + 0.15f;
+			#endif
+		#endif
+		#if TILEBASED
+			#if DISPLAY_TILES
+					framebuffer[index] = fragmentBuffer[index].fColor;
+			#elif DISPLAY_DEPTH
+								framebuffer[index] = glm::vec3(fragmentBuffer[index].depth);
+			#elif DISPLAY_NORMAL
+								framebuffer[index] = fragmentBuffer[index].fNor;
+			#elif DISPLAY_ABSNORMAL
+								framebuffer[index] = glm::abs(fragmentBuffer[index].fNor);
+			#elif FRAG_SHADING_LAMBERT
+								framebuffer[index] = LambertFragShader(fragmentBuffer[index].fEyePos,
+									fragmentBuffer[index].fColor,
+									fragmentBuffer[index].fNor);
+			#else
+					framebuffer[index] = fragmentBuffer[index].fColor + 0.15f;
+			#endif
+		#endif
     }
 }
 
-/**
- * Called once at the beginning of the program to allocate memory.
- */
-void rasterizeInit(int w, int h) {
+//Called once at the beginning of the program to allocate memory.
+void rasterizeInit(int w, int h) 
+{
     width = w;
     height = h;
 	cudaFree(dev_fragmentBuffer);
@@ -163,13 +255,26 @@ void rasterizeInit(int w, int h) {
     cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
     cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
     
+	cudaFree(dev_tiles);
+	cudaMalloc(&dev_tiles, maxNumTiles * sizeof(Tile));
+	cudaMemset(dev_tiles, 0, maxNumTiles * sizeof(Tile));
+
+	cudaFree(dev_tileTriCount);
+	cudaMalloc(&dev_tileTriCount, maxNumTiles * sizeof(int));
+
+	cudaFree(dev_tilemutex);
+	cudaMalloc(&dev_tilemutex, maxNumTiles * sizeof(int));
+
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
+	cudaFree(dev_mutex);
+	cudaMalloc(&dev_mutex, width * height * sizeof(int));
+	
 	checkCUDAError("rasterizeInit");
 }
 
-__global__
+__global__ 
 void initDepth(int w, int h, int * depth)
 {
 	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
@@ -182,14 +287,22 @@ void initDepth(int w, int h, int * depth)
 	}
 }
 
+__global__
+void initCullValue(int numPrimitives, Primitive * dev_primitive)
+{
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < numPrimitives)
+	{
+		dev_primitive[index].cull = false;
+	}
+}
 
-/**
-* kern function with support for stride to sometimes replace cudaMemcpy
-* One thread is responsible for copying one component
-*/
+//kern function with support for stride to sometimes replace cudaMemcpy
+//One thread is responsible for copying one component
 __global__ 
-void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset, int componentTypeByteSize) {
-	
+void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, int n, int byteStride, int byteOffset, int componentTypeByteSize) 
+{
 	// Attribute (vec3 position)
 	// component (3 * float)
 	// byte (4 * byte)
@@ -197,12 +310,13 @@ void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, in
 	// id of component
 	int i = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-	if (i < N) {
+	if (i < N) 
+	{
 		int count = i / n;
 		int offset = i - count * n;	// which component of the attribute
 
-		for (int j = 0; j < componentTypeByteSize; j++) {
-			
+		for (int j = 0; j < componentTypeByteSize; j++) 
+		{
 			dev_dst[count * componentTypeByteSize * n 
 				+ offset * componentTypeByteSize 
 				+ j]
@@ -215,48 +329,51 @@ void _deviceBufferCopy(int N, BufferByte* dev_dst, const BufferByte* dev_src, in
 				+ j];
 		}
 	}
-	
-
 }
 
 __global__
-void _nodeMatrixTransform(
-	int numVertices,
-	VertexAttributePosition* position,
-	VertexAttributeNormal* normal,
-	glm::mat4 MV, glm::mat3 MV_normal) {
-
+void _nodeMatrixTransform( int numVertices,
+						   VertexAttributePosition* position,
+						   VertexAttributeNormal* normal,
+						   glm::mat4 MV, glm::mat3 MV_normal) 
+{
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
+	if (vid < numVertices) 
+	{
 		position[vid] = glm::vec3(MV * glm::vec4(position[vid], 1.0f));
 		normal[vid] = glm::normalize(MV_normal * normal[vid]);
 	}
 }
 
-glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
-	
+glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) 
+{
 	glm::mat4 curMatrix(1.0);
 
 	const std::vector<double> &m = n.matrix;
-	if (m.size() > 0) {
+	if (m.size() > 0) 
+	{
 		// matrix, copy it
-
-		for (int i = 0; i < 4; i++) {
-			for (int j = 0; j < 4; j++) {
+		for (int i = 0; i < 4; i++) 
+		{
+			for (int j = 0; j < 4; j++) 
+			{
 				curMatrix[i][j] = (float)m.at(4 * i + j);
 			}
 		}
-	} else {
+	} 
+	else 
+	{
 		// no matrix, use rotation, scale, translation
-
-		if (n.translation.size() > 0) {
+		if (n.translation.size() > 0) 
+		{
 			curMatrix[3][0] = n.translation[0];
 			curMatrix[3][1] = n.translation[1];
 			curMatrix[3][2] = n.translation[2];
 		}
 
-		if (n.rotation.size() > 0) {
+		if (n.rotation.size() > 0) 
+		{
 			glm::mat4 R;
 			glm::quat q;
 			q[0] = n.rotation[0];
@@ -267,7 +384,8 @@ glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
 			curMatrix = curMatrix * R;
 		}
 
-		if (n.scale.size() > 0) {
+		if (n.scale.size() > 0) 
+		{
 			curMatrix = curMatrix * glm::scale(glm::vec3(n.scale[0], n.scale[1], n.scale[2]));
 		}
 	}
@@ -275,12 +393,10 @@ glm::mat4 getMatrixFromNodeMatrixVector(const tinygltf::Node & n) {
 	return curMatrix;
 }
 
-void traverseNode (
-	std::map<std::string, glm::mat4> & n2m,
-	const tinygltf::Scene & scene,
-	const std::string & nodeString,
-	const glm::mat4 & parentMatrix
-	) 
+void traverseNode (	std::map<std::string, glm::mat4> & n2m,
+					const tinygltf::Scene & scene,
+					const std::string & nodeString,
+					const glm::mat4 & parentMatrix ) 
 {
 	const tinygltf::Node & n = scene.nodes.at(nodeString);
 	glm::mat4 M = parentMatrix * getMatrixFromNodeMatrixVector(n);
@@ -289,13 +405,14 @@ void traverseNode (
 	auto it = n.children.begin();
 	auto itEnd = n.children.end();
 
-	for (; it != itEnd; ++it) {
+	for (; it != itEnd; ++it) 
+	{
 		traverseNode(n2m, scene, *it, M);
 	}
 }
 
-void rasterizeSetBuffers(const tinygltf::Scene & scene) {
-
+void rasterizeSetBuffers(const tinygltf::Scene & scene) 
+{
 	totalNumPrimitives = 0;
 
 	std::map<std::string, BufferByte*> bufferViewDevPointers;
@@ -307,10 +424,12 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 		std::map<std::string, tinygltf::BufferView>::const_iterator itEnd(
 			scene.bufferViews.end());
 
-		for (; it != itEnd; it++) {
+		for (; it != itEnd; it++) 
+		{
 			const std::string key = it->first;
 			const tinygltf::BufferView &bufferView = it->second;
-			if (bufferView.target == 0) {
+			if (bufferView.target == 0) 
+			{
 				continue; // Unsupported bufferView.
 			}
 
@@ -323,18 +442,14 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 			checkCUDAError("Set BufferView Device Mem");
 
 			bufferViewDevPointers.insert(std::make_pair(key, dev_bufferView));
-
 		}
 	}
 
-
-
 	// 2. for each mesh: 
 	//		for each primitive: 
 	//			build device buffer of indices, materail, and each attributes
 	//			and store these pointers in a map
 	{
-
 		std::map<std::string, glm::mat4> nodeString2Matrix;
 		auto rootNodeNamesList = scene.scenes.at(scene.defaultScene);
 
@@ -346,7 +461,6 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 			}
 		}
 
-
 		// parse through node to access mesh
 
 		auto itNode = nodeString2Matrix.begin();
@@ -395,14 +509,13 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 					dim3 numThreadsPerBlock(128);
 					dim3 numBlocks((numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 					cudaMalloc(&dev_indices, byteLength);
-					_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-						numIndices,
-						(BufferByte*)dev_indices,
-						dev_bufferView,
-						n,
-						indexAccessor.byteStride,
-						indexAccessor.byteOffset,
-						componentTypeByteSize);
+					_deviceBufferCopy <<<numBlocks, numThreadsPerBlock>>> ( numIndices,
+																			(BufferByte*)dev_indices,
+																			dev_bufferView,
+																			n,
+																			indexAccessor.byteStride,
+																			indexAccessor.byteOffset,
+																			componentTypeByteSize );
 
 
 					checkCUDAError("Set Index Buffer");
@@ -451,7 +564,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 					int numVertices = 0;
 					// for each attribute
-					for (; it != itEnd; it++) {
+					for (; it != itEnd; it++) 
+					{
 						const tinygltf::Accessor &accessor = scene.accessors.at(it->second);
 						const tinygltf::BufferView &bufferView = scene.bufferViews.at(accessor.bufferView);
 
@@ -498,14 +612,13 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 						int byteLength = numVertices * n * componentTypeByteSize;
 						cudaMalloc(dev_attribute, byteLength);
 
-						_deviceBufferCopy << <numBlocks, numThreadsPerBlock >> > (
-							n * numVertices,
-							*dev_attribute,
-							dev_bufferView,
-							n,
-							accessor.byteStride,
-							accessor.byteOffset,
-							componentTypeByteSize);
+						_deviceBufferCopy <<<numBlocks, numThreadsPerBlock>>> ( n * numVertices,
+																				*dev_attribute,
+																				dev_bufferView,
+																				n,
+																				accessor.byteStride,
+																				accessor.byteOffset,
+																				componentTypeByteSize);
 
 						std::string msg = "Set Attribute Buffer: " + it->first;
 						checkCUDAError(msg.c_str());
@@ -597,7 +710,8 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 	}
 	
 
-	// 3. Malloc for dev_primitives
+	// 3. Malloc for dev_primitives and dev_tiles(do it here instead of 
+	//memory management on a per frame basis)
 	{
 		cudaMalloc(&dev_primitives, totalNumPrimitives * sizeof(Primitive));
 	}
@@ -617,77 +731,589 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 		checkCUDAError("Free BufferView Device Mem");
 	}
-
-
 }
 
-
-
 __global__ 
-void _vertexTransformAndAssembly(
-	int numVertices, 
-	PrimitiveDevBufPointers primitive, 
-	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
-
+void _vertexTransformAndAssembly( int numVertices, PrimitiveDevBufPointers primitive, 
+								  glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
+								  int width, int height ) 
+{
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
-
-		// TODO: Apply vertex transformation here
+	if (vid < numVertices) 
+	{
+		//---------------------------------------------------
+		//-------------- Vertex Transformation --------------
+		//---------------------------------------------------
 		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
-
-		// TODO: Apply vertex assembly here
-		// Assemble all attribute arraies into the primitive array
+		glm::vec4 vPos = glm::vec4(primitive.dev_position[vid], 1.0f);
+		glm::vec4 eyePos = MV*vPos;
+		vPos = MVP*vPos; //now things are in clip space
+		vPos /= vPos.w; //now things are in NDC space
+		vPos.x = (vPos.x + 1.0f)*float(width)*0.5f;
+		vPos.y = (1.0f - vPos.y)*float(height)*0.5f; //now in pixel space or window coordinates
 		
+		vPos.z = (vPos.z+1.0f)*0.5f; // to convert z from a 1 to -1 range to a 0 to 1 range
+
+		glm::vec3 vNor = primitive.dev_normal[vid];
+		vNor = glm::normalize(MV_normal*vNor);
+		//---------------------------------------------------
+		//-------------- Vertex assembly --------------------
+		//---------------------------------------------------
+		// Assemble all attribute arrays into the primitive array
+		primitive.dev_verticesOut[vid].vPos = vPos;
+		primitive.dev_verticesOut[vid].vNor = vNor;
+		primitive.dev_verticesOut[vid].vEyePos = glm::vec3(eyePos);
+		primitive.dev_verticesOut[vid].vColor = glm::vec3(0,1,0);
+
+		// Texture Mapping
+		if (primitive.dev_diffuseTex == NULL) 
+		{
+			primitive.dev_verticesOut[vid].dev_diffuseTex = NULL;
+		}
+		else 
+		{
+			primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
+			primitive.dev_verticesOut[vid].dev_diffuseTex = primitive.dev_diffuseTex;
+			primitive.dev_verticesOut[vid].texWidth = primitive.diffuseTexWidth;
+			primitive.dev_verticesOut[vid].texHeight = primitive.diffuseTexHeight;
+		}
 	}
 }
 
-
-
 static int curPrimitiveBeginId = 0;
 
 __global__ 
-void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
-
+void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, 
+						Primitive* dev_primitives, PrimitiveDevBufPointers primitive) 
+{
 	// index id
 	int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-	if (iid < numIndices) {
-
-		// TODO: uncomment the following code for a start
+	if (iid < numIndices) 
+	{
 		// This is primitive assembly for triangles
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) 
+		{
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		// TODO: other primitive types (point, line)
+	}
+}
 
+__host__ __device__ static
+glm::vec3 getTextureColorAt(const TextureData* texture, const int& textureWidth, int& u, int& v)
+{
+	int flatIndex = (u + v * textureWidth) * 3;
+	float r = (float)texture[flatIndex] / 255.0f; //flatIndex * 3 --> because 3 color channels
+	float g = (float)texture[flatIndex + 1] / 255.0f;
+	float b = (float)texture[flatIndex + 2] / 255.0f;
+	return glm::vec3(r, g, b);
+}
 
-		// TODO: other primitive types (point, line)
+__host__ __device__ static
+glm::vec3 getBilinearFilteredColor(const TextureData* tex,
+								   const int &texWidth, const int &texHeight,
+								   const float &u, const float &v)
+{
+	//references: 
+	//https://en.wikipedia.org/wiki/Bilinear_filtering
+	//https://www.scratchapixel.com/lessons/mathematics-physics-for-computer-graphics/interpolation/bilinear-filtering
+	float x = u * (float)texWidth;
+	float y = v * (float)texHeight;
+	float floorX = glm::floor(x);
+	float floorY = glm::floor(y);
+	float deltaX = x - floorX;
+	float deltaY = y - floorY;
+
+	//get the square for which we will perform bilinear interpolation
+	int xPos = (int)floorX;
+	int yPos = (int)floorY;
+	int xPlusOne = glm::clamp(xPos + 1, 0, texWidth - 1);
+	int yPlusOne = glm::clamp(yPos + 1, 0, texHeight - 1);
+
+	//get 4 color values
+	glm::vec3 c00 = getTextureColorAt(tex, texWidth, xPos, yPos);
+	glm::vec3 c10 = getTextureColorAt(tex, texWidth, xPlusOne, yPos);
+	glm::vec3 c01 = getTextureColorAt(tex, texWidth, xPos, yPlusOne);
+	glm::vec3 c11 = getTextureColorAt(tex, texWidth, xPlusOne, yPlusOne);
+
+	//bilinear interpolation between the above 4 colors
+	glm::vec3 c20 = glm::mix(c00, c10, deltaX);
+	glm::vec3 c21 = glm::mix(c01, c11, deltaX);
+	return glm::mix(c20, c21, deltaY);
+}
+
+__host__ __device__ 
+void modifyFragment(Primitive* dev_primitives, Fragment* dev_fragments, 
+					int* dev_depthBuffer, float& z,
+					glm::vec3 tri[3], glm::vec3 baryCoords,
+					int& index, int& fragIndex)
+{
+	glm::vec3 v0eyePos = dev_primitives[index].v[0].vEyePos;
+	glm::vec3 v1eyePos = dev_primitives[index].v[1].vEyePos;
+	glm::vec3 v2eyePos = dev_primitives[index].v[2].vEyePos;
+
+	//for perspective correct interpolation you need the z values
+	float z1 = v0eyePos.z;
+	float z2 = v1eyePos.z;
+	float z3 = v2eyePos.z;
+	float perpectiveCorrectZ = 1.0f/(baryCoords.x / v0eyePos.z + 
+									 baryCoords.y / v1eyePos.z + 
+									 baryCoords.z / v2eyePos.z );
+
+	glm::vec3 v0color = dev_primitives[index].v[0].vColor;
+	glm::vec3 v1color = dev_primitives[index].v[1].vColor;
+	glm::vec3 v2color = dev_primitives[index].v[2].vColor;
+
+	glm::vec3 v0Nor = dev_primitives[index].v[0].vNor;
+	glm::vec3 v1Nor = dev_primitives[index].v[1].vNor;
+	glm::vec3 v2Nor = dev_primitives[index].v[2].vNor;
+
+	glm::vec2 v0UV = dev_primitives[index].v[0].texcoord0;
+	glm::vec2 v1UV = dev_primitives[index].v[1].texcoord0;
+	glm::vec2 v2UV = dev_primitives[index].v[2].texcoord0;
+
+	TextureData* triangleDiffuseTex = dev_primitives[index].v[0].dev_diffuseTex;
+
+	//if testing Depth coloration
+	dev_fragments[fragIndex].dev_diffuseTex = triangleDiffuseTex;
+	dev_fragments[fragIndex].depth = dev_depthBuffer[fragIndex]/float(DEPTHSCALE);
+	dev_fragments[fragIndex].fNor = perpectiveCorrectZ*((v0Nor / z1)*baryCoords.x +
+														(v1Nor / z2)*baryCoords.y + 
+														(v2Nor / z3)*baryCoords.z );
+	dev_fragments[fragIndex].texcoord0 = perpectiveCorrectZ*((v0UV / z1)*baryCoords.x +
+															 (v0UV / z2)*baryCoords.y +
+															 (v0UV / z3)*baryCoords.z);
+
+	if (TEXTURE_MAPPING && dev_fragments[fragIndex].dev_diffuseTex != NULL)
+	{
+#if BILINEAR_FILTERING
+		dev_fragments[fragIndex].fColor = getBilinearFilteredColor(dev_fragments[fragIndex].dev_diffuseTex,
+																   dev_primitives[index].v[0].texWidth,
+																   dev_primitives[index].v[0].texHeight,
+																   dev_fragments[fragIndex].texcoord0[0],
+																   dev_fragments[fragIndex].texcoord0[1]);
+#else
+		int u = dev_fragments[fragIndex].texcoord0[0] * dev_primitives[index].v[0].texWidth;
+		int v = dev_fragments[fragIndex].texcoord0[1] * dev_primitives[index].v[0].texHeight;
+		dev_fragments[fragIndex].fColor = getTextureColorAt(dev_fragments[fragIndex].dev_diffuseTex,
+															dev_primitives[index].v[0].texWidth, u, v);
+#endif
 	}
+	else
+	{
+		dev_fragments[fragIndex].fColor = perpectiveCorrectZ*((v0color / z1)*baryCoords.x +
+															  (v1color / z2)*baryCoords.y +
+															  (v2color / z3)*baryCoords.z);
+	}
+
+	//to make the normals follow convention:
+	//z is positive coming out of the screen
+	//x is positive to the right
+	//y is positive going up
+	dev_fragments[fragIndex].fNor.x *= -1.0f;
+
+	//clamp color and normals values
+	dev_fragments[fragIndex].fNor.x = glm::clamp(dev_fragments[fragIndex].fNor.x, 0.0f, 1.0f);
+	dev_fragments[fragIndex].fNor.y = glm::clamp(dev_fragments[fragIndex].fNor.y, 0.0f, 1.0f);
+	dev_fragments[fragIndex].fNor.z = glm::clamp(dev_fragments[fragIndex].fNor.z, 0.0f, 1.0f);
+
+	dev_fragments[fragIndex].fColor.x = glm::clamp(dev_fragments[fragIndex].fColor.x, 0.0f, 1.0f);
+	dev_fragments[fragIndex].fColor.y = glm::clamp(dev_fragments[fragIndex].fColor.y, 0.0f, 1.0f);
+	dev_fragments[fragIndex].fColor.z = glm::clamp(dev_fragments[fragIndex].fColor.z, 0.0f, 1.0f);
+}
+
+__device__ 
+void DepthTest(Primitive* dev_primitives, Fragment* dev_fragments,
+			   int* dev_depthBuffer, int * dev_mutex, float& z,
+			   glm::vec3 tri[3], glm::vec3 baryCoords,
+			   int& index, int& fragIndex)
+{
+	//multiplying z value by a large static int because atomicCAS is only defined for ints
+	//and atomicCAS is needed to handle race conditions along with the mutex lock
+	int scaledZ = z*DEPTHSCALE;
+
+	bool isSet;
+	do
+	{
+		isSet = (atomicCAS(&dev_mutex[fragIndex], 0, 1) == 0);
+		if (isSet)
+		{
+			// Critical section goes here.
+			// if it is afterward, a deadlock will occur.
+			if (scaledZ < dev_depthBuffer[fragIndex])
+			{
+				dev_depthBuffer[fragIndex] = scaledZ;
+				modifyFragment(dev_primitives, dev_fragments, dev_depthBuffer, z,
+									tri, baryCoords, index, fragIndex);
+			}
+
+			dev_mutex[fragIndex] = 0;
+		}
+	} while (!isSet);
+}
+
+struct predicate_PrimitiveCulling
+{
+	__host__ __device__ bool operator()(const Primitive &x)
+	{
+		return (x.cull);
+	}
+};
+
+__global__
+void identifyBackFaces(const int numActivePrimitives, Primitive* prims, const glm::vec3 camForward)
+{
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
 	
+	if (index < numActivePrimitives)
+	{
+		//check if the normal of the triangle face the camera or not
+		glm::vec3 p1 = glm::vec3(prims[index].v[0].vPos);
+		glm::vec3 p2 = glm::vec3(prims[index].v[1].vPos);
+		glm::vec3 p3 = glm::vec3(prims[index].v[2].vPos);
+
+		glm::vec3 triangleNormal = glm::cross(p1 - p2, p2 - p3);
+		float dot = glm::dot(triangleNormal, camForward);
+
+		if (dot < 0.0f)
+		{
+			//cull this triangle
+			prims[index].cull = true;
+		}
+	}
 }
 
+void BackFaceCulling(int& numActivePrimitives, Primitive* dev_primitives, glm::vec3& camForward)
+{
+	dim3 numThreadsPerBlock(128);
+	dim3 blockSize1d((numActivePrimitives - 1) / numThreadsPerBlock.x + 1);
+
+	initCullValue <<<blockSize1d, numThreadsPerBlock>>> (numActivePrimitives, dev_primitives);
+
+	//identify and mark the triangles to be culled
+	identifyBackFaces <<<blockSize1d, numThreadsPerBlock>>> (numActivePrimitives, dev_primitives, camForward);
+	checkCUDAError("face identification failed");
+
+	//Stream Compact your array of dev_primitives to cull out primitives that cant be seen or lit in a scene
+	//thrust::partition returns a pointer to the element in the array where the partition occurs 
+	Primitive* partition_point = thrust::partition(thrust::device, 
+												   dev_primitives, 
+												   dev_primitives + numActivePrimitives, 
+												   predicate_PrimitiveCulling());
+	checkCUDAError("partitioning and streamcompaction failed");
+	numActivePrimitives = int(partition_point - dev_primitives);
+}
 
+__device__
+void _rasterizeTriangles(int w, int h, int index, glm::vec3 *tri,
+						 Primitive* dev_primitives, Fragment* dev_fragments, 
+						 int* dev_depthBuffer, int* dev_mutex)
+{
+	AABB boundingBox = getAABBForTriangle(tri);
 
-/**
- * Perform rasterization.
- */
-void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+	//clamp BB to be within the window
+	int BBminX = glm::min(w-1, glm::max(0, int(boundingBox.min.x)));
+	int BBmaxX = glm::max(0, glm::min(w-1, int(boundingBox.max.x)));
+	int BBminY = glm::min(h-1, glm::max(0, int(boundingBox.min.y)));
+	int BBmaxY = glm::max(0, glm::min(h-1, int(boundingBox.max.y)));
+
+	for (int y = BBminY; y <= BBmaxY; ++y)
+	{
+		for (int x = BBminX; x <= BBmaxX; ++x)
+		{
+			glm::vec3 baryCoords = calculateBarycentricCoordinate(tri, glm::vec2(x, y));
+			bool isInsideTriangle = isBarycentricCoordInBounds(baryCoords);
+			if (isInsideTriangle)
+			{
+				int fragIndex = x + y*w;
+				float z = -getZAtCoordinate(baryCoords, tri);
+#if DEPTH_TEST
+				DepthTest(dev_primitives, dev_fragments, dev_depthBuffer, dev_mutex,
+					z, tri, baryCoords, index, fragIndex);
+#else
+				modifyFragment(dev_primitives, dev_fragments, dev_depthBuffer, z,
+					tri, baryCoords, index, fragIndex);
+#endif
+			}
+		}
+	}
+}
+
+__device__
+void _rasterizeTriangleAsLines(int width, int height, const int *indicies,
+								Fragment* dev_fragments, glm::vec3 *tri)
+{
+	int x1, x2, y1, y2, dx, dy, y, fragIndex;
+	for (int index = 0; index < 6; index += 2)
+	{
+		x1 = tri[indicies[index]].x;    
+		y1 = tri[indicies[index]].y;
+		x2 = tri[indicies[index + 1]].x;  
+		y2 = tri[indicies[index + 1]].y;
+		dx = x2 - x1;                   
+		dy = y2 - y1;
+		for (int x = x1; x <= x2; x++)
+		{
+			y = y1 + dy * (x - x1) / dx;
+			fragIndex = x + y * width;
+			if ((x >= 0 && x <= width - 1) && 
+				(y >= 0 && y <= height - 1))
+			{
+				dev_fragments[fragIndex].fColor = glm::vec3(0.0f, 0.0f, 1.0f);
+			}				
+		}
+	}
+}
+
+__device__
+void _rasterizeTriangleAsPoints(int width, int height, Fragment* dev_fragments, glm::vec3 *tri)
+{
+	int x, y, fragIndex;
+	for (int vertexId = 0; vertexId < 3; ++vertexId)
+	{
+		x = tri[vertexId].x;
+		y = tri[vertexId].y;
+		int fragIndex = x + y * width;
+		if ((x >= 0 && x <= width - 1) && 
+			(y >= 0 && y <= height - 1))
+		{
+			dev_fragments[fragIndex].fColor = glm::vec3(1.0f, 0.0f, 0.0f);
+		}
+	}
+}
+
+__global__ 
+void _rasterizeScanLine(int w, int h, int numTriangles, Primitive* dev_primitives, 
+						Fragment* dev_fragments, int* dev_depthBuffer, int* dev_mutex)
+{
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < numTriangles)
+	{
+		glm::vec3 tri[3];
+		tri[0] = glm::vec3(dev_primitives[index].v[0].vPos);
+		tri[1] = glm::vec3(dev_primitives[index].v[1].vPos);
+		tri[2] = glm::vec3(dev_primitives[index].v[2].vPos);
+
+#if RASTERIZE_TRIANGLES
+		_rasterizeTriangles(w, h, index, tri, dev_primitives, dev_fragments, dev_depthBuffer, dev_mutex);
+#endif
+#if RASTERIZE_LINES
+		const int indices[] = { 0,1,1,2,2,0 };
+		_rasterizeTriangleAsLines(w, h, indices, dev_fragments, tri);
+#endif
+#if RASTERIZE_POINTS
+		_rasterizeTriangleAsPoints(w, h, dev_fragments, tri);
+#endif
+
+	}
+}
+
+__global__ 
+void RasterizePixels(int pixelXoffset, int pixelYoffset , int numpixelsX, int numpixelsY, 
+					 int imageWidth, int tileID, Tile* dev_tiles, 
+					 Primitive* dev_primitives, Fragment* dev_fragments, 
+					 int * dev_tileTriCount,
+					 int* dev_depthBuffer, int* dev_mutex)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = (x+pixelXoffset) + ((y+ pixelYoffset) * imageWidth);
+		
+	if (x < numpixelsX && y < numpixelsY)
+	{
+		//Each thread loops over the triangles inside the tile
+		//Discard tiles (ie kernel launches) that dont have any triangles inside them --> implicitly done by for loop
+		for (int i = 0; i < dev_tileTriCount[tileID]; i++)
+		{
+#if DISPLAY_TILES
+			dev_fragments[index].fColor = glm::vec3(1, 0, 0);
+			return;
+#endif
+			int triangleIndex = dev_tiles[tileID].triIndices[i];
+			glm::vec3 tri[3];
+			tri[0] = glm::vec3(dev_primitives[triangleIndex].v[0].vPos);
+			tri[1] = glm::vec3(dev_primitives[triangleIndex].v[1].vPos);
+			tri[2] = glm::vec3(dev_primitives[triangleIndex].v[2].vPos);
+
+			int _x = (x + pixelXoffset);
+			int _y = (y+ pixelYoffset);
+
+			glm::vec3 baryCoords = calculateBarycentricCoordinate(tri, glm::vec2(_x, _y));
+			bool isInsideTriangle = isBarycentricCoordInBounds(baryCoords);
+			if (isInsideTriangle)
+			{
+				int fragIndex = index;
+				float z = -getZAtCoordinate(baryCoords, tri);
+#if DEPTH_TEST
+				DepthTest(dev_primitives, dev_fragments, dev_depthBuffer, dev_mutex,
+					z, tri, baryCoords, triangleIndex, fragIndex);
+#else
+				modifyFragment(dev_primitives, dev_fragments, dev_depthBuffer, z,
+					tri, baryCoords, triangleIndex, fragIndex);
+#endif
+			}
+		}	
+	}
+}
+
+__global__ void bucketPrims_TileMutex(int w, int stride_x, int stride_y,
+									int numTriangles, 
+									Primitive* dev_primitives,
+									Tile* tiles,
+									int * tileTriCount, 
+									int * dev_tilemutex)
+{
+	//does the bucketing of primitives into tiles but tries to avoid race conditions by updating a list of bools that 
+	//correspond to the tiles the window is divided into. The list of bools exists per primitive.
+	//This bool list is later compiled per primitive to get the total number of primitives in a tile
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < numTriangles)
+	{
+		glm::vec3 tri[3];
+		tri[0] = glm::vec3(dev_primitives[index].v[0].vPos);
+		tri[1] = glm::vec3(dev_primitives[index].v[1].vPos);
+		tri[2] = glm::vec3(dev_primitives[index].v[2].vPos);
+		AABB boundingBox = getAABBForTriangle(tri);
+
+		//if boundingbox of triangle lies inside tile add it to tile triangle list
+		int tilesX = (int)glm::ceil(double(w) / double(stride_x));
+
+		int tileidminX = glm::floor(boundingBox.min.x / stride_x);
+		int tileidmaxX = glm::ceil(boundingBox.max.x / stride_x);
+		int tileidminY = glm::floor(boundingBox.min.y / stride_y);
+		int tileidmaxY = glm::ceil(boundingBox.max.y / stride_y);
+					
+		//use mutex lock
+		for (int i = tileidminY; i < tileidmaxY; i++)
+		{
+			for (int j = tileidminX; j < tileidmaxX; j++)
+			{
+				int tileID = j + i*(tilesX);
+				bool isSet;
+				do
+				{
+					isSet = (atomicCAS(&dev_tilemutex[tileID], 0, 1) == 0);
+					if (isSet)
+					{
+						// Critical section goes here.
+						// if it is afterward, a deadlock will occur.
+						int t = tileTriCount[tileID];
+						tiles[tileID].triIndices[t] = index;
+						tileTriCount[tileID] = t + 1;
+
+						dev_tilemutex[tileID] = 0;
+					}
+				} while (!isSet);
+
+			}
+		}
+	}
+}
+
+__global__ 
+void resetTiles(int numTiles, int stride_x, int stride_y, Tile* dev_tiles, int* dev_tileTriCount)
+{
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+	if (index < numTiles)
+	{
+		for (int i = 0; i < dev_tileTriCount[index]; i++)
+		{
+			dev_tiles[index].triIndices[i] = 0;
+		}
+		dev_tileTriCount[index] = 0;
+	}
+}
+
+void rasterizeTileBased(int w, int h, int numTriangles, Tile* dev_tiles, Primitive* dev_primitives,
+						Fragment* dev_fragments, int* dev_depthBuffer, int* dev_mutex)
+{
+	int stride_x = glm::floor(w / numTilesX);
+	int stride_y = glm::floor(h / numTilesY);
+	int numTiles = (numTilesX+1)*(numTilesY+1);
+	dim3 numThreadsPerBlock(128);
+
+	//reset tile triangles
+	dim3 blockCount1d_tiles(((numTiles - 1) / numThreadsPerBlock.x) + 1);
+	resetTiles <<<blockCount1d_tiles, numThreadsPerBlock>>> (numTiles, stride_x, stride_y, 
+															 dev_tiles, dev_tileTriCount);
+
+	//preprocess step looping over all triangles to bin them into buckets corresponding to the tiles
+	dim3 blockCount1d_triangles(((numTriangles - 1) / numThreadsPerBlock.x) + 1);
+	bucketPrims_TileMutex <<<blockCount1d_triangles, numThreadsPerBlock >>> (w, stride_x, stride_y,
+																			 numTriangles, dev_primitives,
+																			 dev_tiles, dev_tileTriCount, 
+																			 dev_tilemutex);
+
+	int sideLength2d = 8;
+	dim3 blockSize2d(sideLength2d, sideLength2d);
+
+	int tilesX = glm::ceil(w / stride_x);
+	int tileXcount = 0;
+	int tileYcount = 0;
+	for (int i = 0; i < w; i+=stride_y)
+	{
+		for (int j = 0; j < h; j+=stride_x)
+		{		
+			//Launch as many kernels as their are tiles			
+			int tileID = tileXcount + tileYcount*(tilesX);
+			//Each kernel is launched for the pixels contatined within it
+			glm::ivec2 pixelMin = glm::ivec2(tileXcount*stride_x, tileYcount*stride_y);
+			glm::ivec2 pixelMax = glm::ivec2(glm::min((tileXcount+1)*stride_x, w-1), 
+											 glm::min((tileYcount+1)*stride_y, h-1));
+
+			int numpixelsX = pixelMax.x - pixelMin.x;
+			int numpixelsY = pixelMax.y - pixelMin.y;
+
+			int pixelXoffset = tileXcount*stride_x;
+			int pixelYoffset = tileYcount*stride_y;
+
+			dim3 blockCount2d_tilePixels((numpixelsX - 1) / blockSize2d.x + 1,
+										 (numpixelsY - 1) / blockSize2d.y + 1);
+			RasterizePixels <<<blockCount2d_tilePixels, blockSize2d >>> (pixelXoffset, pixelYoffset,
+																		 numpixelsX, numpixelsY, w,
+																		 tileID, dev_tiles, 
+																		 dev_primitives,
+																		 dev_fragments,
+																		 dev_tileTriCount,
+																		 dev_depthBuffer, dev_mutex);
+			checkCUDAError("tile rasterization failed");
+
+			tileXcount++;
+		}
+		tileXcount = 0;
+		tileYcount++;		
+	}
+}
+
+//Perform rasterization.
+void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) 
+{
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
-		(height - 1) / blockSize2d.y + 1);
-
-	// Execute your rasterization pipeline here
-	// (See README for rasterization pipeline outline.)
-
+					  (height - 1) / blockSize2d.y + 1);
+	
+	//------------------------------------------------
+	//Timer Start
+	//timeStartCpu = std::chrono::high_resolution_clock::now();
+	//------------------------------------------------
+
+	//----------------------------------------------------------
+	//----------------- Rasterization pipeline------------------
+	//----------------------------------------------------------
 	// Vertex Process & primitive assembly
+
 	{
 		curPrimitiveBeginId = 0;
 		dim3 numThreadsPerBlock(128);
@@ -695,21 +1321,21 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 		auto it = mesh2PrimitivesMap.begin();
 		auto itEnd = mesh2PrimitivesMap.end();
 
-		for (; it != itEnd; ++it) {
+		for (; it != itEnd; ++it) 
+		{
 			auto p = (it->second).begin();	// each primitive
 			auto pEnd = (it->second).end();
-			for (; p != pEnd; ++p) {
+			for (; p != pEnd; ++p) 
+			{
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				_vertexTransformAndAssembly <<<numBlocksForVertices, numThreadsPerBlock>>>(p->numVertices, *p, MVP, MV,
+																							MV_normal, width, height);
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
-				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
-					(p->numIndices, 
-					curPrimitiveBeginId, 
-					dev_primitives, 
-					*p);
+				_primitiveAssembly <<<numBlocksForIndices, numThreadsPerBlock>>> (p->numIndices, curPrimitiveBeginId, 
+																						dev_primitives, *p);
 				checkCUDAError("Primitive Assembly");
 
 				curPrimitiveBeginId += p->numPrimitives;
@@ -719,26 +1345,57 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
 	
+	numActivePrimitives = totalNumPrimitives;
+#if BACKFACE_CULLING
+	glm::vec3 camForward = glm::vec3(1.0f,1.0f,1.0f);
+	BackFaceCulling(numActivePrimitives, dev_primitives, camForward);
+#endif
+
+	//reset fragment buffer
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
+	//Reset Depth Buffer and mutex lock for depth buffer
+	cudaMemset(dev_mutex, 0, width * height * sizeof(int)); //mutex for depth buffer
+	initDepth <<<blockCount2d, blockSize2d >>>(width, height, dev_depth);
 	
-	// TODO: rasterize
-
-
+	dim3 numThreadsPerBlock(128);
+#if SCANLINE
+	// rasterize --> looping over all primitives(triangles)
+	dim3 blockSize1d((numActivePrimitives - 1) / numThreadsPerBlock.x + 1);
+	_rasterizeScanLine <<<blockSize1d, numThreadsPerBlock>>>(width, height, numActivePrimitives,
+															 dev_primitives, dev_fragmentBuffer,
+															 dev_depth, dev_mutex);
+	checkCUDAError("scanline rendering failed");
+#endif
+
+#if TILEBASED
+	cudaMemset(dev_tilemutex, 0, maxNumTiles * sizeof(int)); //mutex for tileTriCount buffer
+	cudaMemset(dev_tileTriCount, 0, maxNumTiles * sizeof(int));
+
+	rasterizeTileBased(width, height, numActivePrimitives, dev_tiles, dev_primitives, dev_fragmentBuffer,
+					   dev_depth, dev_mutex);
+	checkCUDAError("tile based rendering failed");
+#endif
 
     // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	render <<<blockCount2d, blockSize2d >>>(width, height, dev_fragmentBuffer, dev_framebuffer);
 	checkCUDAError("fragment shader");
+
+	//------------------------------------------------
+	//Timer End
+	//timeEndCpu = std::chrono::high_resolution_clock::now();
+	//std::chrono::duration<double, std::milli> duration = timeEndCpu - timeStartCpu;
+	//prevElapsedTime = static_cast<decltype(prevElapsedTime)>(duration.count());
+	//printf("%f\n", prevElapsedTime);
+	//------------------------------------------------
+
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
     sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
     checkCUDAError("copy render result to pbo");
 }
 
-/**
- * Called once at the end of the program to free CUDA memory.
- */
-void rasterizeFree() {
-
+//Called once at the end of the program to free CUDA memory.
+void rasterizeFree() 
+{
     // deconstruct primitives attribute/indices device buffer
 
 	auto it(mesh2PrimitivesMap.begin());
@@ -752,13 +1409,18 @@ void rasterizeFree() {
 			cudaFree(p->dev_diffuseTex);
 
 			cudaFree(p->dev_verticesOut);
-
-			
-			//TODO: release other attributes and materials
 		}
 	}
 
 	////////////
+	cudaFree(dev_tiles);
+	dev_tiles = NULL;
+
+	cudaFree(dev_tileTriCount);
+	dev_tileTriCount = NULL;
+
+	cudaFree(dev_tilemutex);
+	dev_tilemutex = NULL;
 
     cudaFree(dev_primitives);
     dev_primitives = NULL;
@@ -772,5 +1434,8 @@ void rasterizeFree() {
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+	cudaFree(dev_mutex);
+	dev_depth = NULL;
+
     checkCUDAError("rasterize Free");
 }
diff --git a/src/rasterizeTools.h b/src/rasterizeTools.h
index 46c701e..47a52c9 100644
--- a/src/rasterizeTools.h
+++ b/src/rasterizeTools.h
@@ -12,25 +12,23 @@
 #include <glm/glm.hpp>
 #include <util/utilityCore.hpp>
 
-struct AABB {
+struct AABB 
+{
     glm::vec3 min;
     glm::vec3 max;
 };
 
-/**
- * Multiplies a glm::mat4 matrix and a vec4.
- */
+//Multiplies a glm::mat4 matrix and a vec4.
 __host__ __device__ static
 glm::vec3 multiplyMV(glm::mat4 m, glm::vec4 v) {
     return glm::vec3(m * v);
 }
 
-// CHECKITOUT
-/**
- * Finds the axis aligned bounding box for a given triangle.
- */
+
+//Finds the axis aligned bounding box for a given triangle.
 __host__ __device__ static
-AABB getAABBForTriangle(const glm::vec3 tri[3]) {
+AABB getAABBForTriangle(const glm::vec3 tri[3]) 
+{
     AABB aabb;
     aabb.min = glm::vec3(
             min(min(tri[0].x, tri[1].x), tri[2].x),
@@ -43,19 +41,14 @@ AABB getAABBForTriangle(const glm::vec3 tri[3]) {
     return aabb;
 }
 
-// CHECKITOUT
-/**
- * Calculate the signed area of a given triangle.
- */
+//Calculate the signed area of a given triangle.
 __host__ __device__ static
-float calculateSignedArea(const glm::vec3 tri[3]) {
+float calculateSignedArea(const glm::vec3 tri[3]) 
+{
     return 0.5 * ((tri[2].x - tri[0].x) * (tri[1].y - tri[0].y) - (tri[1].x - tri[0].x) * (tri[2].y - tri[0].y));
 }
 
-// CHECKITOUT
-/**
- * Helper function for calculating barycentric coordinates.
- */
+//Helper function for calculating barycentric coordinates.
 __host__ __device__ static
 float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c, const glm::vec3 tri[3]) {
     glm::vec3 baryTri[3];
@@ -65,10 +58,7 @@ float calculateBarycentricCoordinateValue(glm::vec2 a, glm::vec2 b, glm::vec2 c,
     return calculateSignedArea(baryTri) / calculateSignedArea(tri);
 }
 
-// CHECKITOUT
-/**
- * Calculate barycentric coordinates.
- */
+//Calculate barycentric coordinates.
 __host__ __device__ static
 glm::vec3 calculateBarycentricCoordinate(const glm::vec3 tri[3], glm::vec2 point) {
     float beta  = calculateBarycentricCoordinateValue(glm::vec2(tri[0].x, tri[0].y), point, glm::vec2(tri[2].x, tri[2].y), tri);
@@ -77,10 +67,7 @@ glm::vec3 calculateBarycentricCoordinate(const glm::vec3 tri[3], glm::vec2 point
     return glm::vec3(alpha, beta, gamma);
 }
 
-// CHECKITOUT
-/**
- * Check if a barycentric coordinate is within the boundaries of a triangle.
- */
+//Check if a barycentric coordinate is within the boundaries of a triangle.
 __host__ __device__ static
 bool isBarycentricCoordInBounds(const glm::vec3 barycentricCoord) {
     return barycentricCoord.x >= 0.0 && barycentricCoord.x <= 1.0 &&
@@ -88,14 +75,35 @@ bool isBarycentricCoordInBounds(const glm::vec3 barycentricCoord) {
            barycentricCoord.z >= 0.0 && barycentricCoord.z <= 1.0;
 }
 
-// CHECKITOUT
-/**
- * For a given barycentric coordinate, compute the corresponding z position
- * (i.e. depth) on the triangle.
- */
+//For a given barycentric coordinate, compute the corresponding z position
+//(i.e. depth) on the triangle.
 __host__ __device__ static
 float getZAtCoordinate(const glm::vec3 barycentricCoord, const glm::vec3 tri[3]) {
     return -(barycentricCoord.x * tri[0].z
            + barycentricCoord.y * tri[1].z
            + barycentricCoord.z * tri[2].z);
 }
+
+//For edge function rasterization --> the actual technique used in mordern GPUs thats much faster than scanline
+__host__ __device__ static
+bool edgeFunction(const glm::vec3 &a, const glm::vec3 &b, const glm::vec2 &p)
+{
+	//the dot product thing commented below is essentially what the function 
+	//is doing but the dot product is something i can easily understand at a glance
+	//glm::vec3 v1 = b-a;
+	//glm::vec3 v2 = p-a;	
+	//return (glm::dot(v1, v2)>=0);
+
+	return ((p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x) >= 0);
+}
+
+__host__ __device__ static
+bool IsPointInsideTriangle(const glm::vec3 &V0, const glm::vec3 &V1, const glm::vec3 &V2, const glm::vec2 &p)
+{
+	bool inside = true;
+	inside &= edgeFunction(V0, V1, p);
+	inside &= edgeFunction(V1, V2, p);
+	inside &= edgeFunction(V2, V0, p);
+
+	return inside;
+}