diff --git a/README.md b/README.md
index cad1abd..90695b9 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,64 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
+![](renders/header.png)
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Mariano Merchante
+* Tested on
+  * Microsoft Windows 10 Pro
+  * Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 2601 Mhz, 4 Core(s), 8 Logical Processor(s)
+  * 32.0 GB RAM
+  * NVIDIA GeForce GTX 1070 (mobile version)
+  
+## Details
+This project implements a hierarchical tiled rasterizer in CUDA. It subdivides the screen multiple times and stores polygon data at different levels to optimize memory usage, and then traverses through this hierarchy when rendering.
+
+A video can be found [here](https://vimeo.com/238739035)
+
+## Rasterization
+The rasterization aspect is very similar to other approaches, with the difference that instead of iterating through all primitives and doing scanline rasterization, it follows these steps:
+
+- Builds a tile data structure that contains all hierarchy levels and enough memory for primitive indices.
+  - It uses a logarithmic scale to increase the primitive capacity as the tile becomes bigger.
+- On each frame:
+  - Clears every tile primitive counter
+  - Iterates over all primitives and stores them at the correct level on the hierarchy. It uses an atomic counter to keep track of how many primitives the tile must render. To select which level to store the primitive, it checks with how many tiles it intersects.
+  - Rasterizes each tile, iterating through all found primitives and up through the hierarchy until the biggest level is reached.
+    - Note that because each tile is running in parallel, z testing has no race condition and thus can be trivially done in the tile kernel.
+    
+Because most of the effort was put into the tiled rendering approach, no fancy methods such as texturing, AA or image effects were implemented, given that these are usually trivial.
+    
+## Specific optimizations
+
+- Backface culling
+- Early Z-reject 
+- Pineda algorithm for triangle rasterization
+
+
+# Overdraw and early rejection
+Note that when early rejection is enabled, primitive overdraw is reduced because triangles are discarded before doing per fragment z-testing.
+
+![](renders/overdraw.png)
+
+Overdraw without early reject
+
+![](renders/overdraw_early.png)
+
+Overdraw with early reject
+   
+## Results
+This approach seems to be very good when the geometry is balanced throughout different tile levels. If, for example, the full scene can be placed on one small tile, performance can drop dramatically, and can even lose primitives. This can be mitigated by doing multiple passes until all geometry is rasterized, but it is not implemented.
+
+Memory consumption is a big issue too, and the logarithmic scale used for different hierarchy capacities is used to mitigate the fact that as tiles become bigger, more primitives are going to intersect with them. 
 
-### (TODO: Your README)
+As expected, the results are also very dependent on the tile size, the amount of subdivisions, and also the threshold used for placing primitives at specific levels of the hierarchy.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+An important distinction is that this approach does not use shared memory, so 16x16 tiles are actually really slow because of global memory access. When tiles are smaller, around 4x4, the algorithm benefits from cache accesses and performance improves drastically.
 
+## Improvements
+* The fixed, preallocated primitive memory is not ideal, and maybe extending the atomic counter idea to a dynamic global list may be useful at different hierarchy levels.
+* Automatically optimizing certain parameters, such as tile size, subdivision levels, coverage count, etc. for the specified scene data could be interesting.
 
 ### Credits
 
diff --git a/renders/example.mp4 b/renders/example.mp4
new file mode 100644
index 0000000..3c44216
Binary files /dev/null and b/renders/example.mp4 differ
diff --git a/renders/header.png b/renders/header.png
new file mode 100644
index 0000000..b83d44e
Binary files /dev/null and b/renders/header.png differ
diff --git a/renders/overdraw.png b/renders/overdraw.png
new file mode 100644
index 0000000..38ef0dc
Binary files /dev/null and b/renders/overdraw.png differ
diff --git a/renders/overdraw_early.png b/renders/overdraw_early.png
new file mode 100644
index 0000000..d6df6eb
Binary files /dev/null and b/renders/overdraw_early.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..b2542f2 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -138,8 +138,8 @@ bool init(const tinygltf::Scene & scene) {
         return false;
     }
 
-    width = 800;
-    height = 800;
+    width = 1280;
+    height = 720;
     window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
     if (!window) {
         glfwTerminate();
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..94774cd 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -17,6 +17,9 @@
 #include "rasterize.h"
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
+#include <device_launch_parameters.h>
+#include <device_atomic_functions.h>
+
 
 namespace {
 
@@ -34,6 +37,23 @@ namespace {
 		Triangle = 3
 	};
 
+	struct RenderSubdivision {
+		int tileOffset;
+		int tileCount; // Amount of tiles
+		int tileSize; // Size of tiles at this level
+	};
+
+	struct RenderTile {
+		int tileLength; // The size in pixels
+		int tileLevel;	// Subdivision level
+		int capacity; // The amount of primitives that can hold
+		int primitiveOffset; // The index on the primitive buffer
+		int currentIndex; // The current last primitive set
+		glm::ivec2 from; // AABB min
+		glm::ivec2 to; // AABB max
+		int parentIndex;
+	};
+
 	struct VertexOut {
 		glm::vec4 pos;
 
@@ -53,20 +73,15 @@ namespace {
 	struct Primitive {
 		PrimitiveType primitiveType = Triangle;	// C++ 11 init
 		VertexOut v[3];
+		glm::vec3 min;
+		glm::vec3 max;
 	};
 
 	struct Fragment {
 		glm::vec3 color;
-
-		// TODO: add new attributes to your Fragment
-		// The attributes listed below might be useful, 
-		// but always feel free to modify on your own
-
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
-		// ...
+		glm::vec3 eyePos;
+		glm::vec3 eyeNor;
+		int overdraw;
 	};
 
 	struct PrimitiveDevBufPointers {
@@ -100,16 +115,26 @@ namespace {
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
+static int TILE_SIZE = 4; // In pixels
+static int TILE_PRIMITIVE_CAPACITY_BASE = 1024;
 
 static int width = 0;
 static int height = 0;
+static int effectiveTileSubdivisions = 0;
+static int totalTiles = 0;
 
 static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
+static int * dev_fragmentMutex = NULL;
+
+static RenderTile * dev_tile_headers = NULL; // Tile information
+static int * dev_tile_primitives = NULL; // The tile primitive indices
+static RenderSubdivision  * dev_subdivisions = NULL; // The offsets for each subdivision level
+
+static RenderSubdivision * subdivisionData = NULL; // Client!
 
-static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
@@ -125,6 +150,7 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
         color.x = glm::clamp(image[index].x, 0.0f, 1.0f) * 255.0;
         color.y = glm::clamp(image[index].y, 0.0f, 1.0f) * 255.0;
         color.z = glm::clamp(image[index].z, 0.0f, 1.0f) * 255.0;
+
         // Each thread writes one pixel location in the texture (textel)
         pbo[index].w = 0;
         pbo[index].x = color.x;
@@ -142,12 +168,8 @@ void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
     int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
-
-		// TODO: add your fragment shader code here
-
-    }
+    if (x < w && y < h) 
+		framebuffer[index] = glm::pow(glm::max(fragmentBuffer[index].color, glm::vec3(0.f)), glm::vec3(.454f)); 
 }
 
 /**
@@ -159,14 +181,104 @@ void rasterizeInit(int w, int h) {
 	cudaFree(dev_fragmentBuffer);
 	cudaMalloc(&dev_fragmentBuffer, width * height * sizeof(Fragment));
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-    cudaFree(dev_framebuffer);
+    
+	cudaFree(dev_framebuffer);
     cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
     cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
+
+	cudaMalloc(&dev_fragmentMutex, width * height * sizeof(int));
+	cudaMemset(dev_fragmentMutex, 0, width * height * sizeof(int));
+
+	int maxDimension = glm::max(w, h);
+	int maxSubdivisions = 8;
+	int tileHeaderMemory = 0;
+	int tilePrimitiveMemory = 0;
+
+	// Find the amount of subdivisions we will need
+	for (int i = 0; i < maxSubdivisions; i++)
+	{
+		int tileSize = TILE_SIZE * glm::pow(2, i);
+		int tileCount = glm::ceil(maxDimension / (float)tileSize);
+
+		totalTiles += tileCount * tileCount;
+		tilePrimitiveMemory += tileCount * tileCount * TILE_PRIMITIVE_CAPACITY_BASE * sizeof(int) * glm::log(i + 2);
+		
+		if (tileSize >= maxDimension)
+		{
+			effectiveTileSubdivisions = i;
+			break;
+		}
+	}
+
+	effectiveTileSubdivisions = glm::max(effectiveTileSubdivisions, 1);
+	
+	// Build the render tile header data
+	int tileOffset = 0;
+	RenderTile * tileHeaderData = new RenderTile[totalTiles];
+	subdivisionData = new RenderSubdivision[effectiveTileSubdivisions];
+
+	int currentPrimitiveOffset = 0;
+
+	for (int i = 0; i < effectiveTileSubdivisions; i++)
+	{
+		RenderTile tile;
+		tile.tileLevel = i;
+		tile.tileLength = TILE_SIZE * glm::pow(2, i);
+		tile.currentIndex = 0; // No primitives yet! This index must be cleared on each frame
+
+		int tileCount = glm::ceil(maxDimension / (float)tile.tileLength);
+		tile.capacity = TILE_PRIMITIVE_CAPACITY_BASE * glm::log(i + 2);
+
+		// Precompute some information for this subdivision level
+		subdivisionData[i].tileOffset = tileOffset;
+		subdivisionData[i].tileCount = tileCount;
+		subdivisionData[i].tileSize = tile.tileLength;
+
+		for (int y = 0; y < tileCount; ++y)
+		{
+			for (int x = 0; x < tileCount; ++x)
+			{
+				tile.from = glm::clamp(glm::vec2(x * tile.tileLength, y * tile.tileLength), glm::vec2(0.f), glm::vec2(width, height));
+				tile.to = glm::clamp(glm::vec2((x+1) * tile.tileLength, (y+1) * tile.tileLength), glm::vec2(0.f), glm::vec2(width, height));
+				tile.primitiveOffset = currentPrimitiveOffset + ((y * tileCount) + x) * tile.capacity;
+
+				// Precompute the next parent tile index
+				int parentX = x / 2;
+				int parentY = y / 2;
+				tile.parentIndex = tileOffset + (tileCount * tileCount) + (parentY * (tileCount/2)) + parentX;
+
+				tileHeaderData[tileOffset + (y * tileCount) + x] = tile;
+			}
+		}
+
+		printf("Tile offset for level %d : %d, primitive offset: %d, (buffer size: %d), capacity: %d \n", i, tileOffset, currentPrimitiveOffset,(tileCount * tileCount), tile.capacity);
+
+		currentPrimitiveOffset += tileCount * tileCount * tile.capacity;
+		tileOffset += tileCount * tileCount;
+	}
+
+	tileHeaderMemory = totalTiles * sizeof(RenderTile);
+	
+	printf("Size [%d,%d] | Tile subdivisions: %d \n", w, h, effectiveTileSubdivisions);
+	printf("Tile header memory: %f MB | Tile primitive memory: %f MB \n", tileHeaderMemory / (1024.f * 1024.f), (tilePrimitiveMemory / (1024.f * 1024.f)));
+	
+	int subdivisionMemory = effectiveTileSubdivisions * sizeof(RenderSubdivision);
+
+	cudaMalloc(&dev_subdivisions, subdivisionMemory);
+	checkCUDAError("Alloc tile subdivisions");
+	cudaMemcpy(dev_subdivisions, subdivisionData, subdivisionMemory, cudaMemcpyHostToDevice);
+	checkCUDAError("Copy tile subdivisions");
+
+	cudaMalloc(&dev_tile_headers, tileHeaderMemory);
+	checkCUDAError("Alloc tile headers");
+	cudaMemcpy(dev_tile_headers, tileHeaderData, tileHeaderMemory, cudaMemcpyHostToDevice);
+	checkCUDAError("Copy tile headers");
+
+	cudaMalloc(&dev_tile_primitives, tilePrimitiveMemory);
+	cudaMemset(dev_tile_primitives, -1, tilePrimitiveMemory); // -1 means invalid index!
     
-	cudaFree(dev_depth);
-	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
-	checkCUDAError("rasterizeInit");
+	checkCUDAError("rasterizeInit"); 
 }
 
 __global__
@@ -617,64 +729,242 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 		checkCUDAError("Free BufferView Device Mem");
 	}
+}
+
+__global__ 
+void _vertexTransformAndAssembly(int numVertices, PrimitiveDevBufPointers primitive,  glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, int width, int height) 
+{
+	// vertex id
+	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (vid < numVertices) 
+	{
+		glm::vec3 p = primitive.dev_position[vid];
+		glm::vec3 n = primitive.dev_normal[vid];
+		glm::vec3 eyeNormal = MV_normal * n;
+		glm::vec3 eyePos = glm::vec3(MV * glm::vec4(p, 1.f));
+
+		glm::vec4 ssPos = MVP * glm::vec4(p, 1.f);
 
+		if(ssPos.w != 0.f)
+			ssPos /= ssPos.w;
 
+		ssPos.x = (ssPos.x * .5f + .5f) * width;
+		ssPos.y = (ssPos.y * -.5f + .5f) * height;
+		ssPos.z = 1.f / eyePos.z;		
+
+		VertexOut out;
+		out.pos = ssPos;
+		out.eyePos = eyePos;
+		out.eyeNor = eyeNormal;		
+
+		primitive.dev_verticesOut[vid] = out;
+	}
 }
 
+__global__
+void clearTileIndices(int totalTileCount, RenderTile * dev_tile_header)
+{
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
 
+	if (index < totalTileCount)
+		dev_tile_header[index].currentIndex = 0;
+}
 
-__global__ 
-void _vertexTransformAndAssembly(
-	int numVertices, 
-	PrimitiveDevBufPointers primitive, 
-	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
+// Pineda
+__forceinline__
+__host__ __device__
+float edgeFunction(glm::vec2 &a, glm::vec2  &b, glm::vec2 &c)
+{
+	return ((c[0] - a[0]) * (b[1] - a[1])) - ((c[1] - a[1]) * (b[0] - a[0]));
+}
 
-	// vertex id
-	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
-	if (vid < numVertices) {
+__global__
+void rasterizeTiles(int numTiles, int numSubdivisions, int width, int height, RenderSubdivision * dev_subdivisions, RenderTile * dev_tile_header, int * dev_tile_primitives, Primitive* dev_primitives, Fragment *dev_fragmentBuffer, int * dev_fragmentMutex)
+{
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-		// TODO: Apply vertex transformation here
-		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
-		// Then divide the pos by its w element to transform into NDC space
-		// Finally transform x and y to viewport space
+	if (index < numTiles)
+	{
+		// Base tile is always smallest
+		RenderTile & baseTile = dev_tile_header[index];
 
-		// TODO: Apply vertex assembly here
-		// Assemble all attribute arraies into the primitive array
-		
+		for (int y = baseTile.from.y; y < baseTile.to.y; y++)
+		{
+			for (int x = baseTile.from.x; x < baseTile.to.x; x++)
+			{
+				glm::vec2 point = glm::vec2(x, y);
+				Fragment resultFragment;
+				resultFragment.color = glm::vec3(0.f);
+				
+				float depth = 2000.f;
+				int overdraw = 0;
+
+				RenderTile * tile = &baseTile;
+				for (int i = 0; i < numSubdivisions; i++)
+				{
+					int totalPrimitives = glm::min(tile->currentIndex, tile->capacity);
+
+					for (int p = 0; p < totalPrimitives; p++)
+					{
+						int primitiveIndex = dev_tile_primitives[tile->primitiveOffset + p];
+						Primitive & prim = dev_primitives[primitiveIndex];
+
+						// Early Z reject
+						if (-1.f / prim.min.z > depth)
+							continue;
+
+						if (x >= prim.min.x && x <= prim.max.x && y >= prim.min.y && y <= prim.max.y)
+						{
+							glm::vec2 v0 = glm::vec2(prim.v[0].pos);
+							glm::vec2 v1 = glm::vec2(prim.v[1].pos);
+							glm::vec2 v2 = glm::vec2(prim.v[2].pos);
+
+							float area = edgeFunction(v0, v1, v2);
+							float u = edgeFunction(v1, v2, point);
+							float v = edgeFunction(v2, v0, point);
+							float w = edgeFunction(v0, v1, point);
+
+							if(u >= 0 && v >= 0 && w >= 0) 
+							{
+								u /= area;
+								v /= area;
+								w /= area;
+
+								float interpolatedDepth = u * prim.v[0].pos.z + v * prim.v[1].pos.z + w * prim.v[2].pos.z;
+								
+								if(interpolatedDepth != 0.f)
+									interpolatedDepth = -1.f / interpolatedDepth;
+
+								if (interpolatedDepth < depth)
+								{
+									depth = interpolatedDepth;
+									resultFragment.eyePos = prim.v[0].eyePos * u + prim.v[1].eyePos * v + prim.v[2].eyePos * w;
+									resultFragment.eyeNor = prim.v[0].eyeNor * u + prim.v[1].eyeNor * v + prim.v[2].eyeNor * w;
+								}
+
+								overdraw++;
+							}
+						}
+					}
+
+					// Jump to parent tile
+					if (i < numSubdivisions - 1)
+						tile = &dev_tile_header[tile->parentIndex];
+				}
+
+				resultFragment.overdraw = overdraw;
+
+				if (overdraw > 0)
+				{
+					// Fragment shader code
+					glm::vec3 toLightEye = glm::normalize(glm::vec3(1.f));
+					glm::vec3 R = glm::reflect(toLightEye, resultFragment.eyeNor);
+
+					float cosTheta = glm::dot(resultFragment.eyeNor, toLightEye) * .5f;
+					float ratio = glm::pow(1.0f - resultFragment.eyeNor.z, 2.f) * .35f;
+					float bounce = glm::max(0.f, -glm::dot(resultFragment.eyeNor, toLightEye)) * .35f;
+					float specular = glm::pow(R.z, 16.f) * .5f;
+					resultFragment.color = glm::vec3(cosTheta + ratio + bounce + specular);
+				}
+				
+				int fragIndex = y * width + x;
+				dev_fragmentBuffer[fragIndex] = resultFragment;
+			}
+		}
 	}
 }
 
+__global__
+void updateTiles(int numPrimitives, int w, int h, int tileSubdivisions, int baseTileSize, Primitive* dev_primitives, 
+	RenderTile * dev_tile_header, int * dev_tile_primitives, RenderSubdivision  * dev_subdivisions)
+{
+	int primitiveIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
 
+	if (primitiveIndex < numPrimitives)
+	{
+		Primitive & p = dev_primitives[primitiveIndex];
+		glm::vec4 v1 = p.v[0].pos;
+		glm::vec4 v2 = p.v[1].pos;
+		glm::vec4 v3 = p.v[2].pos;
+
+		// Get the AABB (with Z too, for early reject)
+		p.min = glm::min(glm::vec3(v1), glm::min(glm::vec3(v2), glm::vec3(v3)));
+		p.max = glm::max(glm::vec3(v1), glm::max(glm::vec3(v2), glm::vec3(v3)));
+
+		// Ignore primitives behind
+		if (p.max.z > 0.f)
+			return;
+
+		glm::vec2 screenMin = glm::vec2(p.min);
+		glm::vec2 screenMax = glm::vec2(p.max);
+		glm::vec2 screenSize = glm::abs(glm::vec2(p.max) - glm::vec2(p.min));
+
+		for (int i = 0; i < tileSubdivisions; ++i)
+		{
+			RenderSubdivision & subdiv = dev_subdivisions[i];
+
+			glm::vec2 tileSize = glm::vec2(subdiv.tileSize, subdiv.tileSize);
+			glm::ivec2 sizeAtResolution = glm::ceil(screenSize / tileSize);
+
+			int affectedTiles = glm::max(sizeAtResolution.x, sizeAtResolution.y);
+
+			// If the size of this triangle is comparable to the tile size, stop at this level
+			// Also stop if this is the last subdivisions (__very__ large triangles)
+			if (affectedTiles <= 6 || i == tileSubdivisions - 1)
+			{
+				int tileCount = dev_subdivisions[i].tileCount;
+				int tileOffset = dev_subdivisions[i].tileOffset;
+
+				// Make sure we don't go out of bounds for this level
+				glm::ivec2 from = glm::clamp(glm::floor(screenMin / tileSize), glm::vec2(0), glm::vec2(tileCount));
+				glm::ivec2 to = glm::clamp(glm::ceil(screenMax / tileSize), glm::vec2(0), glm::vec2(tileCount));
+
+				// Now write into the tile buffer
+				for (int y = from.y; y <= to.y; ++y)
+				{
+					for (int x = from.x; x <= to.x; ++x)
+					{
+						RenderTile & tile = dev_tile_header[tileOffset + (tileCount * y) + x];
+
+						// Get the list head and set this primitive
+						int lastIndex = atomicAdd(&tile.currentIndex, 1);
+
+						// We don't really care if the index goes above this point, we just care about not setting
+						// memory outside this array
+						if(lastIndex < tile.capacity)
+							dev_tile_primitives[tile.primitiveOffset + lastIndex] = primitiveIndex;
+					}
+				}
+
+				// We already found our level, don't do anything else
+				return;
+			}
+		}
+	}
+}
 
 static int curPrimitiveBeginId = 0;
 
 __global__ 
-void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) {
-
-	// index id
+void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_primitives, PrimitiveDevBufPointers primitive) 
+{
 	int iid = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-	if (iid < numIndices) {
-
+	if (iid < numIndices) 
+	{
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
-
-
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 		// TODO: other primitive types (point, line)
 	}
-	
 }
 
-
-
 /**
  * Perform rasterization.
  */
@@ -719,12 +1009,30 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
 	
+	cudaMemset(dev_fragmentMutex, 0, width * height * sizeof(int));
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
-	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
 
+	// Clear tile indices
+	dim3 blockSizeTiles(64);
+	dim3 blockCountTiles((totalTiles - 1) / blockSizeTiles.x + 1);
+	clearTileIndices << <blockCountTiles, blockSizeTiles >> >(totalTiles, dev_tile_headers);
+
+	cudaDeviceSynchronize();
+
+	// Update tile data
+	dim3 numThreadsPerBlockTiles(128);
+	dim3 blockCountForPrimitives((totalNumPrimitives - 1) / numThreadsPerBlockTiles.x + 1);
+	updateTiles << <blockCountForPrimitives, numThreadsPerBlockTiles >> > (totalNumPrimitives, width, height, effectiveTileSubdivisions,
+		TILE_SIZE, dev_primitives, dev_tile_headers, dev_tile_primitives, dev_subdivisions);
 
+	cudaDeviceSynchronize();
+
+	// Rasterize tiles
+	int totalTiles = subdivisionData[0].tileCount * subdivisionData[0].tileCount;
+	dim3 blockCountForRasterization((totalTiles - 1) / numThreadsPerBlockTiles.x + 1);
+	rasterizeTiles << <blockCountForRasterization, numThreadsPerBlockTiles >> > (totalTiles, effectiveTileSubdivisions, width, height, dev_subdivisions, dev_tile_headers, dev_tile_primitives, dev_primitives, dev_fragmentBuffer, dev_fragmentMutex);
+
+	cudaDeviceSynchronize();
 
     // Copy depthbuffer colors into framebuffer
 	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
@@ -769,8 +1077,5 @@ void rasterizeFree() {
     cudaFree(dev_framebuffer);
     dev_framebuffer = NULL;
 
-	cudaFree(dev_depth);
-	dev_depth = NULL;
-
     checkCUDAError("rasterize Free");
 }