Commit 74f184f

enhancement: hashset (#86)
* feat: fix bugs and improve hashset docs
* fix: getValue() for linkedlist used in hashset implementation
1 parent d6cea5b commit 74f184f

5 files changed

Lines changed: 310 additions & 82 deletions

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# Hash Table (HashSet / HashMap)

## Background

A **hash table** maps keys to values using a **hash function** that converts keys into array indices. This enables `O(1)` expected-time insert, lookup, and delete.

```
key → hash(key) → index → bucket
```

The core challenge is **collision handling**: what to do when two different keys hash to the same index.

### Hash Function Requirements

A good hash function should be:
1. **Deterministic**: The same key always produces the same hash
2. **Uniformly distributed**: Keys spread evenly across buckets
3. **Fast to compute**: `O(1)` time

A common approach is the **division method**, `h(k) = k mod m`, where `m` is the number of buckets.
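
As a rough illustration of the division method, the sketch below maps a key's `hashCode()` to a bucket index. The names `DivisionHash` and `bucketIndex` are made up for this example and are not part of the repository. The sign bit is cleared before taking the remainder so the index is never negative, which is the same trick this commit applies in `HashSet.java`.

```java
// Hypothetical helper for illustration only; not part of the repository.
final class DivisionHash {
    private DivisionHash() {}

    /** Maps a key to a bucket index in [0, m) using the division method. */
    static int bucketIndex(Object key, int m) {
        // Clear the sign bit first: Java's % keeps the sign of the dividend,
        // so a negative hashCode() would otherwise give a negative index.
        return (key.hashCode() & 0x7FFFFFFF) % m;
    }

    public static void main(String[] args) {
        int m = 7; // a small prime number of buckets
        // Integer.hashCode() is the int value itself, so this is 10 mod 7 = 3.
        System.out.println(bucketIndex(10, m));
        // Integer.MIN_VALUE with the sign bit cleared is 0, so the index is 0.
        System.out.println(bucketIndex(Integer.MIN_VALUE, m));
    }
}
```

Note that clearing the sign bit is not the same as a mathematical modulo for negative hashes; it only guarantees a valid, deterministic index.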

## Collision Resolution Strategies

| Strategy | How it works | Collision handling |
|----------|--------------|-------------------|
| **[Chaining](chaining/)** | Each bucket stores a linked list | Append to list at bucket |
| **[Open Addressing](openAddressing/)** | All elements stored in array | Probe for next empty slot |

### Chaining

Each bucket contains a linked list of elements that hash to that index.

```
Bucket 0: → [A] → [D] → null
Bucket 1: → [B] → null
Bucket 2: → [C] → [E] → [F] → null
```

**Pros**: Simple, never "full", degrades gracefully
**Cons**: Extra memory for pointers, cache-unfriendly

### Open Addressing

All elements stored directly in the array. On collision, probe for next available slot.

```
[A] [B] [_] [C] [D] [_] [E] [_]

collision with A → probe to here
```

**Pros**: Cache-friendly, no extra pointers
**Cons**: Clustering, must resize when full, deletion is tricky (tombstones)
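
To make the probing and tombstone ideas concrete, here is a minimal linear-probing sketch. It is an assumption-laden illustration, not the repository's `openAddressing` implementation: the class name `LinearProbingSet`, the fixed capacity, and the `TOMBSTONE` sentinel are all invented for this example, and resizing is left out entirely.

```java
/** Minimal linear-probing set with a fixed capacity (illustrative sketch only). */
final class LinearProbingSet {
    // Sentinel marking a deleted slot, so later probes do not stop early.
    private static final Object TOMBSTONE = new Object();
    private final Object[] slots;

    LinearProbingSet(int capacity) {
        slots = new Object[capacity];
    }

    private int home(Object key) {
        return (key.hashCode() & 0x7FFFFFFF) % slots.length;
    }

    boolean contains(Object key) {
        for (int i = 0, j = home(key); i < slots.length; i++, j = (j + 1) % slots.length) {
            if (slots[j] == null) return false;     // truly empty: key cannot be further along
            if (key.equals(slots[j])) return true;  // a tombstone never equals a key, so it is skipped
        }
        return false;
    }

    boolean add(Object key) {
        if (contains(key)) return false;
        for (int i = 0, j = home(key); i < slots.length; i++, j = (j + 1) % slots.length) {
            if (slots[j] == null || slots[j] == TOMBSTONE) { // tombstones can be reused
                slots[j] = key;
                return true;
            }
        }
        throw new IllegalStateException("table full; a real table would resize first");
    }

    boolean remove(Object key) {
        for (int i = 0, j = home(key); i < slots.length; i++, j = (j + 1) % slots.length) {
            if (slots[j] == null) return false;
            if (key.equals(slots[j])) {
                slots[j] = TOMBSTONE; // do not null the slot: that would cut the probe chain
                return true;
            }
        }
        return false;
    }
}
```

With capacity 8, the keys 1, 9 and 17 all start at slot 1. Removing 9 leaves a tombstone in slot 2, so a later `contains(17)` still probes past it and finds 17 in slot 3.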

## Complexity Analysis

| Operation | Expected | Worst (Chaining) | Worst (Open Addressing) |
|-----------|----------|------------------|-------------------------|
| `add()` | `O(1)` | `O(n)` | `O(n)` |
| `contains()` | `O(1)` | `O(n)` | `O(n)` |
| `remove()` | `O(1)` | `O(n)` | `O(n)` |

**Space**: `O(n)` for n elements

Expected `O(1)` assumes the **Simple Uniform Hashing Assumption (SUHA)**: each key is equally likely to hash to any bucket, independent of other keys.

The worst case occurs when all keys hash to the same bucket (degenerate case).

## Load Factor and Resizing

The **load factor** `α = n/m` (elements/buckets) measures how full the table is.

| Strategy | Typical threshold | Why resize? |
|----------|------------------|-------------|
| **Chaining** | α > 0.75 (recommended) | Performance optimization - lists get long |
| **Open Addressing** | α > 0.75 (mandatory) | **Must resize** - table fills up, probing degrades |

**Key insight**: Resizing is **mandatory** for open addressing (table becomes full), but merely an **optimization** for chaining (can always append to lists).
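
The sketch below shows one way a chaining table might apply a load-factor threshold. It is illustrative only: `ResizingChainedSet`, the initial size of 4, and the doubling policy are assumptions for this example, and it uses `java.util.LinkedList` rather than the repository's own linked list.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Chaining set that doubles its bucket count when the load factor exceeds 0.75 (sketch). */
final class ResizingChainedSet<T> {
    private static final double MAX_LOAD_FACTOR = 0.75; // threshold taken from the table above
    private List<LinkedList<T>> buckets = newBuckets(4); // assumed initial size
    private int size;

    private static <E> List<LinkedList<E>> newBuckets(int m) {
        List<LinkedList<E>> b = new ArrayList<>(m);
        for (int i = 0; i < m; i++) b.add(new LinkedList<>());
        return b;
    }

    private int indexFor(T key, int m) {
        return (key.hashCode() & 0x7FFFFFFF) % m;
    }

    boolean add(T key) {
        LinkedList<T> bucket = buckets.get(indexFor(key, buckets.size()));
        if (bucket.contains(key)) return false;
        bucket.add(key);
        size++;
        // alpha = n / m; grow once it passes the threshold.
        if ((double) size / buckets.size() > MAX_LOAD_FACTOR) resize();
        return true;
    }

    boolean contains(T key) {
        return buckets.get(indexFor(key, buckets.size())).contains(key);
    }

    int bucketCount() {
        return buckets.size();
    }

    private void resize() {
        // Every key must be rehashed, because its index depends on m.
        List<LinkedList<T>> bigger = newBuckets(buckets.size() * 2);
        for (LinkedList<T> bucket : buckets) {
            for (T key : bucket) {
                bigger.get(indexFor(key, bigger.size())).add(key);
            }
        }
        buckets = bigger;
    }
}
```

Doubling keeps the amortized cost of `add()` at `O(1)`: each resize touches all `n` elements, but resizes become exponentially rarer as the table grows.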

## Real-World Implementations

### Java's HashMap (Chaining with Treeification)

Java uses **chaining** with a clever optimization:
1. Initially: linked list per bucket
2. When a bucket exceeds 8 elements (and the table has at least 64 buckets): convert to **Red-Black Tree**
3. When a treeified bucket shrinks to 6 or fewer elements: convert back to linked list

```
Bucket with few elements:  → [A] → [B] → [C]   O(n) search
Bucket with many elements: Red-Black Tree      O(log n) search
```

This bounds worst-case lookup to `O(log n)` instead of `O(n)`.

### Python's dict (Open Addressing with Perturbation)

Python uses **open addressing** with a sophisticated probing strategy:
1. Primary hash determines initial slot
2. On collision: **perturbed probing** using bits from full hash
3. Probe sequence: `j = ((5*j) + 1 + perturb) mod 2^n`, where `perturb` starts as the full hash and is shifted right by 5 bits on each probe

This spreads colliding keys better than plain linear probing, since the sequence eventually uses every bit of the hash, while all entries still live in one contiguous array.
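
The recurrence above can be sketched in Java as follows. This is a standalone illustration of the probe sequence only, not CPython's code: `PerturbedProbe` and `probeSequence` are hypothetical names, and the hash is treated as an unsigned 64-bit value (an assumption made so the right shift eventually reaches zero).

```java
import java.util.ArrayList;
import java.util.List;

/** Generates the first few slots visited by perturbed probing (illustrative sketch). */
final class PerturbedProbe {
    private static final int PERTURB_SHIFT = 5; // shift amount described above

    /**
     * @param hash     full hash of the key
     * @param log2Size table size is 2^log2Size
     * @param count    how many probe positions to generate
     */
    static List<Integer> probeSequence(long hash, int log2Size, int count) {
        int mask = (1 << log2Size) - 1;
        long perturb = hash;
        long j = hash & mask; // initial slot uses only the low bits
        List<Integer> seq = new ArrayList<>();
        for (int k = 0; k < count; k++) {
            seq.add((int) j);
            perturb >>>= PERTURB_SHIFT;       // feed in higher bits of the hash
            j = (5 * j + 1 + perturb) & mask; // j = (5*j + 1 + perturb) mod 2^n
        }
        return seq;
    }

    public static void main(String[] args) {
        // With hash = 1 and 8 slots, perturb becomes 0 after the first shift.
        System.out.println(probeSequence(1L, 3, 8)); // [1, 6, 7, 4, 5, 2, 3, 0]
    }
}
```

Once `perturb` reaches zero, the recurrence reduces to `j = 5*j + 1 mod 2^n`, which visits every slot exactly once per cycle, so a free slot is always found if one exists.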

| Language | Strategy | Collision in bucket | Load factor |
|----------|----------|---------------------|-------------|
| **Java** | Chaining | LinkedList → RB-Tree | 0.75 |
| **Python** | Open Addressing | Perturbed probing | 0.67 |

## HashMap vs HashSet

The **hash table** (HashMap) is the core data structure. Everything discussed above - hashing, collision resolution, load factors - describes how to build a HashMap.

A **HashSet** is typically just a thin wrapper around HashMap:

```java
import java.util.HashMap;

class HashSet<T> {
    private HashMap<T, Object> map = new HashMap<>();
    private static final Object PRESENT = new Object(); // dummy value

    public boolean add(T key) {
        return map.put(key, PRESENT) == null;
    }

    public boolean contains(T key) {
        return map.containsKey(key);
    }
}
```

So when implementing a "HashSet" from scratch, you're really implementing a HashMap that ignores values:

| HashSet | HashMap |
|---------|---------|
| `bucket[i] = key` | `bucket[i] = Entry(key, value)` |
| `add(key)` | `put(key, value)` |
| `contains(key)` | `containsKey(key)` |

**Interview tip:** If asked to implement HashSet, you can mention it's typically backed by HashMap with a dummy value. The interesting work is in the hash table mechanics (collision resolution, resizing), not the Set vs Map distinction.

## Notes

1. **Hash function quality matters**: A poor hash function causes clustering, degrading `O(1)` to `O(n)`. Java's `String.hashCode()` uses: `s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]`.

2. **Prime table sizes**: Using a prime number of buckets helps distribute keys more uniformly with the division method.

3. **Immutable keys**: Keys should not change after insertion. If `hashCode()` changes, the element becomes "lost" - it's in the wrong bucket.

4. **equals/hashCode contract**: If `a.equals(b)` then `a.hashCode() == b.hashCode()`. Violating this breaks hash tables.

5. **Elastic Hashing (2021)**: Andrew Yao conjectured in 1985 that for open-addressing hash tables, you can't do better than uniform probing - the fuller the table, the worse performance gets (exponentially so). [Elastic hashing](https://joshtuddenham.dev/blog/hashmaps/) challenges this: it can fill a table to `(1-δ)` capacity (e.g., 99% full with `δ=0.01`) while achieving amortized `O(1)` expected probes and worst-case `O(log δ⁻¹)` expected probes. This is a significant theoretical advancement for open addressing.
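
The "lost element" problem from note 3 can be reproduced with `java.util.HashSet`. The sketch below is a small demonstration under stated assumptions: `MutableKeyDemo` and its nested `Name` class are hypothetical, written only so that `hashCode()` depends on a mutable field.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class MutableKeyDemo {
    /** Hypothetical key type whose hashCode() depends on a mutable field. */
    static final class Name {
        String value;

        Name(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Name && Objects.equals(value, ((Name) o).value);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(value); // consistent with equals, as note 4 requires
        }
    }

    /** Returns whether the set still finds the key after the key has been mutated. */
    static boolean foundAfterMutation() {
        Set<Name> set = new HashSet<>();
        Name key = new Name("alice");
        set.add(key);        // stored under the hash of "alice"
        key.value = "bob";   // hashCode() now differs from the one used at insertion
        return set.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation()); // false: the element is still in the set but unreachable
    }
}
```

The element is not removed; `size()` still reports 1. It simply cannot be found by lookup any more, which is why keys are best kept immutable.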

## Applications

| Use Case | Why Hash Table? |
|----------|-----------------|
| Database indexing | `O(1)` lookup by key |
| Caching (LRU, etc.) | Fast key-based retrieval |
| Counting frequencies | `O(1)` increment per element |
| Detecting duplicates | `O(1)` membership test |
| Symbol tables (compilers) | Fast variable/function lookup |

**Interview tip:** When you need `O(1)` lookup/insert/delete by key, hash table is usually the answer. Common patterns: two-sum (complement lookup), anagram grouping (canonical key), frequency counting.
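
As an example of the complement-lookup pattern, here is one common way to write two-sum with a hash map. The class and method names are chosen for this sketch; returning `null` when no pair exists is an arbitrary convention.

```java
import java.util.HashMap;
import java.util.Map;

final class TwoSum {
    /** Returns indices i < j with nums[i] + nums[j] == target, or null if no such pair exists. */
    static int[] twoSum(int[] nums, int target) {
        Map<Integer, Integer> indexByValue = new HashMap<>(); // value -> index where it was seen
        for (int j = 0; j < nums.length; j++) {
            // Expected O(1) lookup of the complement seen earlier.
            Integer i = indexByValue.get(target - nums[j]);
            if (i != null) {
                return new int[] {i, j};
            }
            indexByValue.put(nums[j], j);
        }
        return null;
    }

    public static void main(String[] args) {
        int[] result = twoSum(new int[] {2, 7, 11, 15}, 9);
        System.out.println(result[0] + ", " + result[1]); // 0, 1
    }
}
```

A single pass suffices because each element only needs to look backwards: if a pair exists, its second element finds the first already in the map.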

src/main/java/dataStructures/hashSet/chaining/HashSet.java

Lines changed: 5 additions & 3 deletions
@@ -88,7 +88,9 @@ public boolean isEmpty() {
      * @return the bucket to add the element to.
      */
     private int hashFunction(T element) {
-        return Math.abs(element.hashCode() % NUMBER_OF_BUCKETS);
+        // Use bitwise AND with 0x7FFFFFFF to clear sign bit instead of Math.abs()
+        // Math.abs(Integer.MIN_VALUE) returns Integer.MIN_VALUE (negative)
+        return (element.hashCode() & 0x7FFFFFFF) % NUMBER_OF_BUCKETS;
     }

     /**
@@ -155,8 +157,8 @@ public boolean remove(T element) {
     public List<T> toList() {
         List<T> outputList = new ArrayList<>();
         for (LinkedList<T> bucket : buckets) {
-            while (bucket.size() != 0) {
-                outputList.add(bucket.pop());
+            for (int i = 0; i < bucket.size(); i++) {
+                outputList.add(bucket.get(i).getValue());
             }
         }
         return outputList;
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
# HashSet (Chaining)

## Background

**Chaining** resolves hash collisions by storing all elements that hash to the same bucket in a linked list.

```
hash(key) → bucket index → append to list at bucket

Bucket 0: → [Alice] → [Dave] → null (both hash to 0)
Bucket 1: → [Bob] → null
Bucket 2: → [Carol] → [Eve] → null
Bucket 3: → null (empty)
```

See [parent README](../README.md) for comparison with open addressing.

## How It Works

### Add
1. Compute `bucket = hash(key) % m`
2. Search linked list at `buckets[bucket]` for duplicates
3. If not found, insert at front of list

### Contains
1. Compute bucket index
2. Linear search through linked list
3. Return true if found

### Remove
1. Compute bucket index
2. Search list for element
3. Remove node from linked list
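
The three operations above can be sketched compactly as follows. This is not the repository's `HashSet.java`: it assumes `java.util.LinkedList` in place of the project's own linked list, and the name `ChainedSet` and the constructor parameter `m` are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Minimal chaining set with a fixed number of buckets (illustrative sketch). */
final class ChainedSet<T> {
    private final List<LinkedList<T>> buckets = new ArrayList<>();

    ChainedSet(int m) {
        for (int i = 0; i < m; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    /** Step 1 of every operation: compute the bucket index and fetch its list. */
    private LinkedList<T> bucketFor(Object key) {
        return buckets.get((key.hashCode() & 0x7FFFFFFF) % buckets.size());
    }

    boolean add(T key) {
        LinkedList<T> bucket = bucketFor(key);
        if (bucket.contains(key)) return false; // linear search for duplicates
        bucket.addFirst(key);                   // insert at front of list
        return true;
    }

    boolean contains(T key) {
        return bucketFor(key).contains(key);    // linear search through the list
    }

    boolean remove(T key) {
        return bucketFor(key).remove(key);      // unlinks the node if present
    }
}
```

Each operation costs one hash computation plus a walk over a single bucket's list, which is where the `O(1 + α)` expected bound comes from.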

## Complexity Analysis

| Operation | Expected | Worst |
|-----------|----------|-------|
| `add()` | `O(1)` | `O(n)` |
| `contains()` | `O(1)` | `O(n)` |
| `remove()` | `O(1)` | `O(n)` |

**Expected time**: `O(1 + α)` where `α = n/m` is the load factor.

Under SUHA, each bucket has `α` elements on average, so list traversal is `O(α)`. With `m = Θ(n)`, we get `α = O(1)`.

**Worst case**: All n elements hash to the same bucket → `O(n)` list traversal.

## Notes

1. **No resizing required** (unlike open addressing): Lists can grow indefinitely. However, performance degrades as lists get longer, so resizing is still recommended.

2. **Our implementation**: Uses a fixed 256 buckets without resizing. For production use, resize once `α` exceeds a chosen threshold, such as the 0.75 recommended in the parent README.

3. **Java's approach**: Java HashMap starts with linked lists, then converts to Red-Black Trees when a bucket exceeds 8 elements. This bounds worst-case lookup to `O(log n)`.

4. **Memory overhead**: Each node requires extra pointer(s) for the linked list structure. This also hurts cache locality compared to open addressing.

5. **Deletion is simple**: Just remove the node from the linked list. No tombstones needed (unlike open addressing).

**Interview tip:** Chaining is simpler to implement correctly than open addressing. When asked to implement a hash table from scratch in an interview, chaining is often the safer choice.
