# Real-Time User Profile
This tutorial demonstrates how to build a real-time user profiling system using two core Apache Fluss features: the **Auto-Increment Column** and the **Aggregation Merge Engine**. You will learn how to automatically map high-cardinality string identifiers (like emails) to compact integer UIDs, and how to accumulate user click metrics directly in the storage layer, keeping the Flink job entirely stateless.
## How the System Works
### Core Concepts
- **Identity Mapping**: Incoming email strings are automatically mapped to compact `INT` UIDs using Fluss's auto-increment column; no manual ID management is required.
- **Storage-Level Aggregation**: Click counts are accumulated directly in the Fluss TabletServers via the Aggregation Merge Engine. The Flink job remains stateless and lightweight.
### Data Flow
1. **Ingestion**: Raw click events arrive with an email address and a click count.
2. **Mapping**: A Flink lookup join against `user_dict` resolves the email to a UID. If the email is new, the `insert-if-not-exists` hint instructs Fluss to generate a new UID automatically.
3. **Aggregation**: The resolved UID becomes the primary key in `user_profiles`. Each event's click count is summed at the storage layer via the Aggregation Merge Engine; no windowing or state in Flink is required.
## Environment Setup
## Step 2: Create the User Dictionary Table
Create the `user_dict` table to map email addresses to integer UIDs. The `auto-increment.fields` property instructs Fluss to automatically assign a unique `INT` UID for every new email it receives.
```sql
CREATE TABLE user_dict (
  -- Schema sketch: the column names are inferred from the surrounding text
  email STRING NOT NULL,
  uid INT,
  PRIMARY KEY (email) NOT ENFORCED
) WITH (
  'auto-increment.fields' = 'uid'
);
```
## Step 3: Create the Aggregated Profile Table
Create the `user_profiles` table using the **Aggregation Merge Engine**. Each user's UID is the primary key, and `total_clicks` accumulates their click activity directly at the storage layer via the `sum` aggregator.
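A definition along the following lines would match that description. This is a sketch only: the merge-engine and per-field aggregator property names (`table.merge-engine`, `fields.total_clicks.aggregate-function`) are illustrative assumptions, not confirmed Fluss syntax.

```sql
-- Sketch: property names below are assumed for illustration
CREATE TABLE user_profiles (
  uid INT NOT NULL,
  total_clicks BIGINT,
  PRIMARY KEY (uid) NOT ENFORCED
) WITH (
  'table.merge-engine' = 'aggregation',
  'fields.total_clicks.aggregate-function' = 'sum'
);
```

With this in place, every `INSERT` for an existing `uid` is merged into the stored row by summing `total_clicks`, rather than overwriting it.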
Create a temporary source table to simulate raw click events using the Faker connector.
:::note
Java Faker's `numberBetween(min, max)` treats `max` as exclusive. The expressions below are set to produce click counts of 1–10 and a pool of 100 distinct simulated email users.
:::
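A source table along these lines would produce events matching the note above. This sketch assumes the community `faker` connector for Flink; the table name `click_events` and the columns `email`, `clicks`, and `proc_time` are illustrative.

```sql
CREATE TEMPORARY TABLE click_events (
  email STRING,
  clicks INT,
  proc_time AS PROCTIME()  -- processing-time attribute, needed for the lookup join
) WITH (
  'connector' = 'faker',
  -- bounded pool of 100 distinct simulated users
  'fields.email.expression' = 'user#{number.numberBetween ''0'',''100''}@example.com',
  -- numberBetween treats max as exclusive, so ''1'',''11'' yields clicks of 1-10
  'fields.clicks.expression' = '#{number.numberBetween ''1'',''11''}'
);
```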
Now run the pipeline. The `lookup.insert-if-not-exists` hint ensures that if an email is not found in `user_dict`, Fluss generates a new `uid` for it automatically. The resolved `uid` becomes the primary key of `user_profiles`, making the dictionary mapping the central link between the two tables.
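A pipeline statement along these lines wires the pieces together (a sketch: the source table name `click_events`, its columns, and the exact placement of the hint are assumptions for illustration):

```sql
-- Sketch: source table and column names are illustrative
INSERT INTO user_profiles
SELECT d.uid, c.clicks AS total_clicks
FROM click_events AS c
JOIN user_dict /*+ OPTIONS('lookup.insert-if-not-exists' = 'true') */
  FOR SYSTEM_TIME AS OF c.proc_time AS d
  ON c.email = d.email;
```

The `FOR SYSTEM_TIME AS OF` clause makes this a Flink lookup join: each incoming event probes `user_dict` at processing time, and the hint tells Fluss to mint a new `uid` when the probe misses.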
Open a **second terminal**, re-run the export commands, and launch another SQL Client session to query results while the pipeline runs.
```shell
cd fluss-user-profile
```

```sql
CREATE CATALOG fluss_catalog WITH (
  'type' = 'fluss',
'bootstrap.servers' = 'coordinator-server:9123'
);
```
```sql
USE CATALOG fluss_catalog;
```
```sql
SET 'sql-client.execution.result-mode' = 'tableau';
```
```sql
SELECT uid, total_clicks FROM user_profiles;
```
You should see rows appearing for each new user, with `total_clicks` accumulating in real time as more events arrive for the same email.
To verify that the email-to-UID mapping is working correctly:
```sql
SELECT * FROM user_dict LIMIT 10;
```
Each email should have a unique compact `INT` uid automatically assigned by Fluss.
## Clean Up
Exit the SQL Client by typing `exit;`, then stop the Docker containers.

```shell
docker compose down -v
```
## Architectural Benefits
- **Stateless Flink Jobs:** Offloading both the identity dictionary and the click aggregation to Fluss makes the Flink job lightweight, with fast checkpoints and minimal recovery time.
- **Compact Storage:** Using auto-incremented `INT` UIDs instead of raw email strings reduces memory and storage footprint significantly.
- **Exactly-Once Accuracy:** The **Undo Recovery** mechanism in the Fluss Flink connector ensures that replayed data during failovers does not result in double-counting.
## What's Next?
This quickstart demonstrates the core mechanics. For a deeper dive into real-time user profiling with bitmap-based unique visitor counting using the `rbm64` aggregator, see the [Real-Time Profiles blog post](/blog/realtime-profiles-fluss).