Commit e40c7e2

Pro Engels authored and committed
vault backup: 2026-02-15 20:27:01
1 parent e7be63c commit e40c7e2

5 files changed

Lines changed: 41 additions & 22 deletions

_posts/FireShot Capture 020 - OpenClaw Control - [127.0.0.1].webp renamed to _asset/FireShot Capture 020 - OpenClaw Control - [127.0.0.1].webp

File renamed without changes.
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+---
+title: Data Engineering for Large Models
+tags:
+- LLM
+- dataEngineering
+date: 2026-02-14
+toc: true
+toc_sticky: true
+---
+
+# Data Engineering for Large Models: Architecture, Algorithms & Projects
+
+## Introduction
+
+[](https://github.com/datascale-ai/data_engineering_book/blob/main/README_en.md#introduction)
+
+> _"Data is the new oil, but only if you know how to refine it."_
+
+In the era of large models, **data quality determines the upper bound of model performance**. Yet systematic resources on LLM data engineering remain extremely scarce; most teams are still learning by trial and error.
+
+This book is designed to fill that gap. We systematically cover the complete technical stack from **pre-training data cleaning** to **multimodal alignment**, from **RAG retrieval augmentation** to **synthetic data generation**, including:
+
+- 🧹 **Pre-training Data Engineering**: Extracting high-quality corpora from massive noisy data sources like Common Crawl
+- 🖼️ **Multimodal Data Processing**: Collection, cleaning, and alignment of image-text pairs, video, and audio data
+- 🎯 **Alignment Data Construction**: Automated generation of SFT instruction data, RLHF preference data, and CoT reasoning data
+- 🔍 **RAG Data Pipeline**: Enterprise-grade document parsing, semantic chunking, and multimodal retrieval
+
+Beyond in-depth theoretical explanations, the book includes **5 end-to-end capstone projects** with runnable code and detailed architecture designs for hands-on learning.
+
+**Read Online**: [https://datascale-ai.github.io/data_engineering_book/en/](https://datascale-ai.github.io/data_engineering_book/en/)
+
+
+
+![](../_asset/2026-02-14Data%20Engineering%20for%20Large%20Models-1771026584018.webp)
+
+
+## Links
+
+https://github.com/datascale-ai/data_engineering_book/blob/main/README_en.md
+

_posts/2026-02-14-openclaw-log-example.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 
 ## Log - Ringeltaube
 
-![](./FireShot%20Capture%20020%20-%20OpenClaw%20Control%20-%20[127.0.0.1].webp)
+![](../_asset/FireShot%20Capture%20020%20-%20OpenClaw%20Control%20-%20[127.0.0.1].webp)
 ## Log
 
 ![](../_asset/2026-02-14-openclaw-log-example-1771067525473.webp)

_posts/2026-02-14Data Engineering for Large Models.md

Lines changed: 0 additions & 21 deletions
This file was deleted.

_posts/image.webp

Whitespace-only changes.
