<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Data Pipeline - EggHatch-AI Tutorial</title>
    <link rel="stylesheet" href="styles.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css">
</head>
<body>
    <div class="container">
        <aside class="sidebar">
            <div class="sidebar-header">
                <h2>EggHatch-AI</h2>
                <p>Tutorial</p>
            </div>
            <nav class="sidebar-nav">
                <ul>
                    <li><a href="index.html"><i class="fas fa-home"></i> Home</a></li>
                    <li><a href="01_user_interface.html"><i class="fas fa-desktop"></i> User Interface</a></li>
                    <li><a href="02_master_agent.html"><i class="fas fa-brain"></i> Master Agent</a></li>
                    <li><a href="03_llm_client.html"><i class="fas fa-comment-dots"></i> LLM Client</a></li>
                    <li class="active"><a href="04_data_pipeline.html"><i class="fas fa-database"></i> Data Pipeline</a></li>
                    <li><a href="05_sentiment_analysis.html"><i class="fas fa-smile"></i> Sentiment Analysis</a></li>
                    <li><a href="06_trend_analysis.html"><i class="fas fa-chart-line"></i> Trend Analysis</a></li>
                    <li><a href="07_agent_state.html"><i class="fas fa-toggle-on"></i> Agent State</a></li>
                    <li><a href="08_prompts.html"><i class="fas fa-quote-left"></i> Prompts</a></li>
                </ul>
            </nav>
            <div class="sidebar-footer">
                <a href="https://github.com/AustinZ21/EggHatch-AI" target="_blank"><i class="fab fa-github"></i> GitHub Repository</a>
            </div>
        </aside>
        <main class="content">
            <header>
                <h1>Chapter 4: Data Pipeline</h1>
            </header>
            <div class="content-body">
                <p>Welcome back to the EggHatch AI tutorial! In the last chapter, <a href="03_llm_client.html">LLM Client</a>, we learned how our system talks to the powerful AI model itself. The LLM Client is the messenger that sends instructions (prompts) and gets text back.</p>

                <p>But what does the AI <em>talk about</em>? For EggHatch AI to give you helpful advice on PC parts and tech, it needs information about products – like laptop specifications, prices, and what real customers think about them. This is where the <strong>Data Pipeline</strong> comes in.</p>

                <h2>What is the Data Pipeline?</h2>

                <div class="info-box">
                    <p>Think of the Data Pipeline as the <strong>prep station in a busy kitchen</strong>. Raw ingredients (our raw data, such as messy spreadsheets or text files of reviews) arrive here. The prep station washes, chops, measures, and prepares everything so it's ready for the chefs (the other agents) to use in their dishes (the final recommendations and analysis).</p>
                </div>

                <p>Its main job is to take raw, potentially messy data and transform it into a clean, structured format that the rest of the EggHatch AI system can easily understand and work with.</p>

                <h2>Why Do We Need a Data Pipeline?</h2>

                <p>Raw data is rarely perfect. It can have:</p>

                <ul>
                    <li><strong>Inconsistencies:</strong> Prices might be formatted differently (for example, <code>$1,299.99</code> versus <code>1299.99</code>), and product names might be misspelled.</li>
                    <li><strong>Missing information:</strong> Some products might be missing certain specifications.</li>
                    <li><strong>Different formats:</strong> Data might come from different sources with different structures.</li>
                    <li><strong>Irrelevant details:</strong> There might be information we don't need for our analysis.</li>
                </ul>

                <p>The Data Pipeline solves these problems by:</p>

                <div class="component-grid">
                    <div class="component-card">
                        <i class="fas fa-broom"></i>
                        <h3>Cleaning</h3>
                        <p>Fixing errors, standardizing formats, and handling missing values</p>
                    </div>
                    <div class="component-card">
                        <i class="fas fa-filter"></i>
                        <h3>Filtering</h3>
                        <p>Removing irrelevant information and focusing on what matters</p>
                    </div>
                    <div class="component-card">
                        <i class="fas fa-random"></i>
                        <h3>Transforming</h3>
                        <p>Converting data into the right structure for analysis</p>
                    </div>
                    <div class="component-card">
                        <i class="fas fa-box-open"></i>
                        <h3>Packaging</h3>
                        <p>Organizing data in a consistent format for other components</p>
                    </div>
                </div>
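
                <p>The four stages above can be sketched on a handful of rows using only the standard library. This is a minimal illustration, not the project's implementation: the function names <code>clean_price</code> and <code>prepare</code>, the sample field names, and the rule that a row needs both a name and a parseable price are all assumptions made for the example.</p>

```python
from typing import Dict, List, Optional


def clean_price(raw: object) -> Optional[float]:
    """Cleaning: turn a value like "$1,299.99" into 1299.99, or None if unparseable."""
    text = str(raw).replace("$", "").replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        return None


def prepare(rows: List[Dict]) -> List[Dict]:
    """Run hypothetical clean / filter / transform / package steps over raw rows."""
    prepared = []
    for raw_row in rows:
        # Transforming: normalize column names to lower snake_case
        row = {key.strip().lower().replace(" ", "_"): value
               for key, value in raw_row.items()}

        # Cleaning: standardize the price into a float (or None)
        row["price"] = clean_price(row.get("price"))

        # Filtering: skip rows that lack a name or a usable price (assumed rule)
        if not row.get("name") or row["price"] is None:
            continue

        # Packaging: emit one consistent shape for downstream components
        prepared.append({
            "name": row["name"],
            "price": row["price"],
            "rating": row.get("rating"),
        })
    return prepared


if __name__ == "__main__":
    sample = [
        {"Name": "Laptop A", "Price": "$1,299.99", "Rating": 4.5},
        {"Name": "Laptop B", "Price": "call for price"},
    ]
    print(prepare(sample))
```

                <p>Only the first sample row survives, because the second has no parseable price. Whether such rows should be dropped or kept with a placeholder value is a design choice that depends on how downstream agents use the price.</p>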

                <h2>How the Data Pipeline Works in EggHatch AI</h2>

                <p>The Data Pipeline in EggHatch AI follows a series of steps to process data:</p>

                <div class="workflow-diagram">
                    <img src="data_pipeline_workflow.svg" alt="Data Pipeline Workflow" onerror="this.onerror=null; this.src='https://via.placeholder.com/800x250?text=Data+Pipeline+Workflow'">
                </div>

                <h2>The Data Pipeline Implementation</h2>

                <p>Let's look at a simplified version of the Data Pipeline code:</p>

                <div class="code-block">
                    <pre><code>
import json
import os

import pandas as pd


class DataPipeline:
    def __init__(self):
        # Configure data sources
        self.data_sources = {
            "products": "data/csv/new_egg_gaming_laptops.csv",
            "reviews": "data/reviews/"
        }

        # Set up data caches
        self.product_cache = []   # list of product dicts, one per CSV row
        self.review_cache = {}    # product_id -&gt; list of cleaned reviews

        # Load initial data
        self._load_product_data()

    def _load_product_data(self):
        """Load product data from CSV file"""
        try:
            # Read CSV file
            df = pd.read_csv(self.data_sources["products"])

            # Clean and transform data
            df = self._clean_product_data(df)

            # Convert to a list of dictionaries (one per product) for easy access
            self.product_cache = df.to_dict(orient='records')

            print(f"Loaded {len(self.product_cache)} products")
        except Exception as e:
            print(f"Error loading product data: {str(e)}")

    def _clean_product_data(self, df):
        """Clean and standardize product data"""
        # Standardize column names
        df.columns = [col.lower().replace(' ', '_') for col in df.columns]

        # Extract and standardize price: strip '$' and ',' as literal characters
        # (regex=False keeps '$' from being treated as a regex end-of-string anchor)
        df['price'] = (
            df['price'].astype(str)
            .str.replace('$', '', regex=False)
            .str.replace(',', '', regex=False)
        )
        df['price'] = pd.to_numeric(df['price'], errors='coerce')

        # Handle missing values (done after the numeric conversion so that
        # prices that could not be parsed are filled as well)
        df = df.fillna({
            'price': 0,
            'rating': 0,
            'num_reviews': 0
        })

        # Other cleaning steps...

        return df

    def get_product_by_id(self, product_id):
        """Retrieve a product by its ID"""
        for product in self.product_cache:
            if str(product.get('id')) == str(product_id):
                return product
        return None

    def get_products_by_filter(self, filters=None):
        """Get products matching certain criteria"""
        if not filters:
            return self.product_cache

        filtered_products = []
        for product in self.product_cache:
            matches = True
            for key, value in filters.items():
                if key.endswith('_min') or key.endswith('_max'):
                    # Range filter: 'price_min' / 'price_max' apply to the 'price' field
                    field_value = product.get(key[:-4], 0)
                    if key.endswith('_min') and field_value &lt; value:
                        matches = False
                    elif key.endswith('_max') and field_value &gt; value:
                        matches = False
                elif key in product and product.get(key) != value:
                    # Exact-match filter (keys the product does not have are ignored)
                    matches = False

                if not matches:
                    break

            if matches:
                filtered_products.append(product)

        return filtered_products

    def get_reviews_for_product(self, product_id):
        """Get all reviews for a specific product"""
        # Check cache first
        if product_id in self.review_cache:
            return self.review_cache[product_id]

        # Load reviews from file
        review_file = os.path.join(self.data_sources["reviews"], f"laptop_{product_id}_reviews.json")

        if not os.path.exists(review_file):
            return []

        try:
            with open(review_file, 'r') as f:
                reviews = json.load(f)

            # Clean and standardize reviews
            # (_clean_reviews is not shown in this simplified listing)
            cleaned_reviews = self._clean_reviews(reviews)

            # Cache for future use
            self.review_cache[product_id] = cleaned_reviews

            return cleaned_reviews
        except Exception as e:
            print(f"Error loading reviews for product {product_id}: {str(e)}")
            return []
                    </code></pre>
                </div>
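
                <p>The filter convention used by <code>get_products_by_filter</code>, where a <code>_min</code> or <code>_max</code> suffix on a field name marks a range bound and any other key is an exact match, is easier to see in isolation. The helper below is a hypothetical standalone sketch named <code>matches_filters</code>. It is not part of the class above, and it assumes that a product missing the field fails a range filter.</p>

```python
def matches_filters(product: dict, filters: dict) -> bool:
    """Return True if the product satisfies every filter (hypothetical helper)."""
    for key, wanted in filters.items():
        if key.endswith("_min") or key.endswith("_max"):
            # "price_min" and "price_max" both refer to the "price" field
            field = key[:-4]
            actual = product.get(field)
            if actual is None:
                return False  # assumption: a missing field cannot satisfy a range
            if key.endswith("_min") and actual < wanted:
                return False
            if key.endswith("_max") and actual > wanted:
                return False
        elif product.get(key) != wanted:
            # Any other key must match exactly
            return False
    return True


if __name__ == "__main__":
    # Made-up sample data for illustration only
    laptops = [
        {"id": 1, "brand": "ASUS", "price": 999.0},
        {"id": 2, "brand": "MSI", "price": 1499.0},
    ]
    budget = [p for p in laptops if matches_filters(p, {"price_max": 1200})]
    print(budget)
```

                <p>Stripping the four-character suffix with <code>key[:-4]</code> works because <code>_min</code> and <code>_max</code> have the same length. A convention with suffixes of different lengths would need a different way to recover the field name.</p>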

                <h2>Key Features of the Data Pipeline</h2>

                <div class="principles-grid">
                    <div class="principle-card">
                        <i class="fas fa-sync"></i>
                        <h3>Caching</h3>
                        <p>Stores processed data in memory for quick access</p>
                    </div>
                    <div class="principle-card">
                        <i class="fas fa-search"></i>
                        <h3>Filtering</h3>
                        <p>Provides flexible ways to search for specific products</p>
                    </div>
                    <div class="principle-card">
                        <i class="fas fa-link"></i>
                        <h3>Relationships</h3>
                        <p>Connects products with their reviews and other related data</p>
                    </div>
                    <div class="principle-card">
                        <i class="fas fa-shield-alt"></i>
                        <h3>Error Handling</h3>
                        <p>Gracefully handles missing files or corrupted data</p>
                    </div>
                </div>
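
                <p>Caching and error handling can be combined in a few lines. The sketch below is an illustration rather than the project's code: the class name <code>ReviewStore</code> and the <code>disk_reads</code> counter are invented for the example, while the <code>laptop_{product_id}_reviews.json</code> file name pattern follows the listing earlier in this chapter.</p>

```python
import json
import os


class ReviewStore:
    """Hypothetical in-memory cache in front of per-product review files."""

    def __init__(self, review_dir: str):
        self.review_dir = review_dir
        self._cache = {}
        self.disk_reads = 0  # exists only to make the cache's effect observable

    def get(self, product_id) -> list:
        key = str(product_id)  # normalize so 42 and "42" share one cache entry
        if key in self._cache:
            return self._cache[key]

        path = os.path.join(self.review_dir, f"laptop_{key}_reviews.json")
        try:
            with open(path, "r", encoding="utf-8") as f:
                self.disk_reads += 1
                reviews = json.load(f)
        except FileNotFoundError:
            # Missing file: treat as "no reviews" rather than an error
            return []
        except (json.JSONDecodeError, OSError) as exc:
            # Corrupted or unreadable file: report it and fall back to no reviews
            print(f"Error loading reviews for product {key}: {exc}")
            return []

        self._cache[key] = reviews
        return reviews
```

                <p>In this sketch only successful loads are cached, so a file that is missing or corrupted now is checked again on the next call. Caching the empty result as well would save repeated disk access at the cost of not noticing files that appear later.</p>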

                <h2>Data Sources in EggHatch AI</h2>

                <p>EggHatch AI uses several data sources to provide comprehensive information:</p>

                <ul>
                    <li><strong>Product Specifications:</strong> Details about laptops including processor, memory, storage, graphics, etc.</li>
                    <li><strong>Pricing Information:</strong> Current and historical prices from retailers</li>
                    <li><strong>Customer Reviews:</strong> Text reviews and ratings from actual users</li>
                    <li><strong>Technical Benchmarks:</strong> Performance measurements for different tasks</li>
                </ul>

                <p>All of this data flows through the Data Pipeline, getting cleaned and organized before being used by the specialized agents for analysis.</p>

                <h2>Next Steps</h2>

                <p>Now that you understand how the Data Pipeline prepares information for analysis, let's move on to <a href="05_sentiment_analysis.html">Chapter 5: Sentiment Analysis Agent</a>, where we'll explore how EggHatch AI understands the emotions and opinions in customer reviews.</p>
            </div>
            <footer>
                <p>Generated with <a href="https://github.com/The-Pocket/Tutorial-Codebase-Knowledge">AI Codebase Knowledge Builder</a></p>
            </footer>
        </main>
    </div>
    <script src="script.js"></script>
</body>
</html>