| 1 | +# Feature Engineering Tools Comparison Benchmark |
| 2 | +## Overview |
| 3 | +This benchmark compares FeatCopilot with two other feature engineering libraries (Featuretools and autofeat) and a no-feature-engineering baseline across 10 datasets (4 regression, 6 classification), using FLAML AutoML for model training. |
| 4 | + |
| 5 | +### Tools Compared |
| 6 | +| Tool | Description | |
| 7 | +|------|-------------| |
| 8 | +| baseline | No feature engineering (raw features only) | |
| 9 | +| featcopilot | FeatCopilot - Multi-engine auto feature engineering | |
| 10 | +| featuretools | Featuretools - Deep Feature Synthesis | |
| 11 | +| autofeat | autofeat - Automatic feature generation with L1 selection | |
| 12 | + |
| 13 | +## 1. Win Rate (Best Score Per Dataset) |
| 14 | +| Tool | Wins | Datasets Tested | Win Rate | |
| 15 | +|------|------|-----------------|----------| |
| 16 | +| featcopilot | 8 | 10 | 80% 🏆 | |
| 17 | +| autofeat | 2 | 5 | 40% | |
| 18 | +| featuretools | 0 | 10 | 0% | |
| 19 | + |
| 20 | +## 2. Average Improvement Over Baseline |
| 21 | +| Tool | Avg Improvement | Min Improvement | Max Improvement | Positive % | |
| 22 | +|------|-----------------|-----------------|-----------------|------------| |
| 23 | +| featcopilot | +1.89% | -0.02% | +5.78% | 80% | |
| 24 | +| autofeat | +1.46% | -0.54% | +4.38% | 60% | |
| 25 | +| featuretools | -2.71% | -16.72% | +2.72% | 20% | |
| 26 | + |
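The report does not state how "improvement" is defined. The sketch below assumes it is the relative change in score versus the baseline, `(score - baseline) / baseline`, expressed in percent. Under that assumption, the featcopilot row can be recomputed from the scores in the Detailed Results tables. The variable and function names are illustrative, not taken from the benchmark code.

```python
# Scores copied from the Detailed Results tables (R² or accuracy per dataset).
baseline = {
    "complex_regression": 0.8691, "polynomial_regression": 0.9026,
    "xor_classification": 0.8480, "complex_classification": 0.7350,
    "interaction_classification": 0.7925, "titanic": 0.8603,
    "house_prices": 0.9958, "credit_risk": 0.8750,
    "bike_sharing": 0.9795, "customer_churn": 0.7325,
}
featcopilot = {
    "complex_regression": 0.8825, "polynomial_regression": 0.9342,
    "xor_classification": 0.8540, "complex_classification": 0.7775,
    "interaction_classification": 0.8100, "titanic": 0.8603,
    "house_prices": 0.9972, "credit_risk": 0.8925,
    "bike_sharing": 0.9793, "customer_churn": 0.7550,
}


def relative_improvement(score: float, base: float) -> float:
    """Percent change of `score` relative to `base` (assumed definition)."""
    return (score - base) / base * 100.0


improvements = [
    relative_improvement(featcopilot[name], baseline[name]) for name in baseline
]

avg_improvement = sum(improvements) / len(improvements)
min_improvement = min(improvements)
max_improvement = max(improvements)
# Share of datasets where the tool strictly beats the baseline.
positive_pct = 100.0 * sum(1 for x in improvements if x > 0) / len(improvements)

print(f"avg={avg_improvement:+.2f}%  min={min_improvement:+.2f}%  "
      f"max={max_improvement:+.2f}%  positive={positive_pct:.0f}%")
```

With these inputs the assumed definition matches the featcopilot row above (average, minimum, maximum, and positive share), which suggests, but does not prove, that the benchmark uses the same formula.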
| 27 | +## 3. Feature Engineering Speed |
| 28 | +| Tool | Avg FE Time | Median FE Time | Speedup vs Slowest | |
| 29 | +|------|-------------|----------------|--------------------| |
| 30 | +| featuretools | 0.11s ⚡ | 0.11s | 433x | |
| 31 | +| featcopilot | 1.90s | 1.81s | 25x | |
| 32 | +| autofeat | 48.09s | 40.52s | 1x | |
| 33 | + |
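The speed table can be recomputed from the per-dataset FE times in the Detailed Results tables. The sketch below assumes two things the report does not state explicitly: autofeat's average and median cover only its five successful runs (timed-out runs are excluded), and "Speedup vs Slowest" is the slowest tool's average time divided by each tool's average time, using unrounded averages.

```python
from statistics import mean, median

# FE times in seconds, copied from the Detailed Results tables.
# autofeat lists only its successful runs (assumption: timeouts are excluded).
fe_times = {
    "featcopilot": [3.17, 2.90, 2.27, 1.69, 1.69, 0.56, 1.98, 1.44, 1.94, 1.35],
    "featuretools": [0.10, 0.09, 0.13, 0.10, 0.11, 0.11, 0.15, 0.09, 0.14, 0.09],
    "autofeat": [40.52, 27.63, 88.36, 54.83, 29.09],
}

avg_time = {tool: mean(times) for tool, times in fe_times.items()}
median_time = {tool: median(times) for tool, times in fe_times.items()}

# Speedup is measured against the tool with the largest average FE time.
slowest = max(avg_time.values())
speedup = {tool: slowest / avg for tool, avg in avg_time.items()}

for tool in sorted(avg_time, key=avg_time.get):
    print(f"{tool:13s} avg={avg_time[tool]:.2f}s  "
          f"median={median_time[tool]:.2f}s  speedup={speedup[tool]:.0f}x")
```

Using the unrounded averages matters here: dividing the rounded values (48.09 / 0.11) gives roughly 437x, whereas the unrounded averages give the 433x shown in the table.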
| 34 | +## 4. Dataset Coverage |
| 35 | +| Tool | Successful | Errored | Timed Out | Coverage | |
| 36 | +|------|-----------|---------|-----------|----------| |
| 37 | +| featcopilot | 10 | 0 | 0 | 100% 🏆 | |
| 38 | +| featuretools | 10 | 0 | 0 | 100% 🏆 | |
| 39 | +| autofeat | 5 | 5 | 0 | 50% | |
| 40 | + |
| 41 | +## 5. Composite Score (Overall Ranking) |
| 42 | +The composite score is a weighted combination of accuracy improvement (40%), win rate (30%), speed (15%), and coverage (15%). |
| 43 | + |
| 44 | +| Rank | Tool | Accuracy (40%) | Win Rate (30%) | Speed (15%) | Coverage (15%) | Composite | |
| 45 | +|------|------|----------------|----------------|-------------|----------------|----------| |
| 46 | +| 🥇 1 | **featcopilot** | +1.89% | 8/10 | 1.9s | 10/10 | **0.606** | |
| 47 | +| 🥈 2 | **featuretools** | -2.71% | 0/10 | 0.1s | 10/10 | **0.397** | |
| 48 | +| 🥉 3 | **autofeat** | +1.46% | 2/5 | 48.1s | 5/10 | **0.351** | |
| 49 | + |
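The report gives the weights but not how each component is normalized before weighting, so the composite values above cannot be reproduced from the table alone. The following is only a sketch of the weighted sum itself, assuming every component has already been scaled to the range [0, 1]. The `composite_score` function and the example inputs are hypothetical.

```python
# Weights as stated in the report.
WEIGHTS = {"accuracy": 0.40, "win_rate": 0.30, "speed": 0.15, "coverage": 0.15}


def composite_score(components: dict) -> float:
    """Weighted sum of component scores.

    Each component is assumed to be pre-normalized to [0, 1]; the
    normalization used by the benchmark is not given in the report.
    """
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical normalized inputs, for illustration only.
example = composite_score(
    {"accuracy": 0.5, "win_rate": 0.8, "speed": 0.5, "coverage": 1.0}
)
print(f"composite = {example:.3f}")
```

Because the weights sum to 1, a tool that scores 1.0 on every normalized component would receive a composite score of 1.0.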
| 50 | +## Detailed Results |
| 51 | + |
| 52 | +### complex_regression |
| 53 | +**Task**: regression |
| 54 | + |
| 55 | +| Tool | R² Score | Features | FE Time | Status | |
| 56 | +|------|----------|----------|---------|--------| |
| 57 | +| baseline | 0.8691 | 15.0 | 0.00s | ✅ | |
| 58 | +| featcopilot | 0.8825 | 20.0 | 3.17s | ✅ | |
| 59 | +| featuretools | 0.8490 | 100.0 | 0.10s | ✅ | |
| 60 | +| autofeat | **0.8988** 🏆 | 31.0 | 40.52s | ✅ | |
| 61 | + |
| 62 | +### polynomial_regression |
| 63 | +**Task**: regression |
| 64 | + |
| 65 | +| Tool | R² Score | Features | FE Time | Status | |
| 66 | +|------|----------|----------|---------|--------| |
| 67 | +| baseline | 0.9026 | 12.0 | 0.00s | ✅ | |
| 68 | +| featcopilot | 0.9342 | 19.0 | 2.90s | ✅ | |
| 69 | +| featuretools | 0.8843 | 100.0 | 0.09s | ✅ | |
| 70 | +| autofeat | **0.9421** 🏆 | 28.0 | 27.63s | ✅ | |
| 71 | + |
| 72 | +### xor_classification |
| 73 | +**Task**: classification |
| 74 | + |
| 75 | +| Tool | Accuracy | Features | FE Time | Status | |
| 76 | +|------|----------|----------|---------|--------| |
| 77 | +| baseline | 0.8480 | 20.0 | 0.00s | ✅ | |
| 78 | +| featcopilot | **0.8540** 🏆 | 24.0 | 2.27s | ✅ | |
| 79 | +| featuretools | 0.8140 | 100.0 | 0.13s | ✅ | |
| 80 | +| autofeat | nan | nan | 1290.00s | ❌ FE timeout (>120s) | |
| 81 | + |
| 82 | +### complex_classification |
| 83 | +**Task**: classification |
| 84 | + |
| 85 | +| Tool | Accuracy | Features | FE Time | Status | |
| 86 | +|------|----------|----------|---------|--------| |
| 87 | +| baseline | 0.7350 | 15.0 | 0.00s | ✅ | |
| 88 | +| featcopilot | **0.7775** 🏆 | 21.0 | 1.69s | ✅ | |
| 89 | +| featuretools | 0.7550 | 100.0 | 0.10s | ✅ | |
| 90 | +| autofeat | nan | nan | 624.00s | ❌ FE timeout (>120s) | |
| 91 | + |
| 92 | +### interaction_classification |
| 93 | +**Task**: classification |
| 94 | + |
| 95 | +| Tool | Accuracy | Features | FE Time | Status | |
| 96 | +|------|----------|----------|---------|--------| |
| 97 | +| baseline | 0.7925 | 12.0 | 0.00s | ✅ | |
| 98 | +| featcopilot | **0.8100** 🏆 | 17.0 | 1.69s | ✅ | |
| 99 | +| featuretools | 0.7725 | 100.0 | 0.11s | ✅ | |
| 100 | +| autofeat | nan | nan | 360.00s | ❌ FE timeout (>120s) | |
| 101 | + |
| 102 | +### titanic |
| 103 | +**Task**: classification |
| 104 | + |
| 105 | +| Tool | Accuracy | Features | FE Time | Status | |
| 106 | +|------|----------|----------|---------|--------| |
| 107 | +| baseline | 0.8603 | 7.0 | 0.00s | ✅ | |
| 108 | +| featcopilot | **0.8603** 🏆 | 10.0 | 0.56s | ✅ | |
| 109 | +| featuretools | **0.8603** 🏆 | 91.0 | 0.11s | ✅ | |
| 110 | +| autofeat | **0.8603** 🏆 | 17.0 | 88.36s | ✅ | |
| 111 | + |
| 112 | +### house_prices |
| 113 | +**Task**: regression |
| 114 | + |
| 115 | +| Tool | R² Score | Features | FE Time | Status | |
| 116 | +|------|----------|----------|---------|--------| |
| 117 | +| baseline | 0.9958 | 14.0 | 0.00s | ✅ | |
| 118 | +| featcopilot | **0.9972** 🏆 | 16.0 | 1.98s | ✅ | |
| 119 | +| featuretools | 0.9964 | 100.0 | 0.15s | ✅ | |
| 120 | +| autofeat | 0.9965 | 37.0 | 54.83s | ✅ | |
| 121 | + |
| 122 | +### credit_risk |
| 123 | +**Task**: classification |
| 124 | + |
| 125 | +| Tool | Accuracy | Features | FE Time | Status | |
| 126 | +|------|----------|----------|---------|--------| |
| 127 | +| baseline | 0.8750 | 10.0 | 0.00s | ✅ | |
| 128 | +| featcopilot | **0.8925** 🏆 | 16.0 | 1.44s | ✅ | |
| 129 | +| featuretools | 0.8575 | 100.0 | 0.09s | ✅ | |
| 130 | +| autofeat | nan | nan | 411.00s | ❌ FE timeout (>120s) | |
| 131 | + |
| 132 | +### bike_sharing |
| 133 | +**Task**: regression |
| 134 | + |
| 135 | +| Tool | R² Score | Features | FE Time | Status | |
| 136 | +|------|----------|----------|---------|--------| |
| 137 | +| baseline | 0.9795 | 10.0 | 0.00s | ✅ | |
| 138 | +| featcopilot | **0.9793** 🏆 | 12.0 | 1.94s | ✅ | |
| 139 | +| featuretools | 0.9770 | 100.0 | 0.14s | ✅ | |
| 140 | +| autofeat | 0.9742 | 40.0 | 29.09s | ✅ | |
| 141 | + |
| 142 | +### customer_churn |
| 143 | +**Task**: classification |
| 144 | + |
| 145 | +| Tool | Accuracy | Features | FE Time | Status | |
| 146 | +|------|----------|----------|---------|--------| |
| 147 | +| baseline | 0.7325 | 10.0 | 0.00s | ✅ | |
| 148 | +| featcopilot | **0.7550** 🏆 | 12.0 | 1.35s | ✅ | |
| 149 | +| featuretools | 0.6100 | 100.0 | 0.09s | ✅ | |
| 150 | +| autofeat | nan | nan | 300.00s | ❌ FE timeout (>120s) | |
| 151 | + |
| 152 | +## Conclusion |
| 153 | +**Overall Winner: featcopilot** (composite score: 0.606) |
| 154 | + |
| 155 | +FeatCopilot achieves the best composite score, with the highest win rate (80%), the highest average improvement over baseline (+1.89%), and full dataset coverage (10/10, tied with featuretools). |