liganega · liganega · commit 6db148047c8f · 2026-03-16T00:07:35.000+09:00
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,5 @@
 # MyST build outputs
 _build
+
+# Ignore saved models
+*.pkl
diff --git a/end2end_ml_project.ipynb b/end2end_ml_project.ipynb
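@@ -2676,7 +2676,8 @@
     "The next step is to estimate how the model will perform once deployed,\n",
     "and the test set is used for this.\n",
     "\n",
-    "The code below computes the final model's RMSE on the test set.\n",
+    "The code below computes the test-set RMSE\n",
+    "of the best model found by randomized search.\n",
     "Note that the test set, too, is first split into an input dataset and a target set before being used for prediction."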
    ]
   },
diff --git a/notebooks/code-end2end_ml_project.ipynb b/notebooks/code-end2end_ml_project.ipynb
@@ -27554,9 +27554,16 @@
     "## Using and Evaluating the Best Model"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The final evaluation uses the best model found by randomized search."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 133,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -27570,16 +27577,28 @@
     "### Using the Test Set"
    ]
   },
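+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Suppose we are now satisfied enough with the trained model's performance\n",
+    "that no further training is needed.\n",
+    "The next step is to estimate how the model will perform once deployed,\n",
+    "and the test set is used for this.\n",
+    "Note that the test set must also first be split into an input dataset and a target set,\n",
+    "and that the model itself already includes the preprocessing steps."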
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 145,
+   "execution_count": 134,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "41445.533268606625\n"
+      "41422.168800999665\n"
      ]
     }
    ],
@@ -27597,14 +27616,24 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can compute a 95% confidence interval for the test RMSE using SciPy's `bootstrap()` function:"
+    "SciPy's `bootstrap()` function gives a 95% confidence interval\n",
+    "for the test RMSE, computed as follows.\n",
+    "Confidence interval is abbreviated as CI."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 146,
+   "execution_count": 135,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "95% CI for RMSE: (39487.5429, 43654.0546)\n"
+     ]
+    }
+   ],
    "source": [
     "from scipy.stats import bootstrap\n",
     "\n",
@@ -27615,111 +27644,67 @@
     "squared_errors = (final_predictions - y_test) ** 2\n",
     "boot_result = bootstrap([squared_errors], rmse, confidence_level=confidence,\n",
     "                        random_state=42)\n",
-    "rmse_lower, rmse_upper = boot_result.confidence_interval"
+    "rmse_lower, rmse_upper = boot_result.confidence_interval\n",
+    "print(f\"95% CI for RMSE: ({rmse_lower:.4f}, {rmse_upper:.4f})\")"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": 147,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "95% CI for RMSE: (39520.9572, 43701.7681)\n"
-     ]
-    }
-   ],
    "source": [
-    "print(f\"95% CI for RMSE: ({rmse_lower:.4f}, {rmse_upper:.4f})\")"
+    "### Other Uses of the Model"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
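-    "### Other Uses of the Model"
+    "Machine learning models are not used only for making predictions.\n",
+    "Depending on the model type, they may offer other capabilities alongside prediction.\n",
+    "\n",
+    "For example, a trained random forest model\n",
+    "measures during training how much each input feature contributes to its predictions,\n",
+    "and reports this as the feature importance.\n",
+    "\n",
+    "The random forest model tuned for California housing price prediction\n",
+    "stores the per-feature importances in its `feature_importances_` attribute.\n",
+    "Inspecting it shows that the log-transformed median income (`log__median_income`) is the most important feature, followed by the 'inland' (`INLAND`) category among the ocean proximity features.\n",
+    "The remaining eight features listed are the bedroom ratio, rooms per household, people per household, and five cluster similarity features,\n",
+    "with the log-transformed median income far ahead of all of them."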
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 136,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "[(np.float64(0.18599734460509476), 'log__median_income'),\n",
-       " (np.float64(0.07338850855844489), 'cat__ocean_proximity_INLAND'),\n",
-       " (np.float64(0.06556941990883976), 'bedrooms__ratio'),\n",
-       " (np.float64(0.053648710076725316), 'rooms_per_house__ratio'),\n",
-       " (np.float64(0.04598870861894749), 'people_per_house__ratio'),\n",
-       " (np.float64(0.04175269214442519), 'geo__Cluster 30 similarity'),\n",
-       " (np.float64(0.025976797232869678), 'geo__Cluster 25 similarity'),\n",
-       " (np.float64(0.023595895886342255), 'geo__Cluster 36 similarity'),\n",
-       " (np.float64(0.02021056221732893), 'geo__Cluster 9 similarity'),\n",
-       " (np.float64(0.01860691707666145), 'geo__Cluster 34 similarity'),\n",
-       " (np.float64(0.018137988374628867), 'geo__Cluster 37 similarity'),\n",
-       " (np.float64(0.01740435316632675), 'geo__Cluster 18 similarity'),\n",
-       " (np.float64(0.016778386143844894), 'geo__Cluster 1 similarity'),\n",
-       " (np.float64(0.015459009666188978), 'geo__Cluster 7 similarity'),\n",
-       " (np.float64(0.015325731028175924), 'geo__Cluster 32 similarity'),\n",
-       " (np.float64(0.015073772015038348), 'geo__Cluster 13 similarity'),\n",
-       " (np.float64(0.014272160962173805), 'geo__Cluster 35 similarity'),\n",
-       " (np.float64(0.014180636461860479), 'geo__Cluster 0 similarity'),\n",
-       " (np.float64(0.013746364498238989), 'geo__Cluster 3 similarity'),\n",
-       " (np.float64(0.01357230570846952), 'geo__Cluster 28 similarity'),\n",
-       " (np.float64(0.01294034969422872), 'geo__Cluster 26 similarity'),\n",
-       " (np.float64(0.012738123746761944), 'geo__Cluster 31 similarity'),\n",
-       " (np.float64(0.011654237215152624), 'geo__Cluster 19 similarity'),\n",
-       " (np.float64(0.011628003598059723), 'geo__Cluster 6 similarity'),\n",
-       " (np.float64(0.011134113333125398), 'geo__Cluster 24 similarity'),\n",
-       " (np.float64(0.011042979326385049), 'remainder__housing_median_age'),\n",
-       " (np.float64(0.010907388443940418), 'geo__Cluster 43 similarity'),\n",
-       " (np.float64(0.010847192663592166), 'geo__Cluster 44 similarity'),\n",
-       " (np.float64(0.010592244492858267), 'geo__Cluster 10 similarity'),\n",
-       " (np.float64(0.010512467290844922), 'geo__Cluster 23 similarity'),\n",
-       " (np.float64(0.01045866561538645), 'geo__Cluster 41 similarity'),\n",
-       " (np.float64(0.010261910692851673), 'geo__Cluster 40 similarity'),\n",
-       " (np.float64(0.009757306983097491), 'geo__Cluster 2 similarity'),\n",
-       " (np.float64(0.00965993322211448), 'geo__Cluster 12 similarity'),\n",
-       " (np.float64(0.009574969190852869), 'geo__Cluster 14 similarity'),\n",
-       " (np.float64(0.008199144719918425), 'geo__Cluster 20 similarity'),\n",
-       " (np.float64(0.008141941480860806), 'geo__Cluster 33 similarity'),\n",
-       " (np.float64(0.007596761219964691), 'geo__Cluster 8 similarity'),\n",
-       " (np.float64(0.0075762980128490295), 'geo__Cluster 22 similarity'),\n",
-       " (np.float64(0.007346290789504319), 'geo__Cluster 39 similarity'),\n",
-       " (np.float64(0.006898774333063982), 'geo__Cluster 4 similarity'),\n",
-       " (np.float64(0.0067947318450798395), 'log__total_rooms'),\n",
-       " (np.float64(0.006514889773323568), 'log__population'),\n",
-       " (np.float64(0.006350528211987125), 'geo__Cluster 27 similarity'),\n",
-       " (np.float64(0.006337558749902337), 'geo__Cluster 16 similarity'),\n",
-       " (np.float64(0.006231053672395539), 'geo__Cluster 38 similarity'),\n",
-       " (np.float64(0.0061213483458714855), 'log__households'),\n",
-       " (np.float64(0.005849842001582111), 'log__total_bedrooms'),\n",
-       " (np.float64(0.0056783104666850125), 'geo__Cluster 15 similarity'),\n",
-       " (np.float64(0.005479729990673467), 'geo__Cluster 29 similarity'),\n",
-       " (np.float64(0.005348325088535128), 'geo__Cluster 42 similarity'),\n",
-       " (np.float64(0.004866251452445486), 'geo__Cluster 17 similarity'),\n",
-       " (np.float64(0.004495340541933027), 'geo__Cluster 11 similarity'),\n",
-       " (np.float64(0.004418821635620684), 'geo__Cluster 5 similarity'),\n",
-       " (np.float64(0.0035344732505291285), 'geo__Cluster 21 similarity'),\n",
-       " (np.float64(0.001832424657341851), 'cat__ocean_proximity_<1H OCEAN'),\n",
-       " (np.float64(0.0015282226447271795), 'cat__ocean_proximity_NEAR OCEAN'),\n",
-       " (np.float64(0.0004325970342247361), 'cat__ocean_proximity_NEAR BAY'),\n",
-       " (np.float64(3.0190221102670295e-05), 'cat__ocean_proximity_ISLAND')]"
+       "[(np.float64(0.18836603202647126), 'log__median_income'),\n",
+       " (np.float64(0.07795960969938898), 'cat__ocean_proximity_INLAND'),\n",
+       " (np.float64(0.06110388595864347), 'bedrooms__ratio'),\n",
+       " (np.float64(0.05772194900488602), 'rooms_per_house__ratio'),\n",
+       " (np.float64(0.04569274355282605), 'people_per_house__ratio'),\n",
+       " (np.float64(0.041977095119231075), 'geo__Cluster 30 similarity'),\n",
+       " (np.float64(0.024893290428216707), 'geo__Cluster 9 similarity'),\n",
+       " (np.float64(0.02349145973584661), 'geo__Cluster 36 similarity'),\n",
+       " (np.float64(0.021384735075780065), 'geo__Cluster 18 similarity'),\n",
+       " (np.float64(0.019231937253583756), 'geo__Cluster 3 similarity')]"
       ]
      },
+     "execution_count": 136,
      "metadata": {},
-     "output_type": "display_data"
+     "output_type": "execute_result"
     }
    ],
    "source": [
     "feature_importances = final_model[\"random_forest\"].feature_importances_\n",
-    "sorted(zip(feature_importances,\n",
-    "           final_model[\"preprocessing\"].get_feature_names_out()),\n",
-    "       reverse=True)"
+    "\n",
+    "important_features = sorted(zip(feature_importances,\n",
+    "                                final_model[\"preprocessing\"].get_feature_names_out()),\n",
+    "                            reverse=True)\n",
+    "important_features[:10]"
    ]
   },
   {
@@ -27733,12 +27718,48 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Save the final model:"
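+    "Training the best model can take a very long time.\n",
+    "A good model, once trained, should therefore be saved to a file.\n",
+    "Then, whenever the model is needed, it can be loaded from the saved file\n",
+    "and used right away without retraining.\n",
+    "A newly trained model may also turn out to be unsuitable,\n",
+    "making it necessary to roll back to a previous version,\n",
+    "so saving well-trained models is essential.\n",
+    "\n",
+    "Saving and loading a model use the `dump()` and `load()` functions\n",
+    "of the `joblib` module, respectively.\n",
+    "\n",
+    "- Saving\n",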
+    "\n",
+    "    ```python\n",
+    "    import joblib\n",
+    "    joblib.dump(final_model, \"my_california_housing_model.pkl\")\n",
+    "    ```\n",
+    "- Loading and using\n",
+    "\n",
+    "    ```python\n",
+    "    final_model_reloaded = joblib.load(\"my_california_housing_model.pkl\")\n",
+    "    final_model_reloaded.predict(X_test)\n",
+    "    ```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Saving the model**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Running the code below saves the best model as a pickle file at the specified path."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 144,
+   "execution_count": 137,
    "metadata": {},
    "outputs": [
     {
@@ -27747,7 +27768,7 @@
        "['my_california_housing_model.pkl']"
       ]
      },
-     "execution_count": 144,
+     "execution_count": 137,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -27762,52 +27783,49 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now you can deploy this model to production. For example, the following code could be a script that would run in production:"
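+    "The saved model can be loaded again and put straight to use.\n",
+    "However, the libraries the saved model depends on must also be imported.\n",
+    "Otherwise, loading or using the model raises an error because the classes or functions it needs are undefined."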
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 145,
+   "execution_count": 140,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[440208.11 457822.08 106671.   100105.   330637.02]\n"
+     ]
+    }
+   ],
    "source": [
     "import joblib\n",
     "\n",
-    "# extra code – excluded for conciseness\n",
-    "from sklearn.cluster import KMeans\n",
-    "from sklearn.base import BaseEstimator, TransformerMixin\n",
-    "from sklearn.metrics.pairwise import rbf_kernel\n",
+    "### Important note: start ###\n",
+    "# The libraries the saved model depends on must also be imported.\n",
+    "# Otherwise, loading the model fails because the classes or functions it needs are undefined.\n",
     "\n",
-    "def column_ratio(X):\n",
-    "    return X[:, [0]] / X[:, [1]]\n",
+    "# from sklearn.cluster import KMeans\n",
+    "# from sklearn.base import BaseEstimator, TransformerMixin\n",
+    "# from sklearn.metrics.pairwise import rbf_kernel\n",
+    "\n",
+    "# def column_ratio(X):\n",
+    "#     return X[:, [0]] / X[:, [1]]\n",
     "\n",
     "#class ClusterSimilarity(BaseEstimator, TransformerMixin):\n",
     "#    [...]\n",
+    "### Important note: end ###\n",
     "\n",
+    "# Load the model\n",
     "final_model_reloaded = joblib.load(\"my_california_housing_model.pkl\")\n",
     "\n",
-    "new_data = housing.iloc[:5]  # pretend these are new districts\n",
-    "predictions = final_model_reloaded.predict(new_data)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 146,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "array([441046.12, 454713.09, 104832.  , 101316.  , 336181.05])"
-      ]
-     },
-     "execution_count": 146,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "predictions"
+    "# Example: using the reloaded model to predict on new data\n",
+    "new_data = housing.iloc[:5]  # pretend these are new data\n",
+    "predictions = final_model_reloaded.predict(new_data)\n",
+    "print(predictions)"
    ]
   }
  ],
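
Note on the bootstrap confidence interval cell: the sketch below may help show what the interval measures. It is not the notebook's code. It uses only the standard library and the plain percentile method, whereas SciPy's `bootstrap()` defaults to the BCa method, so the numbers will not match the notebook's output. The squared errors are synthetic, and the helper name `percentile_bootstrap_ci` and its parameters are hypothetical.

```python
import math
import random


def rmse(squared_errors):
    """Root of the mean of the squared errors."""
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percentile_bootstrap_ci(squared_errors, confidence=0.95,
                            n_resamples=2000, seed=42):
    """Percentile bootstrap CI for the RMSE (hypothetical helper, not SciPy's BCa)."""
    rng = random.Random(seed)
    n = len(squared_errors)
    # Resample the squared errors with replacement and recompute the RMSE each time.
    stats = sorted(
        rmse(rng.choices(squared_errors, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = stats[int(alpha * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Synthetic squared errors standing in for (final_predictions - y_test) ** 2.
data_rng = random.Random(0)
squared_errors = [data_rng.gauss(0.0, 40000.0) ** 2 for _ in range(500)]

point = rmse(squared_errors)
lower, upper = percentile_bootstrap_ci(squared_errors)
print(f"RMSE: {point:.4f}")
print(f"95% CI for RMSE: ({lower:.4f}, {upper:.4f})")
```

The interval is read as in the notebook: it is a range of plausible values for the test RMSE given the sampling variability of the test set, and a wider interval means the single RMSE figure is less certain.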
