Merged

19 commits
11c16e2
added github workflow for python app and fix for flake8 issue.
sanlinnaing Jan 7, 2026
98c8be1
refined the README
sanlinnaing Jan 7, 2026
8ccac17
route UI also from nginx gateway
sanlinnaing Jan 7, 2026
f410622
added infra diagram
sanlinnaing Jan 7, 2026
aa11a54
feat: added UserGroup management for user and form layout for dynamic UI
sanlinnaing Jan 8, 2026
612adf2
fix: for trailing slash ("/") issue fixed and added route in api-gateway
sanlinnaing Jan 8, 2026
dc20161
feat: added user group management in UI and adopted to change user ma…
sanlinnaing Jan 8, 2026
a40885b
feat: add basic functions of dynamic form layout.
sanlinnaing Jan 8, 2026
4461e31
fix: for type rendering in dynamic form
sanlinnaing Jan 8, 2026
f88b335
change left side panel layout and added new schema list page for UX b…
sanlinnaing Jan 10, 2026
c57b0df
changed from AL2 to AL2023
sanlinnaing Jan 10, 2026
4f2b2a7
changed from SSM to SecretsManager
sanlinnaing Jan 10, 2026
55b917f
fix(deps): add pytest-asyncio to requirements
sanlinnaing Jan 10, 2026
6286433
fix(deps): add mongomock and mongomock-motor to auth-service
sanlinnaing Jan 10, 2026
b79e474
added CDK IaC and shell script for deploy and destroy.
sanlinnaing Jan 10, 2026
82a5c4f
add ALB route for added features.
sanlinnaing Jan 10, 2026
5c467cc
feat: enhanced the layout designer for properties.
sanlinnaing Jan 11, 2026
60b69e3
added opentelemetry for metric and trace to New Relic
sanlinnaing Jan 12, 2026
e084b12
Merge branch 'prod' into main
sanlinnaing Jan 12, 2026
65 changes: 65 additions & 0 deletions Observability.md
@@ -0,0 +1,65 @@
# Observability Strategy: Custom Metrics vs. OpenTelemetry

This document summarizes the discussion on observability strategies for the Dynaman project, covering the trade-offs between using direct custom metrics and adopting the OpenTelemetry standard.

## Initial Question: Custom Metrics vs. OpenTelemetry

The primary question was to understand the difference between implementing custom metrics manually versus using OpenTelemetry with AWS CloudWatch, especially concerning cost and implementation effort.

### High-Level Comparison

| Feature | Custom Metrics (e.g., using `boto3`) | CloudWatch with OpenTelemetry |
| :--- | :--- | :--- |
| **Concept** | **Direct API Interaction.** Manually construct and send metric data directly to the CloudWatch API. | **Standardized Instrumentation.** Use a vendor-neutral API (OTel) in the app, which then sends data to a backend via a configurable "exporter". |
| **Implementation** | Entirely manual. Requires writing specific code for every single metric. | **Automatic & Manual.** Provides auto-instrumentation for frameworks (FastAPI) to capture standard signals (latency, errors) with minimal setup. |
| **Flexibility** | **Low (Vendor Lock-in).** Code is tightly coupled to the AWS `boto3` SDK. Switching providers requires a complete rewrite of monitoring code. | **High (Vendor-Neutral).** Application code is decoupled from the backend. Switching from CloudWatch to another provider is a configuration change. |
| **Cost** | **Direct Cost:** Standard CloudWatch ingestion fees. **Indirect Cost:** High development and maintenance time. | **Direct Cost:** Ingestion fees are the same. **Indirect Cost:** Lower long-term development cost. A small compute cost exists if using the Collector. |
| **Features**| **Metrics only.** | **Metrics, Traces, and Logs.** OTel is designed to handle all three pillars of observability, allowing for rich, correlated data. |

### Does OpenTelemetry create more metrics?

Yes. Out of the box, OTel's auto-instrumentation captures a comprehensive set of standard metrics (e.g., latency histograms for every API endpoint), which is more than you would typically create by hand.

However, this is a feature. It provides a rich, detailed view of system health from the start. Crucially, **you have full control to manage cost** by configuring **Views** in the SDK or **Processors** in the Collector to filter, aggregate, or drop metrics before they are sent to CloudWatch.

## The Role of the OpenTelemetry Collector

The next question was about the components needed on the AWS side and the concept of a "sidecar container".

There are two main patterns to get OTel data to CloudWatch:

### Path 1: Direct Export
The application uses an AWS-specific exporter within the OTel SDK to send data directly to CloudWatch APIs.

**Flow:**
`[Your Python App + OTel SDK + AWS Exporter] ----(HTTPS)----> [AWS CloudWatch API]`

### Path 2: Collector Sidecar (Recommended)
The application sends its data to an OTel Collector running as a **sidecar container** alongside the application container in the same ECS Task. The Collector then forwards the data to CloudWatch.

**Flow:**
`[Your App (OTel SDK)] --(localhost)--> [OTel Collector (Sidecar)] --(HTTPS)--> [AWS CloudWatch API]`

**Associated Costs:**
* **CloudWatch Ingestion Cost:** No change. This is the same in both patterns.
* **Compute Cost:** This pattern introduces a small, additional compute cost because the sidecar container requires its own CPU and memory allocation in the ECS task.

The benefits of the collector (improved performance, reliability, centralized configuration) generally outweigh its small compute cost.

## Cluster Capacity Analysis

A request was made to analyze the existing Terraform code to determine if the ECS cluster could handle the additional resource load of an OTel Collector sidecar.

### Analysis Summary

1. **ECS Node:** The cluster runs on a single `t4g.small` instance, which provides **2048 CPU units** and **2048 MiB of memory**.
2. **Current Usage:** The 6 running tasks reserve a total of **1536 CPU units (75%)** and **1536 MiB of memory (75%)**.
3. **Sidecar Impact:** Adding a sidecar with an estimated **128 CPU units** and **128 MiB memory** to the 4 backend tasks results in a new total reservation.
* **New Total CPU:** (2 tasks × 256) + (4 tasks × 384) = **2048 CPU units**
* **New Total Memory:** (2 tasks × 256) + (4 tasks × 384) = **2048 MiB**
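
The reservation arithmetic above can be sanity-checked in a few lines of Python (assuming, as the totals imply, that every task currently reserves 256 CPU units and 256 MiB):

```python
# Re-derive the ECS reservation totals from the analysis above.
NODE_CPU = NODE_MEM = 2048                 # t4g.small capacity

current = 6 * 256                          # six tasks at 256 units each
assert current == 1536                     # 75% of the node

SIDECAR = 128                              # estimated collector footprint
new_total = 2 * 256 + 4 * (256 + SIDECAR)  # 2 plain tasks + 4 with a sidecar
assert new_total == NODE_CPU               # 2048: exactly 100% reserved
```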

### Conclusion

**Yes, the `t4g.small` node is technically sufficient, but it will be at 100% resource reservation.** This is risky and leaves no buffer for the OS, the ECS agent, or deployment activities.

**Recommendation:** To safely run with the OTel Collector sidecar, the instance type should be upgraded to a **`t4g.medium`** to provide a healthy resource buffer.
21 changes: 21 additions & 0 deletions auth-service/main.py
@@ -6,6 +6,7 @@
from api.dependencies import get_user_repository, get_db
from domain.entities.user import User, UserRole
from domain.services.security_service import SecurityService
from opentelemetry_config import setup_opentelemetry

@asynccontextmanager
async def lifespan(app: FastAPI):
@@ -33,6 +34,9 @@ async def lifespan(app: FastAPI):

app = FastAPI(title="Dynaman Auth Service", lifespan=lifespan)

# Setup OpenTelemetry
setup_opentelemetry(app)

# CORS Configuration
origins = [
"http://localhost:5173", # Vite default
@@ -48,6 +52,23 @@ async def lifespan(app: FastAPI):
allow_headers=["*"],
)

# New Relic UI Config Endpoint
import os
from pydantic import BaseModel

class UiTelemetryConfig(BaseModel):
new_relic_browser_ingest_key: str
new_relic_browser_app_id: str
environment: str

@app.get("/api/v1/config/ui", response_model=UiTelemetryConfig)
async def ui_config():
return UiTelemetryConfig(
new_relic_browser_ingest_key=os.environ.get("NEW_RELIC_BROWSER_INGEST_KEY", ""),
new_relic_browser_app_id=os.environ.get("NEW_RELIC_BROWSER_APP_ID", ""),
environment=os.environ.get("APP_ENVIRONMENT", "unknown"),
)

@app.get("/health")
async def health_check():
return {"status": "ok"}
87 changes: 87 additions & 0 deletions auth-service/opentelemetry_config.py
@@ -0,0 +1,87 @@
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.pymongo import PymongoInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def setup_opentelemetry(app):
"""Configure OpenTelemetry for the application."""

if os.environ.get("OTEL_ENABLED") != "true":
print("OpenTelemetry is disabled.")
return

print("OpenTelemetry is enabled. Initializing...")
# Get the service name from an environment variable, default to 'auth-service'
service_name = os.environ.get("OTEL_SERVICE_NAME", "auth-service")
deployment_environment = os.environ.get("APP_ENVIRONMENT", "unknown")

# Set up a resource with the service name
resource = Resource(attributes={
"service.name": service_name,
"deployment.environment": deployment_environment,
})

# --- TRACES SETUP ---
# Set up a TracerProvider
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)

# Configure the OTLP exporter to send data to the collector sidecar
# The endpoint is the default for the OTel Collector's gRPC port
otlp_exporter = OTLPSpanExporter(
endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
insecure=True # Use insecure connection for localhost communication
)

# Use a BatchSpanProcessor to send spans in batches
span_processor = BatchSpanProcessor(otlp_exporter)
tracer_provider.add_span_processor(span_processor)
# ---------------------

# --- METRICS SETUP ---
# Configure the Metric Exporter (pointing to Collector gRPC port)
metric_exporter = OTLPMetricExporter(
endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
insecure=True
)

# Metrics are sent periodically (default is every 60s)
reader = PeriodicExportingMetricReader(metric_exporter)

# Set up the MeterProvider with the same resource tags
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)
# ---------------------

# Instrument FastAPI
FastAPIInstrumentor.instrument_app(app, tracer_provider=tracer_provider)

# Instrument Pymongo for MongoDB queries (which covers motor)
def request_hook(span, event):
if span and span.is_recording():
# Manually ensure New Relic sees the database name
# 'event.database_name' is provided by the pymongo monitoring API
span.set_attribute("db.name", event.database_name)
span.set_attribute("db.namespace", event.database_name) # Newer standard

# Optional: capture the collection name if missing
if hasattr(event, 'command') and event.command:
collection = event.command.get(event.command_name)
if isinstance(collection, str):
span.set_attribute("db.collection.name", collection)

# Apply the instrumentation with the hook
PymongoInstrumentor().instrument(tracer_provider=tracer_provider, request_hook=request_hook)

# Instrument the requests library for any outgoing HTTP calls
RequestsInstrumentor().instrument(tracer_provider=tracer_provider)
8 changes: 8 additions & 0 deletions auth-service/requirements.txt
@@ -14,3 +14,11 @@ pytest-cov
flake8
mongomock
mongomock-motor

# OpenTelemetry
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
opentelemetry-instrumentation-requests
opentelemetry-instrumentation-pymongo
39 changes: 39 additions & 0 deletions docker-compose.yml
@@ -24,15 +24,38 @@ services:
depends_on:
- mongodb

otel-collector:
# Pinned collector-contrib version used by this configuration; double-check this tag if the image fails to pull or the collector misbehaves.
image: otel/opentelemetry-collector-contrib:0.143.1
container_name: dyna_otel_collector
restart: always
command: ["--config=/etc/otel-collector-local-config.yaml"]
volumes:
- ./otel-collector-local-config.yaml:/etc/otel-collector-local-config.yaml
ports:
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
environment:
# The NEW_RELIC_LICENSE_KEY must be set in your environment (e.g., in a .env file)
- NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY:?Please set NEW_RELIC_LICENSE_KEY in your environment}
- NEW_RELIC_OTLP_ENDPOINT=${NEW_RELIC_OTLP_ENDPOINT:-https://otlp.nr-data.net:443}

engine-metadata:
build: ./engine
container_name: dyna_engine_metadata
environment:
- APP_MODE=metadata
- MONGODB_URL=mongodb://mongodb:27017
- DATABASE_NAME=dynaman
- OTEL_ENABLED=true
- OTEL_SERVICE_NAME=engine-meta
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
- OTEL_EXPORTER_OTLP_PROTOCOL=grpc
- OTEL_EXPORTER_OTLP_INSECURE=true
- APP_ENVIRONMENT=development
depends_on:
- mongodb
- otel-collector
restart: always

engine-execution:
@@ -42,8 +65,15 @@
- APP_MODE=execution
- MONGODB_URL=mongodb://mongodb:27017
- DATABASE_NAME=dynaman
- OTEL_ENABLED=true
- OTEL_SERVICE_NAME=engine-exec
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
- OTEL_EXPORTER_OTLP_PROTOCOL=grpc
- OTEL_EXPORTER_OTLP_INSECURE=true
- APP_ENVIRONMENT=development
depends_on:
- mongodb
- otel-collector
restart: always

auth-service:
@@ -53,8 +83,17 @@
- MONGODB_URL=mongodb://mongodb:27017
- DATABASE_NAME=dynaman_auth
- SECRET_KEY=09d25e094faa6ca2556c818166b7a9563b93f7099f6f0f4caa6cf63b88e8d3e7
- OTEL_ENABLED=true
- OTEL_SERVICE_NAME=auth-service
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
- OTEL_EXPORTER_OTLP_PROTOCOL=grpc
- OTEL_EXPORTER_OTLP_INSECURE=true
- APP_ENVIRONMENT=development
- NEW_RELIC_BROWSER_INGEST_KEY=${NEW_RELIC_BROWSER_INGEST_KEY}
- NEW_RELIC_BROWSER_APP_ID=${NEW_RELIC_BROWSER_APP_ID}
depends_on:
- mongodb
- otel-collector
restart: always

api-gateway:
7 changes: 7 additions & 0 deletions dynaman-ui/nginx.conf
@@ -13,4 +13,11 @@ server {
# location /api/ {
# proxy_pass http://engine:8000/;
# }

# Expose Nginx stub status metrics
location /nginx_status {
stub_status on;
access_log off;
allow 127.0.0.1; # Allow access only from localhost (or specific IPs)
deny all; # Without an explicit deny, all other clients remain allowed
}
}