This repository provides a complete, reproducible pipeline for generating high-quality synthetic seed data for a reinforcement learning (RL) environment simulating Asana, an enterprise project management platform.
The generated dataset models a realistic B2B SaaS organization (~7,000 users) using Asana across Product, Marketing, and Operations teams. The primary goal is to create data that supports meaningful evaluation and fine-tuning of computer-use AI agents, while avoiding unrealistic shortcuts or uniform distributions.
The output is a fully populated SQLite database representing a realistic Asana workspace, including:
- Tasks
- Subtasks
- Comments
- Collaboration metadata
This dataset can be used for testing RL agents in a realistic enterprise collaboration setting.
- Realistic user and team structure across multiple departments
- Task hierarchies with subtasks
- Rich metadata including comments and collaboration events
- Fully reproducible synthetic dataset
- Clone the repository:
git clone <repository-url>
- Follow the instructions in the pipeline scripts to generate the synthetic database.
- Training reinforcement learning agents for productivity tools
- Evaluating AI agents in realistic task and collaboration scenarios
- Testing analytics or reporting algorithms on enterprise task data
asana-rl-seed/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── schema.sql # SQLite database schema (DDL)
├── .env.example # Example environment configuration
├── src/
│ ├── main.py # Pipeline orchestration
│ ├── generators/ # Data generation modules
│ │ ├── users.py
│ │ ├── teams.py
│ │ ├── projects.py
│ │ ├── sections.py
│ │ ├── tasks.py
│ │ ├── subtasks.py
│ │ ├── comments.py
│ │ └── tags.py
│ ├── models/
│ │ └── config.py # Central configuration
│ └── utils/ # Helper utilities
├── prompts/ # Prompt templates (if applicable)
└── output/
    └── asana_simulation.sqlite # Generated SQLite database
The database schema models the following core entities:
- organizations
- users
- teams
- team_memberships
- projects
- sections
- tasks
- subtasks
- comments
- tags
- task_tags
flowchart LR
A[main.py<br/>Pipeline Orchestrator]
A --> B[schema.sql<br/>SQLite Schema]
B --> C[(asana_simulation.sqlite)]
A --> D[User Generator]
A --> E[Team & Membership Generator]
A --> F[Project Generator]
A --> G[Section Generator]
A --> H[Task Generator]
A --> I[Subtask Generator]
A --> J[Comment Generator]
A --> K[Tag Generator]
D --> C
E --> C
F --> C
G --> C
H --> C
I --> C
J --> C
K --> C
subgraph Generators
D[src/generators/users.py]
E[src/generators/teams.py]
F[src/generators/projects.py]
G[src/generators/sections.py]
H[src/generators/tasks.py]
I[src/generators/subtasks.py]
J[src/generators/comments.py]
K[src/generators/tags.py]
end
subgraph Configuration
L[models/config.py]
end
L --> A
Relationships enforce:
- Hierarchical task structure
- Team-scoped projects
- Realistic user assignments
- Referential integrity across all entities
An entity-relationship diagram (ERD), generated using dbdiagram.io, is provided separately in the documentation.
The pipeline is orchestrated via src/main.py, which runs the following steps in order over a single database connection:
- Initialize schema
- Insert organization
- Generate users
- Generate teams and team memberships
- Generate projects
- Generate sections
- Generate tasks
- Generate subtasks
- Generate comments
- Generate tags
Each step commits data incrementally while preserving consistency.
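The step order above can be sketched as a loop over step functions that share one connection and commit after each step. This is a minimal sketch, not the repository's code: `run_pipeline`, the two step functions, and the two-table schema are hypothetical stand-ins for the real generators and schema.sql.

```python
import sqlite3

# Hypothetical two-table schema; the real DDL lives in schema.sql.
SCHEMA = """
CREATE TABLE organizations (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    organization_id INTEGER NOT NULL REFERENCES organizations(id),
    name TEXT NOT NULL
);
"""


def insert_organization(conn):
    # Hypothetical step: one organization row that later steps depend on.
    conn.execute("INSERT INTO organizations (name) VALUES (?)", ("Example Corp",))


def generate_users(conn):
    # Hypothetical step: reads the parent row written by the previous step.
    org_id = conn.execute("SELECT id FROM organizations LIMIT 1").fetchone()[0]
    rows = [(org_id, f"user-{i}") for i in range(3)]
    conn.executemany("INSERT INTO users (organization_id, name) VALUES (?, ?)", rows)


def run_pipeline(db_path, steps):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)                # step 1: initialize schema
    for step in steps:                        # remaining steps, in dependency order
        step(conn)
        conn.commit()                         # commit after every step
    return conn


conn = run_pipeline(":memory:", [insert_organization, generate_users])
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(user_count)
```

Ordering the steps so that parents are inserted before children means every foreign key can be resolved at insert time.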
- ~7,000 users
- ~5% admins, ~95% members
- ~95% active users
- Company-domain emails with collision-safe disambiguation
- Join dates spread across the last 24 months
- Three teams: Product, Marketing, Operations
- Non-uniform team sizes
- Cross-functional membership for a minority of users
- Projects are team-scoped
- Projects include start and due dates
- Project status distribution: planned / active / completed
- ~40–60 tasks per project
- Section distribution:
  - ~45% To Do
  - ~35% In Progress
  - ~20% Done
- ~15% unassigned tasks
- Task due dates are clustered around project deadlines
- Some tasks are overdue and some have no due date
- ~30% of tasks have subtasks
- 2–5 subtasks per parent task
- Hierarchical completion consistency enforced
- 0–5 comments per task
- Comments are authored by the task assignee or teammates
- Comment timestamps always fall within the task lifetime
- Shared tag vocabulary
- ~40% of tasks tagged
- 1–2 tags per task
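The task-level percentages listed above could be drawn with weighted sampling from the standard library. This is a sketch under assumptions: `sample_task` is a hypothetical helper, the counts are drawn uniformly within their stated ranges (the source does not say which distribution is used), and the seed value is arbitrary.

```python
import random

# Section weights taken from the list above.
SECTION_WEIGHTS = {"To Do": 0.45, "In Progress": 0.35, "Done": 0.20}


def sample_task(rng, user_ids):
    """Draw the attributes of one task (hypothetical helper)."""
    section = rng.choices(
        list(SECTION_WEIGHTS), weights=list(SECTION_WEIGHTS.values()), k=1
    )[0]
    return {
        "section": section,
        # ~15% of tasks are left unassigned.
        "assignee_id": None if rng.random() < 0.15 else rng.choice(user_ids),
        # ~30% of tasks get 2-5 subtasks; uniform within the range is an assumption.
        "n_subtasks": rng.randint(2, 5) if rng.random() < 0.30 else 0,
        # 0-5 comments per task; uniform is an assumption.
        "n_comments": rng.randint(0, 5),
        # ~40% of tasks get 1-2 tags.
        "n_tags": rng.randint(1, 2) if rng.random() < 0.40 else 0,
    }


rng = random.Random(42)            # arbitrary seed, for repeatability
user_ids = list(range(1, 101))     # hypothetical user IDs
n_tasks = rng.randint(40, 60)      # ~40-60 tasks per project
tasks = [sample_task(rng, user_ids) for _ in range(n_tasks)]
print(len(tasks), tasks[0])
```

Passing the `random.Random` instance explicitly keeps every draw tied to one seed instead of relying on the module-level generator.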
The generator enforces strict temporal rules critical for RL policy learning:
- Tasks are never completed before creation
- Subtasks follow parent task timelines
- Comments occur after task creation and before completion
- Project start dates precede due dates
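One way to satisfy these rules is to sample each timestamp inside a window bounded by the timestamps it depends on. The sketch below illustrates that idea and is not the repository's implementation: `random_between` and `make_task_timeline` are hypothetical helpers, and the 20% completion rate and fixed dates are placeholders.

```python
import random
from datetime import datetime, timedelta


def random_between(rng, start, end):
    """Return a datetime drawn uniformly from [start, end]."""
    span = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span))


def make_task_timeline(rng, project_start, project_due, now):
    """Sample created/completed/comment timestamps that respect the rules above."""
    # A task is created during its project, and never in the future.
    created_at = random_between(rng, project_start, min(project_due, now))
    # Placeholder assumption: 20% of tasks are completed, always after creation.
    completed_at = random_between(rng, created_at, now) if rng.random() < 0.20 else None
    # Comments fall between creation and completion (or "now" for open tasks).
    upper = completed_at if completed_at is not None else now
    comments = sorted(
        random_between(rng, created_at, upper) for _ in range(rng.randint(0, 5))
    )
    return created_at, completed_at, comments


rng = random.Random(0)                              # arbitrary seed
now = datetime(2025, 1, 1)                          # fixed "now" keeps runs repeatable
project_start = now - timedelta(days=90)
project_due = project_start + timedelta(days=60)    # start precedes due date
timelines = [make_task_timeline(rng, project_start, project_due, now) for _ in range(200)]
print(timelines[0])
```

Because each window is derived from the parent timestamp, invalid orderings cannot be generated and no post-hoc repair pass is needed.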
All relationships are enforced using foreign keys and controlled insertion logic:
- Users must belong to an organization
- Team memberships reference valid users and teams
- Tasks reference valid projects and sections
- Subtasks reference valid parent tasks
- Comments reference valid tasks and authors
- Task-tag relationships are many-to-many
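SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection. The sketch below shows a dangling task-tag link being rejected; the table and column definitions are assumptions for illustration and may differ from schema.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction table: the composite primary key makes the relation many-to-many
-- while preventing duplicate links.
CREATE TABLE task_tags (
    task_id INTEGER NOT NULL REFERENCES tasks(id),
    tag_id  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (task_id, tag_id)
);
""")
conn.execute("INSERT INTO tasks (id, name) VALUES (1, 'Draft launch plan')")
conn.execute("INSERT INTO tags (id, name) VALUES (1, 'urgent')")


def try_link(task_id, tag_id):
    """Return True if the link was inserted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO task_tags (task_id, tag_id) VALUES (?, ?)", (task_id, tag_id))
        return True
    except sqlite3.IntegrityError:
        return False


valid_link = try_link(1, 1)       # both parent rows exist
dangling_link = try_link(999, 1)  # task 999 does not exist
duplicate_link = try_link(1, 1)   # same pair again violates the primary key
print(valid_link, dangling_link, duplicate_link)
```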
- Python 3.10+
- SQLite (bundled with Python)
- Python packages: faker
Install dependencies:
pip install -r requirements.txt
From the project root:
python src/main.py
The generated database is written to output/asana_simulation.sqlite. To inspect it:
sqlite3 output/asana_simulation.sqlite
.tables
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM tasks;
SELECT COUNT(*) FROM comments;
- The pipeline is deterministic for a fixed random seed
- Deleting the SQLite file and re-running regenerates the dataset
- All configuration values are centralized in models/config.py
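Seed-based determinism can be illustrated with the standard library alone: a dedicated `random.Random` instance built from the same seed yields the same sequence on every run. `generate_join_offsets` is a hypothetical helper and the seed values are arbitrary. Faker has its own seeding hook (`Faker.seed`), which would need to be set as well for its output to repeat.

```python
import random


def generate_join_offsets(seed: int, n: int) -> list[int]:
    """Return n 'days ago' join offsets spread over roughly 24 months (730 days)."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    return [rng.randint(0, 730) for _ in range(n)]


run_a = generate_join_offsets(seed=42, n=5)
run_b = generate_join_offsets(seed=42, n=5)
print(run_a == run_b)  # same seed, same sequence
```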
- Custom fields are documented conceptually but not physically implemented to reduce schema complexity
- Projects are team-scoped to minimize ownership ambiguity
- Subtasks store project_id as a denormalization for RL efficiency
- All trade-offs are intentional and documented
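The subtask denormalization can be pictured as follows: with `project_id` stored on each subtask, a per-project lookup needs no join through the parent task, at the cost of keeping the copy in sync. The column names and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    name TEXT NOT NULL
);
CREATE TABLE subtasks (
    id INTEGER PRIMARY KEY,
    parent_task_id INTEGER NOT NULL REFERENCES tasks(id),
    project_id INTEGER NOT NULL,  -- denormalized copy of the parent's project_id
    name TEXT NOT NULL
);
INSERT INTO tasks VALUES (1, 10, 'Launch campaign'), (2, 20, 'Audit vendors');
INSERT INTO subtasks VALUES
    (1, 1, 10, 'Write copy'),
    (2, 1, 10, 'Book ads'),
    (3, 2, 20, 'Collect invoices');
""")

# Direct lookup using the denormalized column.
direct = conn.execute(
    "SELECT COUNT(*) FROM subtasks WHERE project_id = ?", (10,)
).fetchone()[0]

# Equivalent lookup through the parent task.
joined = conn.execute(
    "SELECT COUNT(*) FROM subtasks s JOIN tasks t ON t.id = s.parent_task_id "
    "WHERE t.project_id = ?", (10,)
).fetchone()[0]

# Consistency check: the copy must always match the parent's value.
mismatches = conn.execute(
    "SELECT COUNT(*) FROM subtasks s JOIN tasks t ON t.id = s.parent_task_id "
    "WHERE s.project_id != t.project_id"
).fetchone()[0]
print(direct, joined, mismatches)
```

A check like the last query is a cheap way to confirm that the denormalized column has not drifted from its source.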
This dataset is designed for:
- Reinforcement learning environment simulation
- Evaluation of computer-use AI agents
- Research on task planning and workflow automation
- Synthetic benchmarking of enterprise productivity tools
This repository was created as part of a Research Scientist Internship take-home assignment, with emphasis on realism, methodological rigor, and research-grade documentation.