Complete guide to using Prism for document extraction and knowledge querying.
Prism transforms unstructured documents into searchable, queryable knowledge. It supports:
- PDF files - Technical documents, specifications, reports
- Excel files - Spreadsheets, data tables
- Email files - .msg email archives with attachments
A project is a container for related documents. Each project has:
- Documents - Uploaded source files
- Output - Extracted and processed content
- Workflow Config - Custom question sections
- Search Index - Azure AI Search index
- Knowledge Agent - AI query interface
The processing pipeline transforms raw documents into queryable knowledge:
Upload → Process → Deduplicate → Chunk → Embed → Index → Agent → Query
Workflows are structured question sets for systematically extracting information:
- Sections - Groups of related questions (e.g., "Technical Specs")
- Questions - Individual queries with instructions
- Templates - Section-level prompts that guide the AI
Navigate to http://localhost:3000 and enter your password.
- Go to Projects
- Click New Project
- Enter a descriptive name (e.g., "vendor-proposal-2024")
- Click Create
- Open your project
- Use the upload area to add files:
- Drag and drop multiple files
- Or click to browse
- Supported formats: PDF, XLSX, XLSM, MSG
Execute each pipeline stage in order:
| Stage | Description | Output |
|---|---|---|
| Process | Extract text and images from documents | Markdown files |
| Deduplicate | Identify and remove duplicate content | Clean content |
| Chunk | Split into semantic chunks for search | JSON chunks |
| Embed | Generate vector embeddings | Embedded chunks |
| Index Create | Create Azure AI Search index | Search index |
| Index Upload | Upload chunks to search index | Indexed content |
| Source Create | Create knowledge source wrapper | Knowledge source |
| Agent Create | Create knowledge retrieval agent | Ready for queries |
Click Run on each step, or Run All for the full pipeline.
Prism tracks which documents have already been extracted. When you run the "Process" stage:
- Already-extracted documents are skipped - Saves time and API costs
- Only new documents are processed - Add documents incrementally
- Use "Re-run" to force re-extraction - Click the "Re-run" button to re-process all documents
The extraction status is tracked in output/extraction_status.json.
Once the pipeline is complete:
- Go to Query
- Select your project
- Type a natural language question
- Get AI-generated answers with source citations
- Open your project
- Click Configure Workflow
- Click Add Section
- Enter:
- Name - Section title (e.g., "Technical Specifications")
- Template - Instructions for the AI
Example template:
Answer the following question based on the technical documents.
Focus on specific values, measurements, and standards references.
If the information is not found, state "Not specified in documents."
- Expand a section
- Click Add Question
- Enter:
- Question - The query to answer
- Instructions - Specific guidance for this question
Example:
- Question: "What is the rated voltage?"
- Instructions: "Look for voltage ratings in electrical specifications. Include all voltage levels if multiple exist."
- Go to Workflows
- Select a section
- Click Run Section
- Monitor progress in real-time
- View results when complete
- View all answered questions
- Filter by section
- See completion percentage
- Export to CSV
Each result shows:
- Question - The original query
- Answer - AI-generated response
- References - Source document citations
- Comments - Additional notes
Prism provides a REST API for programmatic access:
| Endpoint | Method | Description |
|---|---|---|
/api/projects |
GET | List all projects |
/api/projects |
POST | Create a project |
/api/projects/{id} |
GET | Get project details |
/api/projects/{id} |
DELETE | Delete a project |
/api/projects/{id}/files |
GET | List project files |
/api/projects/{id}/files |
POST | Upload files |
/api/pipeline/stages |
GET | List pipeline stages |
/api/pipeline/{id}/run |
POST | Run a pipeline stage |
/api/query |
POST | Query documents |
Interactive API docs available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Use clear, descriptive filenames
- Ensure PDFs have selectable text (not scanned images)
- Organize related documents in the same project
- Be specific in your questions
- Include context in instructions
- Use consistent terminology
- Group related questions in sections
- Process documents in batches
- Use focused, specific queries
- Create targeted sections for different topics
- Check Azure credentials in
.env - Verify documents are valid and not corrupted
- Check backend logs:
docker-compose -f infra/docker/docker-compose.yml logs backend
- Verify pipeline completed successfully
- Check if documents contain the information
- Try rephrasing the question
- Add more specific instructions
- Large documents take longer to process
- Check Azure service quotas
- Consider splitting very large files
| Shortcut | Action |
|---|---|
Ctrl+Enter |
Submit query |
Escape |
Close modal |
All data is stored in the projects/ directory:
projects/
{project-name}/
config.json # Project settings (extraction instructions)
documents/ # Uploaded files
output/ # Processed content
extraction_results/ # Extracted markdown
extraction_status.json# Per-document extraction tracking
chunked_documents/ # JSON chunks
embedded_documents/ # With embeddings
results.json # Workflow answers
workflow_config.json # Sections & questions
- All queries use Azure AI services
- Documents remain in your Azure subscription
- API protected by password authentication
- HTTPS recommended for production
- Check logs:
docker-compose -f infra/docker/docker-compose.yml logs - API docs: http://localhost:8000/docs
- See QUICKSTART.md for setup
- Review CLAUDE.md for architecture