Parsely - Language Learning Vocabulary Extractor

Parsely is a tool that uses AI to extract vocabulary from language learning course notes (PDF/DOCX files) and stores it in a searchable database. It offers both a terminal user interface (TUI) and a web interface.

Features

  • AI-Powered Extraction: Uses Claude AI to intelligently extract vocabulary and phrases
  • Document Support: Parses PDF, DOCX, and plain text (TXT) files
  • Deduplication: Automatically skips vocabulary that's already in the database
  • Dual Interface: Choose between CLI (Terminal UI) or Web interface
  • Export: Export vocabulary to JSON for use in other applications
  • Security: Built with security best practices (SQL injection prevention, file validation, etc.)

Requirements

  • Go 1.23 or later
  • Claude API key (get one from Anthropic)
  • Optional: Bun or Node.js for the web frontend (only needed for frontend development)

Installation

Clone the repository

git clone https://github.com/parsely/parsely.git
cd parsely

Install dependencies

go mod download

Build the binaries

# Build CLI version
go build -o parsely-cli ./cmd/cli

# Build web version
go build -o parsely-web ./cmd/web

Configuration

Parsely uses environment variables for configuration:

Variable           Required  Default           Description
ANTHROPIC_API_KEY  Yes       (none)            Your Anthropic API key
DATABASE_PATH      No        /data/parsely.db  Path to the SQLite database file
LANGUAGE           No        auto-detect       Target language for extraction
PORT               No        8080              Port for the web server
API_TOKEN          No        (none)            Bearer token to protect API endpoints. If unset, auth is disabled (fine for local use). Set this in production.

Deployment

Running locally

The default DATABASE_PATH is /data/parsely.db, which is intended for the Railway deployment (see below). When running locally, override it to a path that exists on your machine:

DATABASE_PATH=parsely.db ANTHROPIC_API_KEY=sk-ant-... go run ./cmd/web

Or export the variables in your shell before running:

export ANTHROPIC_API_KEY="sk-ant-..."
export DATABASE_PATH="parsely.db"
./parsely-web

Deploying to Railway

The project includes a Dockerfile configured for Railway.

  1. Push the repository to GitHub and connect it to a new Railway project.
  2. Add a Volume in Railway and set the mount path to /data.
  3. Set the following environment variables in the Railway service settings:
    • ANTHROPIC_API_KEY: your Anthropic API key (required)
    • API_TOKEN: a secret token to protect your API (recommended)
    • LANGUAGE: target language, e.g. Spanish (optional)
    • DATABASE_PATH: can be left unset; defaults to /data/parsely.db
  4. Railway injects the PORT variable automatically, so no action is needed.

The SQLite database will be persisted on the mounted volume at /data/parsely.db across deployments and restarts.

Usage

CLI Version

Run the interactive terminal UI:

./parsely-cli

Features:

  • Parse new documents (PDF/DOCX)
  • View all vocabulary
  • Export to JSON
  • Navigate with arrow keys or vim keys (j/k)

Web Version

Start the web server:

./parsely-web

By default the API is available at http://localhost:8080. Set PORT to use a different port.

API Endpoints

GET    /api/vocabulary       - List all vocabulary
GET    /api/vocabulary/{id}  - Get specific vocabulary item
DELETE /api/vocabulary/{id}  - Delete vocabulary item
POST   /api/upload           - Upload and process document
POST   /api/export           - Export vocabulary to JSON
GET    /api/stats            - Get vocabulary statistics
GET    /health               - Health check

Authentication

When API_TOKEN is set, all /api/* endpoints require an Authorization header carrying the token as a Bearer credential:

curl -H "Authorization: Bearer your-token" http://localhost:8080/api/vocabulary

The /health endpoint is always public. When API_TOKEN is not set (e.g. local development), no header is required.

Upload Document Example

curl -X POST \
  -H "Authorization: Bearer your-token" \
  -F "file=@/path/to/document.pdf" \
  http://localhost:8080/api/upload

Running Tests

Run all tests with coverage:

go test ./... -cover

Run tests for a specific package:

go test ./internal/db -v
go test ./internal/parser -v
go test ./internal/ai -v
go test ./internal/core -v
go test ./internal/api -v

Project Structure

parsely/
├── cmd/
│   ├── cli/          # CLI application entry point
│   └── web/          # Web server entry point
├── internal/
│   ├── ai/           # Claude AI integration
│   ├── parser/       # PDF/DOCX parsers
│   ├── db/           # SQLite database layer
│   ├── core/         # Core business logic
│   └── api/          # HTTP API handlers
├── testdata/         # Test fixtures
├── go.mod
├── go.sum
├── README.md
└── CLAUDE.md         # Development guidelines

Security Features

  • API Authentication: Bearer token auth protects all /api/* endpoints when API_TOKEN is set
  • SQL Injection Prevention: All database queries use parameterized statements
  • Path Traversal Protection: File paths are validated to prevent directory traversal
  • File Size Limits: Maximum 10MB per document
  • File Type Validation: Only PDF and DOCX files accepted
  • Input Sanitization: All user input is validated and sanitized
  • Secure Permissions: Database and temp files created with restrictive permissions
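
To make the path traversal, file size, and file type checks concrete, here is a minimal Go sketch of an upload validator. It is an assumption-laden illustration, not Parsely's code: validateUpload, maxUploadBytes, and allowedExt are hypothetical names, the 10MB limit is interpreted as 10 MiB, and the extension check stands in for whatever file type validation the project actually performs.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// maxUploadBytes interprets the documented 10MB limit as 10 MiB (assumption).
const maxUploadBytes = 10 << 20

// allowedExt lists the accepted document types from the list above.
var allowedExt = map[string]bool{".pdf": true, ".docx": true}

// validateUpload checks a client-supplied file name and size and returns
// a base name with all directory components removed.
func validateUpload(name string, size int64) (string, error) {
	// Treat backslashes as separators too, then keep only the last element
	// so names like "../../etc/passwd" cannot escape the upload directory.
	base := filepath.Base(strings.ReplaceAll(name, "\\", "/"))
	if base == "." || base == ".." || base == "/" {
		return "", errors.New("invalid file name")
	}
	ext := strings.ToLower(filepath.Ext(base))
	if !allowedExt[ext] {
		return "", fmt.Errorf("unsupported file type %q", ext)
	}
	if size <= 0 || size > maxUploadBytes {
		return "", fmt.Errorf("file size %d is outside 1..%d bytes", size, maxUploadBytes)
	}
	return base, nil
}

func main() {
	inputs := []struct {
		name string
		size int64
	}{
		{"../../notes.pdf", 2048},
		{"lesson.docx", maxUploadBytes + 1},
		{"script.sh", 100},
	}
	for _, in := range inputs {
		base, err := validateUpload(in.name, in.size)
		if err != nil {
			fmt.Printf("%s: rejected: %v\n", in.name, err)
			continue
		}
		fmt.Printf("%s: accepted as %s\n", in.name, base)
	}
}
```

An extension check alone is easy to spoof; a stricter variant might also inspect the file's leading bytes before parsing.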

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Write tests first (TDD approach)
  4. Implement your feature
  5. Ensure all tests pass
  6. Submit a pull request

See CLAUDE.md for detailed development guidelines.

Troubleshooting

"ANTHROPIC_API_KEY not set"

Make sure you've exported your API key:

export ANTHROPIC_API_KEY="your-key"

Database Permission Errors

Ensure the database file is readable and writable by the user running Parsely, and by no one else:

chmod 600 parsely.db

PDF Parsing Errors

Some PDFs may not contain extractable text. Try:

  1. Ensuring the PDF has selectable text (not scanned images)
  2. Using a different PDF viewer to verify text content
  3. Converting scanned PDFs to text-based PDFs using OCR

Large File Errors

Files over 10MB are rejected. Compress or split your documents.

License

MIT License - see LICENSE file for details

Acknowledgments