Enhanced Source API Guide

The Kelime Enhanced Source API supports multiple input types for extracting vocabulary from real-world content. This guide covers all supported input types, security measures, and usage examples.

🚀 Quick Start

Endpoint

POST /api/sources/enhanced/

Authentication

All requests require user authentication. Include session cookies or authorization headers.

Basic Usage

Send one input type per request with a title:

curl -X POST http://localhost:8000/api/sources/enhanced/ \
  -H "Content-Type: application/json" \
  -d '{"title": "My Source", "manual_text": "Your content here..."}'

📋 Supported Input Types

1. 📄 PDF File Upload

Extract text from PDF documents with automatic text recognition.

Parameters:

title: Source title (required)
pdf_file: PDF file upload (required)

Example:

import requests

with open('document.pdf', 'rb') as f:
    files = {'pdf_file': ('document.pdf', f, 'application/pdf')}
    data = {'title': 'Academic Paper on Language Learning'}
    
    response = requests.post(
        'http://localhost:8000/api/sources/enhanced/',
        data=data,
        files=files
    )

Features:

✅ Multi-page text extraction
✅ UTF-8 text normalization
✅ 10MB file size limit
✅ MIME type validation
✅ Metadata reporting (pages, word count)

2. 🌐 Website URL

Scrape and extract clean text from web pages.

Parameters:

title: Source title (required)
web_url: Valid HTTP/HTTPS URL (required)

Example:

data = {
    'title': 'BBC News Article',
    'web_url': 'https://www.bbc.com/news/technology-12345'
}

response = requests.post(
    'http://localhost:8000/api/sources/enhanced/',
    data=data
)

Features:

✅ Automatic content area detection (<main>, <article>)
✅ Ad/navigation removal
✅ 15-second timeout
✅ Security URL validation
✅ Page title extraction

Blocked URLs:

❌ Loopback and private network ranges (127.x.x.x, 192.168.x.x, 10.x.x.x)
❌ Localhost access
❌ Non-HTTP(S) protocols
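
The blocking rules above can be approximated on the client before a request is sent. The following is a minimal sketch, not the server's actual validator: the function name `is_probably_allowed` is hypothetical, and it relies on Python's `ipaddress` module, whose `is_private` check covers more ranges than the three listed here. Hostnames are not resolved, so a domain that points at a private address would still pass this check.

```python
import ipaddress
from urllib.parse import urlparse


def is_probably_allowed(url: str) -> bool:
    """Client-side approximation of the URL rules (hypothetical helper)."""
    parsed = urlparse(url)
    # Only HTTP and HTTPS are accepted.
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if not host or host.lower() == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. www.bbc.com); DNS is not resolved here.
        return True
    # Reject loopback (127.x.x.x) and private ranges (10.x.x.x, 192.168.x.x, ...).
    return not (address.is_loopback or address.is_private)
```

A form could call `is_probably_allowed(url)` before submitting and show a message immediately instead of waiting for a 400 response.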

3. 📺 YouTube Video

Extract subtitles/transcripts from YouTube videos.

Parameters:

title: Source title (required)
youtube_url: YouTube video URL (required)

Example:

data = {
    'title': 'TED Talk - The Power of Language',
    'youtube_url': 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
}

response = requests.post(
    'http://localhost:8000/api/sources/enhanced/',
    data=data
)

Supported URL Formats:

https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
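
As an illustration of how the three formats above map to a single video ID, here is a small sketch using only the standard library. The helper name `extract_video_id` is hypothetical and the API's own extraction logic may differ; accepting `youtube.com` without `www` is an assumption based on the usage example later in this guide.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for the three documented URL formats, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    candidate = ""
    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("www.youtube.com", "youtube.com"):
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=VIDEO_ID
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith("/embed/"):
            # https://www.youtube.com/embed/VIDEO_ID
            candidate = parsed.path[len("/embed/"):].split("/")[0]
    return candidate or None
```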

Features:

✅ Auto-detect available languages
✅ Prefer English transcripts
✅ Clean transcript formatting
✅ Remove timestamps and annotations
✅ Fallback to any available language
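
The "prefer English, otherwise fall back" behaviour can be pictured with the sketch below. It is an assumption about the selection order, not the service's code: `pick_transcript_language` is a hypothetical name, and treating regional variants such as `en-GB` as English is a design choice made for this example.

```python
from typing import Optional, Sequence


def pick_transcript_language(
    available: Sequence[str], preferred: str = "en"
) -> Optional[str]:
    """Pick the preferred language if present, else the first available one."""
    if not available:
        return None  # no captions at all
    for code in available:
        # Treat regional variants such as "en-GB" as matches for "en".
        if code == preferred or code.startswith(preferred + "-"):
            return code
    # Fallback: any available language (here, simply the first listed).
    return available[0]
```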

Limitations:

❌ Videos with disabled transcripts
❌ Private/unlisted videos without transcripts
❌ Videos without any captions

4. 📺 SRT Subtitle Files

Parse and extract text from subtitle files.

Parameters:

title: Source title (required)
srt_file: SRT subtitle file upload (required)

Example:

with open('movie_subtitles.srt', 'rb') as f:
    files = {'srt_file': ('movie_subtitles.srt', f, 'text/plain')}
    data = {'title': 'Movie Dialogue for Language Learning'}
    
    response = requests.post(
        'http://localhost:8000/api/sources/enhanced/',
        data=data,
        files=files
    )

Features:

✅ Automatic timestamp removal
✅ HTML tag cleaning
✅ UTF-8 and Latin-1 encoding support
✅ Subtitle formatting removal
✅ 10MB file size limit

Sample SRT Format:

1
00:00:01,000 --> 00:00:04,000
Hello everyone, welcome to our lesson.

2
00:00:04,500 --> 00:00:08,000
Today we will learn advanced vocabulary.
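
To make the cleaning steps concrete, the sketch below turns SRT bytes like the sample above into plain text: it tries UTF-8 first and falls back to Latin-1, then drops index lines, timestamp lines, and HTML tags. The function names are hypothetical and the server's parser may handle more cases. One known limitation of this sketch is that a dialogue line consisting only of digits is treated as an index and dropped.

```python
import re

TIMESTAMP_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)
HTML_TAG = re.compile(r"<[^>]+>")


def decode_srt(raw: bytes) -> str:
    """Decode as UTF-8 (tolerating a BOM), falling back to Latin-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def srt_to_text(raw: bytes) -> str:
    """Keep only the dialogue lines and join them with spaces."""
    kept = []
    for line in decode_srt(raw).splitlines():
        stripped = line.strip()
        # Skip blank lines, numeric index lines, and timestamp lines.
        if not stripped or stripped.isdigit() or TIMESTAMP_LINE.match(stripped):
            continue
        kept.append(HTML_TAG.sub("", stripped))
    return " ".join(kept)
```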

5. 📝 Manual Text Input

Direct text input for custom content.

Parameters:

title: Source title (required)
manual_text: Text content (required, min 10 characters)

Example:

data = {
    'title': 'Custom Vocabulary Text',
    'manual_text': '''
    Vocabulary acquisition is fundamental to language learning.
    Students benefit from exposure to authentic materials that
    challenge their lexical knowledge and promote retention.
    '''
}

response = requests.post(
    'http://localhost:8000/api/sources/enhanced/',
    data=data
)

Features:

✅ Direct content control
✅ No external dependencies
✅ Instant processing
✅ Perfect for custom materials

🔒 Security Features

File Upload Security

Size Limits: 10MB for PDFs/SRTs
MIME Type Validation: When python-magic is available
Extension Validation: .pdf, .srt extensions required
Content Scanning: Basic malicious content detection

URL Security

Protocol Restriction: Only HTTP/HTTPS allowed
Network Blocking: Private/local networks blocked
Timeout Protection: 15-second request timeout
User Agent: Browser-like request headers are sent to reduce the chance of being blocked by target sites

Input Validation

Single Input: Exactly one input type per request
Title Validation: Title must be 2-255 characters long
Content Minimum: 10 character minimum for text
YouTube Validation: Valid video ID extraction
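
The rules above can be mirrored on the client to catch mistakes before a request is sent. This is a sketch under stated assumptions: `validate_request` is a hypothetical helper, the field names come from the error message shown in this guide, and whether the server trims whitespace before measuring lengths is not documented, so no trimming is done here.

```python
INPUT_FIELDS = ("pdf_file", "srt_file", "web_url", "youtube_url", "manual_text")


def validate_request(fields: dict) -> list:
    """Return a list of problems; an empty list means the request looks valid."""
    errors = []
    title = fields.get("title") or ""
    if not 2 <= len(title) <= 255:
        errors.append("title must be 2-255 characters")
    # Exactly one of the five input fields may be present and non-empty.
    provided = [name for name in INPUT_FIELDS if fields.get(name)]
    if len(provided) != 1:
        errors.append("exactly one input type must be provided")
    elif provided == ["manual_text"] and len(fields["manual_text"]) < 10:
        errors.append("manual_text must be at least 10 characters")
    return errors
```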

📊 Response Format

Success Response (201 Created)

{
  "id": 123,
  "title": "My Source",
  "source_type": "TEXT",
  "created_at": "2025-01-23T10:30:00Z",
  "analysis": {
    "coverage": 65.5,
    "total_words": 1240,
    "unique_words": 287,
    "known_words": 188,
    "new_words": 99,
    "words_processed": 287,
    "processing_status": "success",
    "characters": 6543,
    "pages": 3
  },
  "content_preview": "Learning vocabulary is essential for language acquisition. Students who practice regularly tend to improve...",
  "success_message": "✅ 287 unique words extracted and processed successfully!"
}

Error Response (400 Bad Request)

{
  "error": "Exactly one input type must be provided: pdf_file, srt_file, web_url, youtube_url, or manual_text"
}

Content Parsing Error

{
  "error": "Content parsing failed: No transcript found for this video"
}

🧮 Analysis Metrics

Each successful response includes detailed analysis:

| Metric | Description |
| --- | --- |
| `coverage` | Percentage of words user already knows |
| `total_words` | Total word instances in content |
| `unique_words` | Number of unique words found |
| `known_words` | Words user has already learned |
| `new_words` | New words added to learning queue |
| `words_processed` | Total words processed and stored |
| `characters` | Character count of extracted content |
| `processing_status` | `success` or `no_words_found` |
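
The numbers in the sample success response are consistent with `coverage` being `known_words` divided by `unique_words`, expressed as a percentage (188 / 287 ≈ 65.5), and with `known_words + new_words` equalling `unique_words` (188 + 99 = 287). That relationship is inferred from the sample only, not from a documented formula, so treat the sketch below as a sanity check rather than a specification. The function name is hypothetical.

```python
def coverage_percent(known_words: int, unique_words: int) -> float:
    """Share of unique words already known, as a percentage with one decimal."""
    if unique_words == 0:
        # Nothing was extracted, so there is nothing to cover.
        return 0.0
    return round(known_words / unique_words * 100, 1)


# Values taken from the sample success response in this guide.
sample = {"unique_words": 287, "known_words": 188, "new_words": 99}
print(coverage_percent(sample["known_words"], sample["unique_words"]))  # 65.5
```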

Additional Metadata by Type

PDF Files:

total_pages: Number of pages processed
characters: Total character count

Web Pages:

url: Final URL after redirects
title: Page title from <title> tag
status_code: HTTP response code

YouTube Videos:

video_id: Extracted YouTube video ID
language: Transcript language code
entries_count: Number of transcript segments

SRT Files:

subtitles_count: Number of subtitle entries
duration: Total duration from first to last subtitle

🎯 User Flow Examples

PDF Academic Paper

User uploads research paper PDF
System extracts text from all pages
Identifies 450 unique academic terms
Shows "127 new words added to your learning queue"
User starts reviewing unknown terminology

YouTube Educational Video

User pastes TED Talk URL
System fetches English transcript
Extracts 89 unique words from 18-minute video
User learns vocabulary in context of presentation

News Article

User pastes BBC News article URL
System scrapes clean content (removes ads/navigation)
Processes 156 current-affairs vocabulary items
User improves news comprehension skills

Movie Subtitles

User uploads movie SRT file
System extracts dialogue (removes timestamps)
Finds 234 conversational expressions
User learns natural spoken language patterns

🔧 Integration Examples

Frontend JavaScript

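// File upload example
const formData = new FormData();
formData.append('title', 'My PDF Document');
formData.append('pdf_file', fileInput.files[0]);

// getCSRFToken() is an application-provided helper (not defined in this guide)
fetch('/api/sources/enhanced/', {
  method: 'POST',
  body: formData,
  headers: {
    'X-CSRFToken': getCSRFToken()
  }
})
.then(response => response.json().then(data => ({ ok: response.ok, data })))
.then(({ ok, data }) => {
  if (!ok) {
    // Error responses carry an "error" field instead of "success_message"
    console.error(`Error: ${data.error}`);
    return;
  }
  console.log(`Success: ${data.success_message}`);
  console.log(`Analysis:`, data.analysis);
})
.catch(err => console.error('Request failed:', err));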

Python Script

import requests

def upload_content(title, content_type, content_data):
    """Upload content to Kelime API"""
    api_url = 'http://localhost:8000/api/sources/enhanced/'

    # Map each supported content type to the request field it is sent in
    field_names = {'text': 'manual_text', 'url': 'web_url', 'youtube': 'youtube_url'}
    if content_type not in field_names:
        # Fail early instead of reaching the status check with no response
        raise ValueError(f"Unsupported content_type: {content_type!r}")

    data = {'title': title, field_names[content_type]: content_data}
    response = requests.post(api_url, data=data)

    if response.status_code == 201:
        result = response.json()
        print(f"✅ {result['success_message']}")
        return result
    else:
        print(f"❌ Error: {response.text}")
        return None

# Usage examples
upload_content("News Article", "url", "https://bbc.com/news/article")
upload_content("Custom Text", "text", "Your vocabulary content here...")
upload_content("TED Talk", "youtube", "https://youtube.com/watch?v=...")

🚨 Error Handling

Common Error Scenarios

File Too Large

{
  "error": "Content parsing failed: File too large. Maximum size: 10MB"
}

Invalid YouTube URL

{
  "error": "Invalid YouTube URL format"
}

Transcript Not Available

{
  "error": "Content parsing failed: No transcript found for this video"
}

Website Blocked

{
  "error": "Content parsing failed: Private/local network URLs are not allowed"
}

Multiple Inputs

{
  "error": "Exactly one input type must be provided: pdf_file, srt_file, web_url, youtube_url, or manual_text"
}
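
In the samples above, errors raised while fetching or parsing content start with the prefix `Content parsing failed: `, while request-level problems do not. A client can use that to decide what to tell the user. The sketch below assumes this prefix convention holds for all parsing errors, which the guide does not state explicitly; the `kind` labels and the function name `describe_error` are hypothetical.

```python
import json

PARSING_PREFIX = "Content parsing failed: "


def describe_error(body: str) -> dict:
    """Classify an error response body as 'parsing', 'validation', or 'unknown'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. an HTML error page from a proxy.
        return {"kind": "unknown", "detail": body}
    message = payload.get("error", "") if isinstance(payload, dict) else ""
    if message.startswith(PARSING_PREFIX):
        return {"kind": "parsing", "detail": message[len(PARSING_PREFIX):]}
    if message:
        return {"kind": "validation", "detail": message}
    return {"kind": "unknown", "detail": body}
```

With the `requests` examples in this guide, this could be called as `describe_error(response.text)` whenever the status code is not 201.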

Best Practices

Always validate input on frontend before sending
Handle timeouts for web scraping (15-second limit)
Check file sizes before upload (10MB limit)
Provide user feedback during processing
Show analysis results to user after success
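
For the file-size and extension checks, a pre-upload guard might look like the sketch below. It assumes "10MB" means 10 × 1024 × 1024 bytes (the guide does not say whether the limit is decimal or binary), and `check_upload` is a hypothetical helper rather than part of the API.

```python
import os

# Assumption: the documented 10MB limit is interpreted as 10 MiB.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024
ALLOWED_EXTENSIONS = {".pdf", ".srt"}


def check_upload(path: str) -> list:
    """Return a list of problems found before uploading the file at `path`."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 10MB limit")
    return problems
```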

📈 Performance Notes

PDF Processing: ~2-5 seconds for typical documents
Web Scraping: ~3-10 seconds depending on site speed
YouTube Transcripts: ~1-3 seconds for most videos
SRT Processing: ~1-2 seconds for typical files
Manual Text: Instant processing

Optimization Tips

Cache results when possible
Process in background for large files
Show progress indicators for slow operations
Batch multiple small texts if applicable

🔄 Legacy API Compatibility

The original /api/sources/ endpoint remains available for backward compatibility:

# Legacy endpoint (still works)
data = {
    'title': 'Manual Source',
    'source_type': 'TEXT', 
    'content': 'Your text content here...'
}

response = requests.post('http://localhost:8000/api/sources/', data=data)

Migration Guide:

Replace content with manual_text
Remove source_type (auto-detected)
Use /api/sources/enhanced/ endpoint
Update response parsing for new analysis format
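
The first two migration steps amount to a small change in the request payload, sketched below. The helper name `migrate_legacy_payload` is hypothetical; it only rewrites the request fields and does not cover the endpoint change or the new response format.

```python
def migrate_legacy_payload(legacy: dict) -> dict:
    """Convert a legacy /api/sources/ payload into an enhanced-endpoint payload."""
    # Drop source_type: the enhanced endpoint detects the type automatically.
    migrated = {key: value for key, value in legacy.items() if key != "source_type"}
    # Rename content -> manual_text.
    if "content" in migrated:
        migrated["manual_text"] = migrated.pop("content")
    return migrated


legacy = {
    "title": "Manual Source",
    "source_type": "TEXT",
    "content": "Your text content here...",
}
print(migrate_legacy_payload(legacy))
```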

📞 Support & Troubleshooting

Debug Mode

In a local development environment, enable Django debug mode to see detailed error traces (do not enable it in production):

DEBUG = True  # In settings.py

Logging

Check server logs for detailed parsing information:

tail -f logs/django.log

Common Issues

Missing Dependencies: Install required packages from requirements.txt
Authentication: Ensure user is logged in before API calls
CSRF Tokens: Include CSRF token for web requests
File Permissions: Check read permissions for uploaded files

For additional support, check the Django admin logs or contact the development team.

Last updated: January 2025 - Enhanced Source API v1.0
