A deep learning system for recognizing alphanumeric CAPTCHA images of variable lengths using a Convolutional Recurrent Neural Network (CRNN) architecture with CTC loss.
- Exact Match Accuracy: 63.4%
- Character Error Rate (CER): 0.100
- Model Architecture: CNN + Bidirectional LSTM with CTC decoding
Note: CER is based on the Levenshtein distance, i.e. the number of insertions, deletions, and substitutions required to turn the output string into the correct string. Character Error Rate = total number of edits divided by total number of characters.
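The two metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's evaluation code: the function names are hypothetical, and it assumes the denominator of CER counts characters in the correct (reference) strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def character_error_rate(predictions, targets):
    """Total edits over total reference characters (assumed denominator)."""
    total_edits = sum(levenshtein(p, t) for p, t in zip(predictions, targets))
    total_chars = sum(len(t) for t in targets)
    return total_edits / total_chars


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target string exactly."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


preds = ["ab12", "xyz", "q9"]
truth = ["ab12", "xyZ", "q99"]
print(character_error_rate(preds, truth))  # 2 edits over 10 characters
print(exact_match_accuracy(preds, truth))  # 1 of 3 strings matches exactly
```

A single wrong character makes the whole string count as a miss for exact match but adds only one edit to CER, which is why the two numbers can differ widely.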
CNN Feature Extractor
- Residual blocks for deep feature learning
- 4-channel input (RGB + Sobel edge detection)
- Multiple Conv + Pooling layers
- Output: 256-channel feature maps
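As a rough illustration of the 4-channel input, the sketch below appends a Sobel edge-magnitude map to an RGB image using NumPy. The grayscale conversion (channel mean), the max-normalisation, the `(H, W, C)` layout, and the helper names are all assumptions made for this example; the project's own preprocessing may differ.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T


def filter3x3(gray, kernel):
    """Cross-correlate a 2-D image with a 3x3 kernel; edge padding keeps the size."""
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def add_sobel_channel(rgb):
    """rgb: float array of shape (H, W, 3). Returns an (H, W, 4) float32 array."""
    rgb = rgb.astype(np.float32)
    gray = rgb.mean(axis=2)                # assumed grayscale conversion
    gx = filter3x3(gray, SOBEL_X)          # horizontal gradient
    gy = filter3x3(gray, SOBEL_Y)          # vertical gradient
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    peak = magnitude.max()
    if peak > 0:
        magnitude = magnitude / peak       # scale edge map to [0, 1]
    return np.concatenate([rgb, magnitude[..., None]], axis=2)


image = np.zeros((4, 6, 3), dtype=np.float32)
image[:, 3:, :] = 1.0                      # vertical edge between columns 2 and 3
print(add_sobel_channel(image).shape)      # (4, 6, 4)
```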
Recurrent Sequence Modeling
- 2-layer Bidirectional LSTM (256 hidden units)
- Sequence modeling of variable-length CAPTCHAs
- Dropout for regularization
CTC Decoding
- Connectionist Temporal Classification loss
- Greedy decoding for inference
- Supports variable-length output sequences
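Greedy CTC decoding takes the highest-scoring class at each time step, merges consecutive repeats, and then drops blanks. The sketch below shows this on plain Python lists. The blank index (`0`), the character set in `ALPHABET`, and the function name are assumptions for illustration, not values taken from the project.

```python
from itertools import groupby

BLANK = 0                                           # assumed index of the CTC blank
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"   # hypothetical character set


def greedy_ctc_decode(frame_scores):
    """frame_scores: one list of len(ALPHABET) + 1 class scores per time step."""
    # 1. Pick the best class independently at every time step.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2. Merge runs of the same class into a single symbol.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Remove blanks; class k (k >= 1) maps to ALPHABET[k - 1].
    return "".join(ALPHABET[idx - 1] for idx in collapsed if idx != BLANK)


def one_hot(index, size=len(ALPHABET) + 1):
    """Helper for the demo: a score vector that peaks at `index`."""
    scores = [0.0] * size
    scores[index] = 1.0
    return scores


# Best path: a a _ a b _ _ 2  (the blank between the a's keeps both of them)
path = [11, 11, 0, 11, 12, 0, 0, 3]
print(greedy_ctc_decode([one_hot(i) for i in path]))  # aab2
```

Because repeats are merged before blanks are removed, a doubled character survives only when a blank separates the two occurrences, and the output length is free to vary with the input.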
- Data Augmentation: Color jitter, affine transforms, blurring
- Edge Enhancement: Sobel edge detection as additional channel
- Image Preprocessing: Noise removal, contrast enhancement, aspect-preserving resize
- Batch Processing: Handles variable-width images via padding
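Padding variable-width images into one batch can be sketched as follows with NumPy. The `(C, H, W)` layout, right-side zero padding, and the `pad_batch` name are assumptions for this example; the project may pad differently.

```python
import numpy as np


def pad_batch(images, pad_value=0.0):
    """Stack (C, H, W_i) arrays of equal C and H into one (N, C, H, W_max) array.

    Returns the padded batch and the list of original widths.
    """
    widths = [img.shape[2] for img in images]
    max_w = max(widths)
    c, h = images[0].shape[:2]
    batch = np.full((len(images), c, h, max_w), pad_value, dtype=np.float32)
    for i, img in enumerate(images):
        batch[i, :, :, :img.shape[2]] = img   # copy image, leave the right side padded
    return batch, widths


narrow = np.ones((4, 32, 50), dtype=np.float32)
wide = np.ones((4, 32, 80), dtype=np.float32)
batch, widths = pad_batch([narrow, wide])
print(batch.shape, widths)  # (2, 4, 32, 80) [50, 80]
```

Keeping the original widths alongside the batch is useful because they can be used to work out how many time steps of each sequence correspond to real image content rather than padding.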
- Python 3.10
- CUDA-capable GPU (recommended)
conda env create -f environment.yml
pip install -r requirements.txt
Training script
python -m src.train
Inference script (runs inference on all test images, N=2000)
python -m src.evaluate
Predict a single image
python -m src.predict path/to/captcha.png
Visualise augmentation:
python -m src.visualise_aug
Use config.yaml to adjust hyperparameters.
- Beam search decoding did not yield improvements and slows computation down significantly, given Python's execution speed.
- Mapping to HSV space instead of RGB did not yield significant improvements. It was initially tested for segmenting overlapping characters with different colours.
- Adding squeeze-and-excitation blocks did not yield a significant improvement over the plain ResNet. Refer to https://medium.com/@tahasamavati/squeeze-and-excitation-explained-387b5981f249
- An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (Shi et al., 2015)
- Deep Residual Learning for Image Recognition (He et al., 2015)
https://drive.google.com/drive/folders/1JikBA_bt7HwUYge73WuohRibamdsBTcC?usp=drive_link
