This project is an OCR (Optical Character Recognition) web interface for images and PDFs. The aim of the project is to study technologies such as Python, Django, Tesseract (OCR), and Continuous Integration.
First, set up a virtual environment by following the steps below:
- Open Windows PowerShell with Run as Administrator and enter the command
Set-ExecutionPolicy RemoteSigned
- Then install virtualenv with the command
pip install virtualenv
- Go to the folder where you want to create the folder for your website. (Let it be named 'VirtualEnv'; you can choose any name.)
- Open Windows PowerShell in that folder and run the command
virtualenv <folder_name>
(Let <folder_name> be 'venv_folder'.)
- Go inside the newly created folder (venv_folder) by running the command
cd <folder_name>
- Then run the command
./Scripts/activate
This activates the virtual environment.
- Then download the zip file of the code and extract it inside this folder (<folder_name>, here 'venv_folder').
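Before installing the requirements, it can help to confirm that the environment really is active. The following is a minimal sketch using only the standard library; the function name `in_virtualenv` is made up for this illustration and is not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv or a recent virtualenv, sys.prefix points at the environment
    # while sys.base_prefix points at the base Python installation.
    # Very old virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line reports False, run the activate script again in the same PowerShell window before calling pip.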
Then run the commands below:
pip install -r requirements.txt
python manage.py migrate
python manage.py runserver
The site is then available at the local URL: localhost:8000
requirements.txt includes a package called pytesseract. It is the wrapper that communicates with the Tesseract library (C/C++ code). So the next step is to install Tesseract itself.
For this, please follow the instructions below for your OS:
If an additional language is required, it is necessary to download it from here and move it to $TESSERACT_PATH/tessdata/
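As a rough way to check the Tesseract installation and an added language, the sketch below looks for the `tesseract` executable on PATH and for a language file under the tessdata folder. It assumes that `TESSERACT_PATH` is set as an environment variable (the text above may only mean it as a placeholder), and the helper names and the example language code `por` are chosen for illustration. Tesseract language files are named `<lang>.traineddata`.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def traineddata_path(tesseract_path: str, lang: str) -> Path:
    """Build the expected location of a language file under <tesseract_path>/tessdata/."""
    return Path(tesseract_path) / "tessdata" / f"{lang}.traineddata"


def language_installed(tesseract_path: str, lang: str) -> bool:
    """Return True if the language file exists at the expected location."""
    return traineddata_path(tesseract_path, lang).is_file()


if __name__ == "__main__":
    print("tesseract executable:", find_tesseract() or "not found on PATH")
    # Assumption: TESSERACT_PATH is exported as an environment variable.
    tesseract_path = os.environ.get("TESSERACT_PATH")
    if tesseract_path:
        print("'por' language file present:", language_installed(tesseract_path, "por"))
    else:
        print("TESSERACT_PATH is not set; skipping the language check")
```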
- After running the commands above (up to python manage.py runserver), open localhost:8000 in your browser.
- There is a file upload section in the centre of the screen. You can choose a file by clicking the 'Browse' button or anywhere on the file input bar.
- This opens a dialog box for selecting the file to upload from your local machine.
- Select the file and click 'Submit'. Accepted file extensions are .jpeg, .jpg, .png and .pdf; a file with any other extension triggers an alert.
- You will be presented with two links for downloading the output: one for the .docx file and one for the .pdf file.
- Once the links have been shown, the website is ready to be used again.
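The extension rule in the steps above can be sketched as a small check. This is only an illustration of the rule, not the project's actual validation code (the alert suggests the real check may run in the browser); the names `ACCEPTED_EXTENSIONS` and `is_accepted_upload` are hypothetical, and treating extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions listed in the usage steps.
ACCEPTED_EXTENSIONS = {".jpeg", ".jpg", ".png", ".pdf"}


def is_accepted_upload(filename: str) -> bool:
    """Return True if the file name ends with one of the accepted extensions."""
    # Path.suffix returns only the last extension, e.g. ".gz" for "a.tar.gz".
    # Lower-casing makes "SCAN.PDF" acceptable, which is an assumption here.
    return Path(filename).suffix.lower() in ACCEPTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("invoice.pdf", "photo.JPG", "notes.txt"):
        print(name, "accepted" if is_accepted_upload(name) else "rejected")
```

A server-side check like this is a useful complement to a browser alert, since a client-side check alone can be bypassed.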
- Django==4.0.4
- django-widget-tweaks==1.4.12
- Pillow==9.1.0
- pytesseract==0.3.9
- python-docx==0.8.11
- fpdf==1.7.2
- PyMuPDF==1.19.6
- Bootstrap
- jQuery
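To see whether the installed Python packages match the pinned versions listed above, something like the following sketch can be run inside the virtual environment. It uses `importlib.metadata` from the standard library (Python 3.8+); the helper names are made up for this example. Bootstrap and jQuery are front-end libraries rather than pip packages, so they are not checked.

```python
from importlib import metadata
from typing import Dict, List, Optional, Tuple

# Pinned Python packages copied from the list above.
PINS = [
    "Django==4.0.4",
    "django-widget-tweaks==1.4.12",
    "Pillow==9.1.0",
    "pytesseract==0.3.9",
    "python-docx==0.8.11",
    "fpdf==1.7.2",
    "PyMuPDF==1.19.6",
]


def parse_pin(line: str) -> Tuple[str, str]:
    """Split a 'name==version' pin into (name, version)."""
    name, _, version = line.partition("==")
    return name.strip(), version.strip()


def check_pins(pins: List[str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package name to (pinned version, installed version or None)."""
    result = {}
    for line in pins:
        name, wanted = parse_pin(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        result[name] = (wanted, installed)
    return result


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(PINS).items():
        status = "ok" if installed == wanted else "mismatch"
        print(f"{name}: pinned {wanted}, installed {installed} ({status})")
```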

