- Install Python 3.9 or higher. You can use either a virtual environment or your global environment; I have used the global environment. To create a virtual environment, run `python3 -m venv venv` and activate it with `source venv/bin/activate`.
- Install the dependencies using `pip install -r requirements.txt`.
- Run the program using `python3 main.py`. This will start the FastAPI server on `localhost:8000`. You can browse to http://localhost:8000 to access the app.
- The program will automatically detect whether a GPU is available and use it for inference.
- I have used Salesforce's BLIP model (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation) for inference. You can also try a different image captioning model by changing the lines that initialize the model and tokenizer in `main.py`. You can read more about the model in its paper.
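The GPU check mentioned above is typically a one-liner with PyTorch; a minimal sketch (assuming the project uses PyTorch, which `requirements.txt` would pull in):

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running inference on: {device}")
```

Passing `device` to the model and to the input tensors keeps inference on the same hardware either way.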
Note: The program will download the model and tokenizer if they are not already present, so the first run might take a while; BLIP is ~990 MiB in size. Have some patience while executing the program :) . Also, if you don't have a GPU, requests may take longer to process.
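For reference, the download-and-cache behaviour described above is what Hugging Face `from_pretrained` does on first use. A hedged sketch of how the BLIP initialization in `main.py` might look (the exact checkpoint name `Salesforce/blip-image-captioning-base` is an assumption; swap it out to try another captioning model):

```python
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor

# First call downloads and caches the weights; later runs reuse the local cache.
checkpoint = "Salesforce/blip-image-captioning-base"  # assumed checkpoint name
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

# Move the model to the GPU when one is available, as described above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```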