Commit 9b8de97

little cleanup, do OCR until the borders
1 parent c7b4427 commit 9b8de97

5 files changed

Lines changed: 5 additions & 39 deletions

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -2,4 +2,5 @@
 .gitattributes export-ignore
 requirements.txt export-ignore
 doc export-ignore
+scripts export-ignore
 README.md export-ignore

README.md

Lines changed: 3 additions & 8 deletions
@@ -49,7 +49,7 @@ You have two options for storing your OCRed PDFs:
 
 1. **Google drive**:
 - You need to first create a google api in the developers console, and turn on the google drive api [as described here](https://developers.google.com/drive/v3/web/quickstart/python#step_1_turn_on_the_api_name).
-- Copy the resulting `client_secret.json` into this projects root, then `pip install oauth2client` and then run `python get_drive_credentials`. Now, copy-paste the resulting values into the environment variables. This grants your lambda function to create files in your google drive and to access the files it created (which it won't need). See [here](https://developers.google.com/drive/v2/web/about-auth) for more details about the right you're granting.
+- Copy the resulting `client_secret.json` into this projects root, then `pip install oauth2client` and then run `python scripts/get_drive_credentials`. Now, copy-paste the resulting values into the environment variables. This grants your lambda function to create files in your google drive and to access the files it created (which it won't need). See [here](https://developers.google.com/drive/v2/web/about-auth) for more details about the right you're granting.
 - Optional: If you wish your PDFs to be stored in a specific folder, go to that folder in your google drive, copy the part in the url after `/folders/` and put that into an additional environment variabled named `GDRIVE_FOLDER`
 2. **S3**: This is a lot easier as you'll only need to create an s3 bucket (in the same region as your lambda function) and add these lines to your policy (replace `<dest-bucket>`):
 ```
@@ -108,14 +108,9 @@ From the `Add triggers` menu on the left choose `S3`, then in `Configure trigger
 cd root/of/repo
 virtualenv --python=python3.6 .
 pip install -r requirements.txt
-rm -f ocr-lambda.zip
-git archive -o ocr-lambda.zip HEAD
-cd lib/python3.6/site-packages
-zip -r ../../../ocr-lambda.zip .
-cd -
-zip ocr-lambda.zip tessdata/*.traineddata # if you use additional languages
+scripts/zip.sh
 aws s3 cp ocr-lambda.zip s3://<s3-bucket>/
-aws lambda update-function-code --function-name <lamba-name> --s3-bucket <s3-bucket> --s3-key ocr-lambda.zip
+aws lambda update-function-code --function-name <lambda-name> --s3-bucket <s3-bucket> --s3-key ocr-lambda.zip
 ```
 
 # Further docs
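
Note on the packaging change: the contents of `scripts/zip.sh` are not shown in this commit view, so what it does is an assumption. If it simply wraps the removed commands, the "add everything under `site-packages` to the archive" step could be sketched in Python roughly as follows. The helper name `add_tree_to_zip` is hypothetical and is not part of the repository.

```python
import os
import zipfile


def add_tree_to_zip(zip_path, root_dir):
    """Append every file under root_dir to zip_path.

    Paths inside the archive are relative to root_dir, which mirrors
    running `zip -r <archive> .` from inside that directory.
    Hypothetical helper: the real scripts/zip.sh may work differently.
    """
    added = []
    # Mode 'a' appends to an existing archive or creates a new one.
    with zipfile.ZipFile(zip_path, 'a', zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                full_path = os.path.join(dirpath, name)
                arcname = os.path.relpath(full_path, root_dir)
                archive.write(full_path, arcname)
                added.append(arcname)
    return added
```

Called as `add_tree_to_zip('ocr-lambda.zip', 'lib/python3.6/site-packages')`, this would place the installed packages at the archive root, which is where the removed `cd lib/python3.6/site-packages` and `zip -r` lines put them.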

gdrive.py

Lines changed: 0 additions & 31 deletions
This file was deleted.

ocr.py

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ def ocr(tar_gz_filename, empty_page_threshold, language='eng'):
 output = PdfWriter()
 for filename in tar.getnames():
     cmd = ['./tesseract', '-l', language,
+           '-c', 'min_orientation_margin=0',  # don't leave out characters close to border
            '{}/{}'.format(TMP_DIR, filename),
            '{}/partial'.format(TMP_DIR),
            'pdf']
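
Note on the `ocr.py` change: the added `-c` flag sets a Tesseract config variable in `NAME=VALUE` form. According to the commit's inline comment, `min_orientation_margin=0` is meant to keep characters near the page border from being dropped. A minimal sketch of how the same argument list could be built in isolation is shown below. The function name `build_tesseract_cmd`, the `config_vars` parameter, and the `/tmp` value for `TMP_DIR` are assumptions for illustration, not code from the repository.

```python
TMP_DIR = '/tmp'  # assumption: placeholder value; ocr.py defines its own TMP_DIR


def build_tesseract_cmd(filename, language='eng', config_vars=None):
    """Return the argv list for one page, mirroring the list built in ocr().

    Hypothetical helper: config_vars maps Tesseract config variable names
    to values, each of which is passed as `-c NAME=VALUE`.
    """
    if config_vars is None:
        # The single variable added by this commit.
        config_vars = {'min_orientation_margin': '0'}
    cmd = ['./tesseract', '-l', language]
    for name, value in config_vars.items():
        cmd += ['-c', '{}={}'.format(name, value)]
    cmd += ['{}/{}'.format(TMP_DIR, filename),   # input image
            '{}/partial'.format(TMP_DIR),        # output base name
            'pdf']                               # output format
    return cmd
```

Keeping the command as a list, as the original code does, means it can be handed to `subprocess` without shell quoting concerns.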
