Hello @abimaelmartell ,
I have run this package on around 100k public procurement PDF documents, and overall it's quite robust and fast. However, there are rare documents that are very slow to process.
For instance, below is the distribution of compute time by runtime percentile. In my case, the slowest 0.1% (114 documents) took 40% of the total compute time.
| Slice | # Docs | Processing time at threshold | % of Total Time |
|---|---|---|---|
| Top 50% (above median) | 56,637 | 0.03s | 97.9% |
| Top 25% (above q75) | 28,319 | 0.07s | 92.5% |
| Top 10% (above q90) | 11,328 | 0.19s | 84.5% |
| Top 1% (above q99) | 1,133 | 2.18s | 62.6% |
| Top 0.1% (above q99.9) | 114 | 19.08s | 40.4% |
| Total | 113,274 | — | 24,115s (~6.7h) |
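For reference, the slices above can be computed from a flat list of per-document runtimes; here's a minimal sketch (pure Python, with a toy timing distribution standing in for the real data):

```python
# Sketch: compute "slowest X% of docs account for Y% of total time"
# from a list of per-document runtimes in seconds.

def tail_share(times, top_fraction):
    """Return (n_docs, threshold_s, share_of_total) for the slowest slice."""
    ordered = sorted(times, reverse=True)
    n = max(1, int(len(ordered) * top_fraction))
    slice_time = sum(ordered[:n])
    return n, ordered[n - 1], slice_time / sum(ordered)

# Toy distribution: mostly fast docs, a few pathological ones.
times = [0.01] * 900 + [0.5] * 90 + [30.0] * 10
for frac in (0.5, 0.25, 0.1, 0.01):
    n, thresh, share = tail_share(times, frac)
    print(f"top {frac:>5.0%}: {n:4d} docs, >= {thresh:.2f}s, {share:.1%} of total")
```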
Looking in more detail at what those slowest documents are:
| Rank | Pages | Size (MB) | Time (s) |
|---|---|---|---|
| 1 | 1,069 | 486.5 | 3903 |
| 2 | 66 | 124.5 | 222.5 |
| 3 | 1,326 | 82.2 | 213.9 |
| 4 | 134 | 199.4 | 182.9 |
So there is one ~1,000-page document that took over an hour to process, while others that are not that large still took around 3.5 minutes.
I'm still investigating, but I suspect these are related to documents with lots of vector graphics, which the table extraction code doesn't handle well.