Rather slow processing for 0.1% of documents

Hello @abimaelmartell,

I have run this package on around 100k public procurement PDF documents, and overall it's quite robust and fast. However, a few rare documents are very slow to process.

For instance, below is the distribution of compute time by run-time percentile. In my case, the slowest 0.1% (114 documents) took 40% of the total compute time.

| Slice | # Docs | Processing time at threshold | % of Total Time |
|---|---|---|---|
| Top 50% (above median) | 56,637 | 0.03s | 97.9% |
| Top 25% (above q75) | 28,319 | 0.07s | 92.5% |
| Top 10% (above q90) | 11,328 | 0.19s | 84.5% |
| Top 1% (above q99) | 1,133 | 2.18s | 62.6% |
| Top 0.1% (above q99.9) | 114 | 19.08s | 40.4% |
| **Total** | **113,274** | n/a | **100%** (24,115s, ~6.7h) |
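
A breakdown like the one above can be reproduced from raw per-document timings. The following is a minimal sketch, assuming the timings are available as a plain list of seconds; `tail_share` is a hypothetical helper name, not part of the package, and the sample timings are made up for illustration.

```python
import math
from typing import Iterable, Tuple


def tail_share(times: Iterable[float], top_fraction: float) -> Tuple[int, float, float]:
    """Summarise the slowest `top_fraction` of documents.

    Returns (number of documents in the slice,
             time of the fastest document in the slice, i.e. the threshold,
             share of total compute time spent on the slice).
    """
    ordered = sorted(times, reverse=True)  # slowest first
    if not ordered:
        raise ValueError("no timings given")
    # Round up so that small fractions still select at least one document;
    # the tiny epsilon guards against floating-point noise such as 3.0000000001.
    n_docs = max(1, math.ceil(len(ordered) * top_fraction - 1e-9))
    slowest = ordered[:n_docs]
    total = sum(ordered)
    share = sum(slowest) / total if total else 0.0
    return n_docs, slowest[-1], share


# Made-up timings (seconds), only to show the call pattern.
sample = [0.02, 0.03, 0.03, 0.05, 0.07, 0.1, 0.2, 0.4, 2.0, 20.0]
for fraction in (0.5, 0.25, 0.1):
    n_docs, threshold, share = tail_share(sample, fraction)
    print(f"top {fraction:.0%}: {n_docs} docs, threshold {threshold}s, {share:.1%} of time")
```

The threshold here is simply the fastest timing inside the slice, which may differ slightly from an interpolated quantile.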

Looking at the slowest documents in more detail:

| Rank | Pages | Size (MB) | Time (s) |
|---|---|---|---|
| 1 | 1,069 | 486.5 | 3903 |
| 2 | 66 | 124.5 | 222.5 |
| 3 | 1,326 | 82.2 | 213.9 |
| 4 | 134 | 199.4 | 182.9 |


So there is one document of about 1,000 pages that took roughly an hour to process, while the others, which are not that large, took around 3 to 4 minutes each.
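
To check whether size alone explains these timings, the four rows can be normalised per page and per MB. This is a small sketch using only the numbers from the table above; `normalized_costs` is a hypothetical helper name.

```python
# Rows copied from the table above: (rank, pages, size_mb, seconds)
SLOWEST = [
    (1, 1069, 486.5, 3903.0),
    (2, 66, 124.5, 222.5),
    (3, 1326, 82.2, 213.9),
    (4, 134, 199.4, 182.9),
]


def normalized_costs(rows):
    """Return seconds per page and seconds per MB for each document."""
    return [
        {
            "rank": rank,
            "s_per_page": seconds / pages,
            "s_per_mb": seconds / size_mb,
        }
        for rank, pages, size_mb, seconds in rows
    ]


for row in normalized_costs(SLOWEST):
    print(f"rank {row['rank']}: {row['s_per_page']:.2f} s/page, {row['s_per_mb']:.2f} s/MB")
```

The per-page cost varies by more than an order of magnitude (about 3.65 s/page for rank 1 versus about 0.16 s/page for rank 3), so page count and file size alone do not seem to explain the slowdown.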

I'm still investigating, but I suspect these are documents with a lot of vector graphics that the table extraction code does not handle well.
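
One way to test that hypothesis is to profile a single slow document and see where the time goes. The sketch below uses the standard-library `cProfile`; `profile_call` is a hypothetical helper, and `busy_work` is a stand-in for whatever function processes one document (the package's real entry point is not assumed here). If the heavy lifting happens in native code, `cProfile` will only show the top-level call, and a native profiler would be needed instead.

```python
import cProfile
import io
import pstats
from typing import Any, Callable


def profile_call(func: Callable[..., Any], *args: Any, top: int = 15) -> str:
    """Run func(*args) under cProfile and return a text report of the
    `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def busy_work(n: int) -> int:
    """Placeholder workload; replace with the call that processes one PDF."""
    return sum(i * i for i in range(n))


print(profile_call(busy_work, 100_000))
```

Running this on one of the four documents listed above, with the placeholder swapped for the real processing call, should show whether table extraction dominates the cumulative time.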
