Rather slow processing for 0.1 % of documents #5

@rth

Hello @abimaelmartell,

I have run this package on around 100k public procurement PDF documents, and overall it's quite robust and fast. However, a few rare documents are rather slow to process.

Below is the distribution of compute time by run time percentile. In my case, the slowest 0.1% of documents (114 documents) took 40% of the total compute time.

| Slice | # Docs | Processing time at threshold | % of Total Time |
|---|---|---|---|
| Top 50% (above median) | 56,637 | 0.03s | 97.9% |
| Top 25% (above q75) | 28,319 | 0.07s | 92.5% |
| Top 10% (above q90) | 11,328 | 0.19s | 84.5% |
| Top 1% (above q99) | 1,133 | 2.18s | 62.6% |
| Top 0.1% (above q99.9) | 114 | 19.08s | 40.4% |
| Total | 113,274 | 24,115s (~6.7h) | |
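For reference, a breakdown like the one above can be computed from a plain list of per-document timings. This is only a minimal sketch: the function name `time_share_by_slice`, the nearest-rank definition of each slice, and the synthetic timings are assumptions for illustration, not the script used to produce the table.

```python
import math
import random


def time_share_by_slice(times, top_fractions=(0.5, 0.25, 0.1, 0.01, 0.001)):
    """For each top fraction, report how many documents fall in the slowest
    slice, the fastest time inside that slice (the threshold), and the share
    of total processing time that the slice accounts for."""
    ordered = sorted(times)
    n = len(ordered)
    total = sum(ordered)
    rows = []
    for fraction in top_fractions:
        # Nearest-rank slice size; the small epsilon guards against float
        # noise such as 0.1 * 30 evaluating to 3.0000000000000004.
        k = max(1, math.ceil(n * fraction - 1e-9))
        slowest = ordered[n - k:]
        rows.append({
            "top_fraction": fraction,
            "n_docs": k,
            "threshold_s": slowest[0],
            "share_of_total": sum(slowest) / total,
        })
    return rows


if __name__ == "__main__":
    # Synthetic heavy-tailed timings, used only to exercise the function.
    rng = random.Random(0)
    timings = [rng.lognormvariate(-3.5, 2.0) for _ in range(10_000)]
    for row in time_share_by_slice(timings):
        print(
            f"top {row['top_fraction']:.1%}: {row['n_docs']} docs, "
            f"threshold {row['threshold_s']:.2f}s, "
            f"{row['share_of_total']:.1%} of total time"
        )
```

With a heavy-tailed distribution, the share of total time stays high even for very small slices, which is the pattern the table shows.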

Looking in more detail at the slowest documents:

| Rank | Pages | Size (MB) | Time (s) |
|---|---|---|---|
| 1 | 1,069 | 486.5 | 3903 |
| 2 | 66 | 124.5 | 222.5 |
| 3 | 1,326 | 82.2 | 213.9 |
| 4 | 134 | 199.4 | 182.9 |
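To see whether run time tracks page count or file size, the four rows above can be normalized per page and per MB. This is just arithmetic on the numbers in the table; the field names `s_per_page` and `s_per_mb` are made up for this sketch.

```python
# The four slowest documents, copied from the table above.
slowest = [
    {"rank": 1, "pages": 1069, "size_mb": 486.5, "time_s": 3903.0},
    {"rank": 2, "pages": 66, "size_mb": 124.5, "time_s": 222.5},
    {"rank": 3, "pages": 1326, "size_mb": 82.2, "time_s": 213.9},
    {"rank": 4, "pages": 134, "size_mb": 199.4, "time_s": 182.9},
]


def normalize(doc):
    """Add seconds-per-page and seconds-per-MB to a document record."""
    return {
        **doc,
        "s_per_page": doc["time_s"] / doc["pages"],
        "s_per_mb": doc["time_s"] / doc["size_mb"],
    }


normalized = [normalize(doc) for doc in slowest]

if __name__ == "__main__":
    for doc in normalized:
        print(
            f"rank {doc['rank']}: {doc['s_per_page']:.2f} s/page, "
            f"{doc['s_per_mb']:.2f} s/MB"
        )
```

If the per-page cost differs a lot between documents, neither page count nor file size alone explains the slowdown, which would fit a cause tied to page content.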

So one document of about 1,000 pages took over an hour to process, while the next slowest ones, which are smaller files, took roughly 3 to 4 minutes each.

I'm still investigating, but I think this may be related to documents with lots of vector graphics, which the table extraction code does not handle well.
