https://github.com/pymupdf/pymupdf4llm converts PDFs to Markdown text. There's no semantic tree-sitter support for the result (I found the Markdown grammar underwhelming), but would you take a PR that ingests PDFs and indexes them by chunks (with seek, etc.)? Claude doesn't handle PDFs very well by default. I've already implemented this on my branch (https://github.com/DieracDelta/coderlm) and have had good results with it so far. It's a big improvement over the default behavior of giving up when a PDF is too large, and the recursion helps with the top-level context length as usual.
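To make "indexes them by chunks (with seek)" concrete, here is a rough sketch of what I mean. It is not the code on my branch; `Chunk`, `ChunkIndex`, `read`, `seek`, and the fixed `chunk_size` are all made-up names and choices for illustration, and it assumes the PDF has already been converted to one text string.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int  # position of the chunk in the document
    start: int  # character offset of the chunk's first character
    end: int    # character offset one past the chunk's last character


class ChunkIndex:
    """Hypothetical fixed-size chunk index over already-extracted text."""

    def __init__(self, text: str, chunk_size: int = 2000) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.text = text
        # Split the text into consecutive, non-overlapping character ranges.
        self.chunks = [
            Chunk(i, start, min(start + chunk_size, len(text)))
            for i, start in enumerate(range(0, len(text), chunk_size))
        ]
        self._starts = [c.start for c in self.chunks]

    def read(self, index: int) -> str:
        """Return the text of a single chunk, so only that chunk enters the context."""
        c = self.chunks[index]
        return self.text[c.start:c.end]

    def seek(self, offset: int) -> Chunk:
        """Return the chunk containing the given character offset."""
        if not 0 <= offset < len(self.text):
            raise IndexError("offset out of range")
        # Binary search for the last chunk whose start is <= offset.
        return self.chunks[bisect_right(self._starts, offset) - 1]
```

In practice the text would come from something like `pymupdf4llm.to_markdown("paper.pdf")` (the filename is a placeholder), and the model would then call `read` and `seek` on demand instead of loading the whole document at once. Splitting on headings or pages would probably give better chunks than fixed-size ranges; fixed sizes just keep the sketch short.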