This repository was archived by the owner on Dec 31, 2025. It is now read-only.
Closed
Contributor
Can you remind me of the status of this?
Contributor
Author
The PR to libpff is still not merged. The build is passing because it's using my fork. Implementation details are still the same as described in the original comment. It pulls out more files than
Fixes https://github.com/alephdata/ingestors/issues/13.
Some notes about the implementation:
In this implementation, we open the PST file and walk all of its folders recursively. Each folder is exported to a temp directory, along with the emails and any other files we can recognize, preserving the folder hierarchy. We then feed that temp directory to the DirectoryIngestor.
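The recursive export can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Folder` and `Item` are hypothetical in-memory stand-ins for whatever the PST library returns, and names are assumed to be safe to use as file names.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Item:
    """Hypothetical stand-in for a message or file found in a PST folder."""
    name: str
    data: bytes


@dataclass
class Folder:
    """Hypothetical stand-in for a PST folder; this is not the libpff API."""
    name: str
    items: list = field(default_factory=list)
    sub_folders: list = field(default_factory=list)


def export_folder(folder: Folder, parent: Path) -> None:
    """Write a folder, its items, and its sub-folders below `parent`."""
    # Assumes folder and item names contain no path separators.
    target = parent / folder.name
    target.mkdir(parents=True, exist_ok=True)
    for item in folder.items:
        (target / item.name).write_bytes(item.data)
    # Recurse so the on-disk layout mirrors the folder hierarchy.
    for child in folder.sub_folders:
        export_folder(child, target)


def export_tree(root: Folder) -> Path:
    """Export the whole tree into a fresh temp directory and return its path."""
    temp_dir = Path(tempfile.mkdtemp(prefix="pst-export-"))
    export_folder(root, temp_dir)
    return temp_dir
```

The returned path is what would then be handed to the directory ingestor; cleaning up the temp directory afterwards is left to the caller in this sketch.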
There are two issues with this implementation as far as I can see.
Ideally, we should parse the email files only once. With this implementation we end up parsing them twice: once to export them and again to ingest them.
Some files are not parsed correctly. For example, some messages don't have transport headers, so they are parsed as HTML files. Some of these HTML files have attachments; for now I'm exporting those attachments as separate files in the same parent folder. Similarly, some messages contain only RTF text. Aleph tries to show them as PDF documents, which fails.
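To make the second issue concrete, the choice of export format can be pictured like this. It is a hedged sketch, not the PR's code: `choose_extension` and its parameters are hypothetical, the extensions are illustrative, and the plain-text fallback is an assumption that the description above does not mention.

```python
from typing import Optional


def choose_extension(
    transport_headers: Optional[str],
    html_body: Optional[bytes],
    rtf_body: Optional[bytes],
) -> str:
    """Pick a file extension for an exported message (hypothetical helper)."""
    if transport_headers:
        # Messages with transport headers can be written out as emails.
        return "eml"
    if html_body:
        # No headers: the message ends up as an HTML file, and any
        # attachments have to be exported separately next to it.
        return "html"
    if rtf_body:
        # RTF-only messages are the ones that currently fail to display.
        return "rtf"
    # Assumption: fall back to plain text when nothing else is available.
    return "txt"
```

For example, `choose_extension(None, b"<p>hi</p>", None)` returns `"html"`, which is the case where attachments end up as sibling files in the same parent folder.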
To avoid parsing the files twice, I tried implementing the PST ingestor in a non-recursive way. That didn't work out well: Aleph expects ingestion to be recursive and otherwise doesn't create any child document for a nested result.
On the libpff side, this PR depends on libyal/libpff#69 getting merged, so I'm waiting on that to fix the build. Alternatively, we could build from source from a fork.