Streaming parser #205

Open
Daniel-Diaz wants to merge 6 commits into snoyberg:master from Daniel-Diaz:master

Conversation

@Daniel-Diaz

I added the ability to consume yaml files incrementally using conduits. This is exposed by the parseStream function in the module Data.Yaml.Internal.

The motivation was simply wanting to consume large files without loading them fully into memory. I don't know how much demand there is for something like this, but I wanted it.

I was thinking maybe to define a wrapper in Data.Yaml that would be easier to use and discover. However, I wasn't sure what that would look like. In any case, exporting the conduit works well enough for myself.
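For illustration only, a more discoverable wrapper in Data.Yaml might expose a signature along these lines; decodeFileStreaming is a hypothetical name and not part of this PR, which only exports the raw conduit:

```haskell
-- Hypothetical wrapper (sketch only, not part of this PR): stream a YAML
-- file containing many values, decoding each one with its FromJSON instance.
decodeFileStreaming :: FromJSON a => FilePath -> ConduitM i a Parse ()
```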

I also had to export ParseState, since otherwise I wasn't able to run the parser.

To try it out, I wrote the following code:

{-# LANGUAGE ImportQualifiedPost, OverloadedStrings #-}

module Main (main) where

import Data.Void (Void)
import Data.Aeson (FromJSON, parseJSON, withObject, (.:), fromJSON)
import Data.Aeson qualified as JSON
import Conduit (ConduitM, runConduit, (.|), runResourceT)
import Data.Conduit.List qualified as CL
import Data.Yaml.Internal (Parse, ParseState (..), parseStream)
import Control.Monad.Reader (runReaderT)
import Control.Monad.State.Strict (evalStateT)
import Text.Libyaml (decodeFile)

data ID = ID { id_field :: Int }

instance FromJSON ID where
  parseJSON = withObject "ID" $ \o -> fmap ID $ o .: "id"

main :: IO ()
main = do
  let f :: Int -> JSON.Value -> Int
      f i v =
        case fromJSON v of
          JSON.Error err -> error err
          JSON.Success x -> i + id_field x
  let conduit :: ConduitM () Void Parse Int
      conduit = decodeFile "big.yaml"
             .| runReaderT parseStream []
             .| CL.fold f 0
  r <- runResourceT $ evalStateT (runConduit conduit) $ ParseState mempty []
  print r

And then I created a 1.5 GB file (big.yaml). The memory still goes up as the program runs, but does so very slowly, and the program finishes before it becomes a problem (it ends up taking close to 3 GB, though). Trying to parse the entire file at once quickly used all of my available memory (31.2 GB). I was expecting the conduit version to take a roughly constant amount of memory. I wonder what causes the slow build-up. Could it be related to the parser state?
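One candidate for the slow build-up (an assumption on my part, not verified with a profiler): the anchor map kept in the parser state. YAML aliases (*name) may refer back to any earlier anchor (&name), so every anchored value has to be retained for the rest of the run, and the map only ever grows. A self-contained sketch of that retention pattern, with a toy Event type standing in for libyaml events:

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Toy stand-in for a libyaml event: a scalar that may carry an anchor.
data Event = Scalar (Maybe String) String

-- An alias may refer to any earlier anchor, so every anchored value is
-- kept for the rest of the run: the map never shrinks.
recordAnchors :: Map String String -> [Event] -> Map String String
recordAnchors = foldl step
  where
    step m (Scalar (Just name) val) = Map.insert name val m
    step m (Scalar Nothing _)       = m
```

A file with many distinct anchors would then grow the state linearly, even when the decoded Values themselves are consumed and dropped downstream.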

Thanks in advance for your review.

me <- lift CL.head
case me of
  Just (EventSequenceStart _ _ a) -> parseArrayStreaming 0 a
  _ -> liftIO $ throwIO $ UnexpectedEvent me $ Just $ EventSequenceStart NoTag AnySequence Nothing
Author


I added some values here for the "expected" event, hoping that would be more helpful than using Nothing.

-- In combination with 'Y.decodeFile', it can be used to consume
-- a YAML file that consists of a very large list of values.
parseStream :: ReaderT JSONPath (ConduitM Event Value Parse) ()
parseStream = do
Author


I can change the name of this if there's a better name for it.

Comment thread: yaml/yaml.cabal

 name: yaml
-version: 0.11.7.0
+version: 0.11.8.0
Author


I didn't change this myself. I don't know why it kept happening.

@snoyberg
Owner

I don't think this would make a good addition here at this time. Such an approach would make more sense as a separate library to allow the API design to evolve instead of trying to promote it to a stable package immediately.

One thought is that it seems to me like the ReaderT is misplaced. It would logically make more sense to me to have the ReaderT wrapping around the Parse monad inside the ConduitT instead. There may be some limitations I didn't think of that make that difficult/impossible.
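By analogy with plain transformers, the two placements can be sketched as follows; here a StateT over Identity stands in for the ConduitT/Parse stack, and outerStack/innerStack are illustrative names. Reader and state commute, so both stacks compute the same result; the difference is where the lifts end up:

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Reader (ReaderT, ask, runReaderT)
import Control.Monad.Trans.State.Strict (StateT, evalStateT, get, put)
import Data.Functor.Identity (Identity, runIdentity)

-- ReaderT on the outside, as in the PR's
-- ReaderT JSONPath (ConduitM Event Value Parse) ().
outerStack :: ReaderT [String] (StateT Int Identity) Int
outerStack = do
  path <- ask          -- reader ops are direct
  n <- lift get        -- inner-monad ops need a lift
  lift $ put (n + 1)
  pure (length path + n)

-- ReaderT on the inside, as in the suggested
-- ConduitM Event Value (ReaderT JSONPath Parse) ().
innerStack :: StateT Int (ReaderT [String] Identity) Int
innerStack = do
  path <- lift ask     -- now the reader op is the one lifted
  n <- get
  put (n + 1)
  pure (length path + n)

runOuter, runInner :: Int
runOuter = runIdentity $ evalStateT (runReaderT outerStack ["key"]) 41
runInner = runIdentity $ runReaderT (evalStateT innerStack 41) ["key"]
-- both evaluate to 42
```

With ReaderT on the inside, parseStream could fuse with .| directly, without the runReaderT parseStream [] seen in the example above.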

@Daniel-Diaz
Author

Daniel-Diaz commented May 15, 2022

> I don't think this would make a good addition here at this time. Such an approach would make more sense as a separate library to allow the API design to evolve instead of trying to promote it to a stable package immediately.
>
> One thought is that it seems to me like the ReaderT is misplaced. It would logically make more sense to me to have the ReaderT wrapping around the Parse monad inside the ConduitT instead. There may be some limitations I didn't think of that make that difficult/impossible.

OK. It might be a good idea, but I don't have the time right now to create a separate project. Perhaps another time. For now, I will just use my fork and merge any updates from your repository.

The reason I added it here instead of in my own code is that it uses internal functions that are not exported, and I also thought it might be useful to other people. I simply copied parseAll and modified it to yield the values downstream instead of gathering them in a list, so it's just a modification of one of the functions that's already there. This is also why ReaderT is where it is.

Thanks for the prompt reply!
