Design doc on static evidence for grouping/serialization #1886

@johnynek

@ttim has a design, and even a PR (#1857), to improve performance in scalding. The idea is to move towards requiring static evidence that the sorting needed for sort-partitioning data can be done directly on the binary form, without deserializing. It would also be interesting to optionally take static evidence that we can serialize the values as well.
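To make "evidence that we can sort without deserializing" concrete, here is a minimal sketch. All names (`BinaryOrdering`, `sortSerialized`, and so on) are hypothetical and chosen for illustration; they are not taken from PR #1857 or from scalding's API. The `Int` instance flips the sign bit and writes big-endian, so an unsigned byte-by-byte comparison agrees with the usual `Int` ordering.

```scala
// Hypothetical type class: static evidence that values of type A can be
// compared in serialized form. Not scalding's actual API.
trait BinaryOrdering[A] {
  def toBytes(a: A): Array[Byte]
  // Compare two serialized values using only their bytes.
  def compareBytes(x: Array[Byte], y: Array[Byte]): Int
}

object BinaryOrdering {
  // Unsigned lexicographic comparison of two byte arrays.
  def lexicographic(x: Array[Byte], y: Array[Byte]): Int = {
    val n = math.min(x.length, y.length)
    var i = 0
    var result = 0
    while (result == 0 && i < n) {
      result = (x(i) & 0xff) - (y(i) & 0xff)
      i += 1
    }
    if (result != 0) result else x.length - y.length
  }

  implicit val intBinaryOrdering: BinaryOrdering[Int] =
    new BinaryOrdering[Int] {
      def toBytes(a: Int): Array[Byte] = {
        // Flip the sign bit so negatives sort before positives as unsigned bytes.
        val f = a ^ Int.MinValue
        Array((f >>> 24).toByte, (f >>> 16).toByte, (f >>> 8).toByte, f.toByte)
      }
      def compareBytes(x: Array[Byte], y: Array[Byte]): Int =
        lexicographic(x, y)
    }
}

object BinarySortDemo {
  // The implicit parameter is the "static evidence": this only compiles
  // for key types K that have a BinaryOrdering instance in scope.
  def sortSerialized[K](keys: Seq[K])(implicit ev: BinaryOrdering[K]): Seq[Array[Byte]] =
    keys.map(ev.toBytes).sortWith((a, b) => ev.compareBytes(a, b) < 0)
}
```

With this shape, calling `BinarySortDemo.sortSerialized` on a key type without an instance fails at compile time rather than falling back to a slower path at runtime.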

We do have a PR, but this is probably the most major change to scalding since we introduced the typed API. I think we should write a short (few-page) Google doc and iterate on that to minimize the pain of adoption.

For instance, I think we could possibly introduce a SerializationProducer type, which is something like:

trait SerializationProducer[A] {
  // Build the concrete serializer once the job's Config and Mode are known.
  def build(conf: Config, mode: Mode): Serialization[A]
}

so we can defer building the actual serializers until just before job submission. In this way, we can get the config of the job to set serialization options. Something like this would be needed to support the current Kryo stuff, which has Config-based options.
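As a rough illustration of the deferred-build idea, the sketch below uses simplified stand-ins for `Config`, `Mode`, and `Serialization` so that it is self-contained; the real scalding types differ. The config key `example.string.charset`, the `Submission.materialize` helper, and the `String` instance are all made up for this example.

```scala
import java.nio.charset.Charset

// Simplified stand-ins for the real Config and Mode (assumptions for this sketch).
final case class Config(settings: Map[String, String])
sealed trait Mode
case object Local extends Mode
case object Hdfs extends Mode

// Hypothetical minimal serializer interface.
trait Serialization[A] {
  def write(a: A): Array[Byte]
  def read(bytes: Array[Byte]): A
}

trait SerializationProducer[A] {
  def build(conf: Config, mode: Mode): Serialization[A]
}

object SerializationProducer {
  // The serializer is not created here; only a recipe that reads a
  // job-level option when build is eventually called.
  implicit val stringProducer: SerializationProducer[String] =
    new SerializationProducer[String] {
      def build(conf: Config, mode: Mode): Serialization[String] = {
        val name = conf.settings.getOrElse("example.string.charset", "UTF-8")
        val charset = Charset.forName(name)
        new Serialization[String] {
          def write(a: String): Array[Byte] = a.getBytes(charset)
          def read(bytes: Array[Byte]): String = new String(bytes, charset)
        }
      }
    }
}

object Submission {
  // Hypothetical hook run just before job submission, when the Config is final.
  def materialize[A](conf: Config, mode: Mode)(implicit p: SerializationProducer[A]): Serialization[A] =
    p.build(conf, mode)
}
```

Pipeline code would carry only the `SerializationProducer[A]` evidence; config-dependent choices, such as Kryo options, would be resolved in one place at submission time.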
