This repository was archived by the owner on Jun 12, 2020. It is now read-only.
Merged
287 changes: 287 additions & 0 deletions docs/api.rst
@@ -0,0 +1,287 @@
===========
Treelib API
===========

Library for manipulating trees made up of Python dicts and lists.


Goals
=====

The primary goal of this library is to make it less unwieldy to manipulate trees
made up of Python dicts and lists.

For example, say we want to get a value deep in the tree. We could do this::

    value = tree['a']['b']['c']


That'll throw a ``KeyError`` if any of those bits are missing. So you could
handle that::

    try:
        value = tree['a']['b']['c']
    except KeyError:
        value = None


Alternatively, you could do this::

    value = tree.get('a', {}).get('b', {}).get('c', None)


These work, but both are unwieldy, especially if you do this a lot.

Similarly, setting a value deep in the tree is awkward. This raises a
``KeyError`` if ``a`` or ``a.b`` is missing::

    tree['a']['b']['c'] = 5


The safer form is this::

    tree.setdefault('a', {}).setdefault('b', {})['c'] = 5


This library aims to make sane use cases for tree manipulation easier to read
and think about.


Paths
=====

A path is a string specifying a period-delimited list of edges. Edges can be:

@erikrose erikrose Aug 11, 2017

You're rediscovering ClearSilver. I don't say that's good or bad—I've had some good times with ClearSilver—but you can look to it for inspirations and pitfalls.


1. a key (for a dict)
2. an index (for a list)

Example paths::

    a
    a.[1].foo_bar.Bar
    a.b.[-1].Bar


Paths can be composed using string operations since they're just strings.
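
As a small illustration of composing paths with plain string operations, here is a sketch. The variable names are made up for the example and aren't part of the proposed API.

```python
# Paths are plain strings, so ordinary string operations compose them.
base = "a.b"
leaf = "[-1].Bar"

# Join two path fragments with the period delimiter.
full_path = ".".join([base, leaf])

# Build a path from a list of edges.
edges = ["a", "[1]", "foo_bar", "Bar"]
built_path = ".".join(edges)

# Split a path back into its edges.
split_edges = built_path.split(".")
```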

FIXME(willkg): Add diagram showing a tree with edges specified by a path.

Owner Author

Paths are based on, but not as expressive as, jq filters: https://stedolan.github.io/jq/tutorial/


Key
---

Keys are identifiers that are:

1. composed entirely of ASCII alphanumeric characters, hyphens, and underscores
2. at least one character long

For example, these are all valid keys::

    a
    foo
    FooBar
    Foo-Bar
    foo_bar


Index
-----

Indexes indicate a 0-based list index. They are:

@erikrose erikrose Aug 11, 2017

What if a dict key is "0"? Not a problem because you're eventually using []? Seems like you still might have trouble because you're saying some_dict[0] rather than some_dict["0"].

Ah, you're wrapping the path elements in [] all the time. That solves it.


1. integers
2. wrapped in ``[`` and ``]``
3. can be negative

For example, these are all valid indexes::

    [0]
    [1]
    [-50]
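
The key and index rules above could be checked with a couple of regular expressions. This is only a sketch: ``parse_path``, ``KEY_RE``, and ``INDEX_RE`` are hypothetical names, and rejecting malformed edges with ``ValueError`` is an assumption, not something this document specifies.

```python
import re

# Keys: one or more ASCII alphanumerics, hyphens, or underscores.
KEY_RE = re.compile(r"[A-Za-z0-9_-]+")
# Indexes: an optionally negative integer wrapped in [ and ].
INDEX_RE = re.compile(r"\[(-?[0-9]+)\]")


def parse_path(path):
    """Split a path into edges: str for keys, int for indexes."""
    edges = []
    for part in path.split("."):
        index_match = INDEX_RE.fullmatch(part)
        if index_match:
            edges.append(int(index_match.group(1)))
        elif KEY_RE.fullmatch(part):
            edges.append(part)
        else:
            # Assumption: malformed edges are rejected outright.
            raise ValueError("invalid edge %r in path %r" % (part, path))
    return edges
```

Because indexes are always wrapped in brackets, an edge like ``1`` parses as the string key ``'1'`` and never as a list index.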


API
===

.. py:function:: tree_get(tree, path, default=None)

    Given a tree consisting of dicts and lists, returns the value specified by
    the path.

    Some things to know about ``tree_get()``:

    1. It doesn't alter the tree.
    2. Once it hits an edge that's missing, it returns the default.

    Examples:

    >>> tree_get({'a': 1}, 'a')
    1
    >>> tree_get({'a': 1}, 'b') is None
    True
    >>> tree_get({'a': {'b': 2}}, 'a.b')
    2
    >>> tree_get({'a': {'b': 2}}, 'a.b.c', default=55)
    55
    >>> tree_get({'a': {'1': 2}}, 'a.1')
    2
    >>> tree_get({'a': [1, 2, 3]}, 'a.[1]')
    2
    >>> tree_get({'a': [{}, {'b': 'foo'}]}, 'a.[1].b')
    'foo'
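
To make the behavior concrete, here is a minimal sketch of how ``tree_get`` might be written. It is not the library's implementation. It assumes that an edge of the wrong kind for the current node (a key applied to a list, say) counts as missing, and ``INDEX_RE`` is a hypothetical helper.

```python
import re

# Hypothetical helper: matches an index edge such as [0] or [-50].
INDEX_RE = re.compile(r"\[(-?[0-9]+)\]")


def tree_get(tree, path, default=None):
    """Return the value at path, or default once any edge is missing."""
    node = tree
    for edge in path.split("."):
        match = INDEX_RE.fullmatch(edge)
        try:
            if match:
                # Index edges only apply to lists.
                if not isinstance(node, list):
                    return default
                node = node[int(match.group(1))]
            else:
                # Key edges only apply to dicts.
                if not isinstance(node, dict):
                    return default
                node = node[edge]
        except (KeyError, IndexError):
            return default
    return node
```

The function only reads from the tree, so the tree is never altered.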


.. py:function:: tree_set(tree, path, value, mutate=True, create_missing=False)

    Given a tree consisting of dicts and lists, sets the item specified by path
    to the specified value.

    If one of the edges doesn't exist, then this raises either a ``KeyError``
    for dicts or an ``IndexError`` for lists.

    :arg boolean mutate: If ``mutate`` is ``True`` (the default), then this
        changes the tree in place and returns the mutated tree.

        If ``mutate`` is ``False``, then this does a deepcopy of the tree,
        changes the copy, and returns the copy. This is expensive.

    :arg boolean create_missing: If ``create_missing`` is ``False`` (the default),
        then this will raise a ``KeyError`` for failed dict keys and
        ``IndexError`` for failed list indexes.

        If ``create_missing`` is ``True``, and this isn't
        the last item in the path, then this will create the intermediary
        dict/list.

        If the next edge is a key, it'll create a dict. If the next edge is an
        index, then it'll create a list filling in ``None`` for the required
        indices.

    Here are some examples.

    This sets ``a`` to 5. This isn't affected by ``create_missing``.

    >>> tree_set({}, 'a', value=5, create_missing=True)
    {'a': 5}
    >>> tree_set({}, 'a', value=5, create_missing=False)
    {'a': 5}

    This tries to traverse ``a``, but it doesn't exist and it's not the last
    edge in the path. The next edge is ``b``, which is a key, so it first sets
    ``a`` to an empty dict, then proceeds.

    >>> tree_set({}, 'a.b', value=5, create_missing=True)
    {'a': {'b': 5}}

    This tries to traverse ``a``, but it doesn't exist and it's not the last
    edge in the path. The next edge is ``[2]``, which is an index, so it first
    sets ``a`` to a list of 3 ``None`` values, then proceeds.

    >>> tree_set({}, 'a.[2]', value=5, create_missing=True)
    {'a': [None, None, 5]}

    This is similar, but with a negative index.

    >>> tree_set({}, 'a.[-1]', value=5, create_missing=True)
    {'a': [5]}

    This creates missing indices in an existing list.

    >>> tree_set({'a': []}, 'a.[2]', value=5, create_missing=True)
    {'a': [None, None, 5]}


    Examples:

    These don't mutate the tree:

    >>> tree = {'a': {'b': {'c': 1}}}
    >>> tree_set(tree, 'a', value=5, mutate=False)
    {'a': 5}
    >>> tree_set(tree, 'a.b.c', value=[], mutate=False)
    {'a': {'b': {'c': []}}}

    These raise errors if an edge is missing:

    >>> tree_set({}, 'a.b.c', value=5)
    KeyError ...
    >>> tree_set({'a': []}, 'a.[1].b', value=5)
    IndexError ...

    These create missing edges and indexes:

    >>> tree_set({}, 'a.b.c', value=5, create_missing=True)
    {'a': {'b': {'c': 5}}}
    >>> tree_set({}, 'a.[1].b', value=5, create_missing=True)
    {'a': [None, {'b': 5}]}
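
Here is one way the ``mutate`` and ``create_missing`` rules might fit together. It is a rough sketch under stated assumptions, not the library's code. ``_parse_path`` and ``_pad`` are hypothetical helpers, and the sketch assumes that a ``None`` slot in a list counts as missing when ``create_missing`` is ``True``.

```python
import copy
import re

INDEX_RE = re.compile(r"\[(-?[0-9]+)\]")


def _parse_path(path):
    """Hypothetical helper: split a path into str keys and int indexes."""
    edges = []
    for part in path.split("."):
        match = INDEX_RE.fullmatch(part)
        edges.append(int(match.group(1)) if match else part)
    return edges


def _pad(items, index):
    """Hypothetical helper: pad a list with None until index is valid."""
    needed = index + 1 if index >= 0 else -index
    items.extend([None] * (needed - len(items)))


def tree_set(tree, path, value, mutate=True, create_missing=False):
    if not mutate:
        # Work on a deep copy so the caller's tree is untouched.
        tree = copy.deepcopy(tree)
    edges = _parse_path(path)
    node = tree
    for pos, edge in enumerate(edges):
        is_index = isinstance(edge, int)
        if is_index and not isinstance(node, list):
            raise IndexError("%r is not a list index here" % edge)
        if not is_index and not isinstance(node, dict):
            raise KeyError(edge)
        if is_index and create_missing:
            _pad(node, edge)
        if pos == len(edges) - 1:
            node[edge] = value  # IndexError if a list index is out of range
            break
        if is_index:
            # Assumption: a None slot is treated as missing when creating.
            missing = node[edge] is None and create_missing
        else:
            if edge not in node and not create_missing:
                raise KeyError(edge)
            missing = edge not in node
        if missing:
            # The next edge decides which container to create.
            node[edge] = [] if isinstance(edges[pos + 1], int) else {}
        node = node[edge]
    return tree
```

Setting the last edge of a dict never needs ``create_missing``, which matches the first pair of examples above.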


.. py:function:: tree_flatten(tree)

    Flattens a tree into a dict keyed by path.

    >>> tree_flatten({'a': 1})
    {'a': 1}
    >>> tree_flatten({'a': {'b': 1, 'c': 2}})
    {'a.b': 1, 'a.c': 2}
    >>> tree_flatten({'a': [{'b': 1}, {'c': 2}]})
    {'a.[0].b': 1, 'a.[1].c': 2}

    .. Note::

        At this point, a flattened tree can't be used with ``tree_get`` and
        ``tree_set``.
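
A recursive sketch of flattening, for illustration only. The ``_prefix`` parameter is a hypothetical addition used for the recursion and isn't part of the documented signature. How empty dicts and lists should be flattened isn't specified here; this sketch simply drops them.

```python
def tree_flatten(tree, _prefix=""):
    """Flatten nested dicts and lists into a dict keyed by path."""
    if isinstance(tree, dict):
        items = [(str(key), value) for key, value in tree.items()]
    elif isinstance(tree, list):
        # List positions become index edges such as [0].
        items = [("[%d]" % pos, value) for pos, value in enumerate(tree)]
    else:
        # A leaf value: map the path built so far to it.
        return {_prefix: tree}

    flat = {}
    for edge, value in items:
        path = edge if not _prefix else _prefix + "." + edge
        flat.update(tree_flatten(value, path))
    return flat
```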


.. py:function:: tree_setdefault(tree, default_tree)

FIXME


.. py:function:: tree_validate(tree, schema)

For your consideration, my favorite Python schema lib is https://pypi.python.org/pypi/schema/, by the docopt guy. I've done some work on it. I use it in DXR. I've made some design proposals that I think would result in it being an elegant solution rather than being a bit slapdash in places.

Owner Author

Schema looks interesting. I'll keep that in mind.


FIXME

Owner Author

I wrote a schema system for dicts for pyvideo a while back.

https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/schemalib.py

With an example schema here:

https://github.com/pyvideo/old-pyvideo-data/blob/master/src/clive/pyvideo_schema.py

Pretty sure there are other schema systems out there, too.

Not sure we'd need this in the first version of treevert.

Is there a schema system in use in the data pipeline? Do they already have a tool for it?

Owner Author

As far as I've seen, the bulk of the telemetry data pipeline is not in Python. So I'm not sure we could use anything they've made without switching languages.

I'll ask around, though.



.. py:function:: tree_traverse(tree, fun)

FIXME

Owner Author

Seems like a good idea to have a traversal system, but we wouldn't need this in the first version of treevert.

I would wait on this. Maybe on the flattening API, also, unless we had a clear use case for it up front.



Research and Inspirations
=========================

Python ``defaultdict``
----------------------

Python has a defaultdict

https://docs.python.org/3/library/collections.html#defaultdict-objects

This doesn't handle lists and dicts well, though.

We'd have to either create the original data structure as a defaultdict, or
convert it to one.

If you try to get something deep from a defaultdict, it mutates the
structure.
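
A short demonstration of that mutation, using a recursively nested ``defaultdict``. The ``nested`` factory is just a name chosen for this example.

```python
from collections import defaultdict


def nested():
    """Build a defaultdict whose missing keys become more defaultdicts."""
    return defaultdict(nested)


tree = nested()
size_before = len(tree)

# A read-only lookup of a deep path...
_ = tree["a"]["b"]["c"]

# ...has inserted the intermediate keys as a side effect.
size_after = len(tree)
```

After the lookup, ``tree`` contains ``a``, ``tree['a']`` contains ``b``, and so on, even though nothing was assigned.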

It doesn't easily support composable paths.


jq processor
------------

jq has interesting filter syntax.

https://stedolan.github.io/jq/manual/#Basicfilters


Creating a new subclass of Python ``dict``
------------------------------------------

We could do that and add ``get_path`` and ``set_path``, but I wonder if we can
get the utility we want without having to box/unbox data.

If we're just working with dicts and lists and standard Python things, then
``json.dumps`` and other things just work without us having to do anything about

Are you familiar with custom json.dumps and loads decoders and encoders? You can arrange it so you get fancier, custom types out rather than vanilla dicts.

Owner Author

We do that for datetimes, but that feature of json loads/dumps doesn't solve the big problem. json.dumps here is just an example.

The big problem is that in order to change the data structure of raw_crash and processed_crash, we have to change all the code that touches those things. If the data structure is a subclass of dict, that helps, but it's not a panacea. Switching data structures is a huge change that'd take a while to do and it's sort of all-or-nothing.

One of my hopes for treevert is that it gives us an interim step that we can convert to incrementally in the short term without having to commit to a project-wide rewrite.

them.