2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
<!--
Thank you for contributing to keyvi!

Before submission, please ensure that you have read and agree to our
contributor guidelines: https://github.com/KeyviDev/keyvi/blob/master/CONTRIBUTING.md.

Please delete these lines.
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -28,7 +28,9 @@ repos:
- id: check-toml
- id: check-yaml
- id: end-of-file-fixer
exclude: '^(.*\.svg)$'
- id: trailing-whitespace
exclude: '^(.*\.svg)$'
- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: "v0.14.0"
hooks:
1 change: 0 additions & 1 deletion LICENSE
@@ -199,4 +199,3 @@
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

4 changes: 2 additions & 2 deletions doc/RELEASE_PROCESS.md
@@ -10,15 +10,15 @@ Create a release branch called `release-X.Y.Z`
- Commit to `release-X.Y.Z` and push it to https://github.com/KeyviDev/keyvi/
- Wait for CI to build all targets

### Create tag
- Draft a new release tagged vX.Y.Z with `release-X.Y.Z` as the target branch
- Add the release notes in the description with references to PRs
- Publish release

## On the `master` branch

### Update the `python/setup.py` file
- Update to the next release version
```
VERSION_MAJOR = X
VERSION_MINOR = Y
79 changes: 39 additions & 40 deletions doc/algorithm/Construction-Basics.md
@@ -1,6 +1,6 @@
## Introduction

“An automaton (plural: automata) is a self-operating machine. The word is sometimes used to describe a robot, more
specifically an autonomous robot. Used colloquially, it refers to a mindless follower.” (Wikipedia)

### Minimal Acyclic Finite State Automata
@@ -19,7 +19,7 @@ Minimizing yields the FSA:
keyvi uses so-called "incremental construction"; the alternative is non-incremental algorithms. If you are curious, there are
some instructional classes available on YouTube.

keyvi is only about text/string automata. There are other use cases for finite state techniques, e.g. modeling
real control flows.

### Incremental Construction by Watson/Daciuk
@@ -34,19 +34,19 @@ real control flows.
        replace_or_register(LastState)
        add_suffix(LastState, CurrentSuffix)
    replace_or_register(q0)

    func replace_or_register(State):
        Child = last_child(State)
        if has_children(Child):
            replace_or_register(Child)
        if Register.find(Child):
            last_child = Register[Child]
        else:
            Register.add(Child)

![Construction example Daciuk/Watson](/doc/images/daciuk_watson.png)
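The pseudocode above can be made concrete. The following is a minimal, runnable Python sketch of the Daciuk/Watson incremental construction (all names are illustrative; keyvi's actual implementation is C++):

```python
class State:
    """One automaton state: a 'final' flag plus outgoing transitions."""
    def __init__(self):
        self.final = False
        self.children = {}  # label -> State

    def signature(self):
        # Two states are equal iff they have the same right language;
        # comparing (final, sorted transitions) bottom-up is enough.
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.children.items())))


def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def build_dawg(sorted_words):
    """Daciuk/Watson incremental construction from sorted input."""
    register = {}  # signature -> canonical State
    root = State()

    def replace_or_register(state, suffix):
        if not suffix:
            return
        child = state.children[suffix[0]]
        replace_or_register(child, suffix[1:])         # minimize bottom-up
        sig = child.signature()
        if sig in register:
            state.children[suffix[0]] = register[sig]  # replace
        else:
            register[sig] = child                      # register

    previous = ""
    for word in sorted_words:
        p = common_prefix_len(previous, word)
        node = root
        for c in word[:p]:                       # same path as previous[:p]
            node = node.children[c]
        replace_or_register(node, previous[p:])  # previous word's tail is done
        for c in word[p:]:                       # add_suffix
            nxt = State()
            node.children[c] = nxt
            node = nxt
        node.final = True
        previous = word
    replace_or_register(root, previous)
    return root


def accepts(root, word):
    node = root
    for c in word:
        node = node.children.get(c)
        if node is None:
            return False
    return node.final
```

For `["abc", "abd", "bbc", "bbd"]` this yields a 4-state automaton: the `c`/`d` suffix states and the two identical branch states are shared.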

### Incremental Construction in Keyvi

#### Algorithm

Expand All @@ -55,17 +55,17 @@ real control flows.
    while (another word):
        new_word = next word in lexicographic order
        common_prefix = common_prefix(current_word, new_word)
        feed_stack(current_word, length(common_prefix), length(current_word))
        consume_stack(length(common_prefix))
        current_word = new_word
    feed_stack(current_word, 0, length(current_word))
    consume_stack(0)

    func feed_stack(word, begin, end):
        for (i = begin; i < end; ++i):
            unpacked_state_stack.insert(i, word[i])
        unpacked_state_stack.insert(end, "final")

    func consume_stack(end):
        while (highest_stack > end):
            stack_entry = unpacked_state_stack.pop(highest_stack)
@@ -75,9 +75,9 @@ real control flows.

Code:
General entry point: [generator](/keyvi/src/cpp/dictionary/fsa/generator.h)

Unpacked_State_Stack: [unpacked_state_stack](/keyvi/src/cpp/dictionary/fsa/internal/unpacked_state_stack.h)
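To see what the feed/consume loop does, here is a toy Python simulation (not keyvi's code) that records the depth at which each unpacked state is consumed, i.e. persisted. Deeper suffix states are always consumed before shallower ones, and the root state (depth 0) comes last:

```python
def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def consume_order(sorted_words):
    """Simulate the feed/consume loop; return the depths at which
    unpacked states are consumed, in consumption order."""
    consumed = []  # depths, in persist order
    stack = []     # stack[i] ~ unpacked state at depth i

    def feed_stack(word, begin, end):
        while len(stack) < end:
            stack.append(None)
        for i in range(begin, end):
            stack[i] = (i, word[i])  # depth + outgoing transition label

    def consume_stack(end):
        while len(stack) > end:
            consumed.append(stack.pop()[0])

    current = ""
    for new_word in sorted_words:
        p = common_prefix_len(current, new_word)
        feed_stack(current, p, len(current))
        consume_stack(p)
        current = new_word
    feed_stack(current, 0, len(current))
    consume_stack(0)
    return consumed
```

For the input `["aabc", "aabde", "abde"]` the consumption order is `[3, 4, 3, 2, 1, 3, 2, 1, 0]` — each run of pops walks from the deepest state back toward the root, and the root is written at the very end.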

#### Illustration

Building a tiny automaton containing just 4 strings:
@@ -88,47 +88,47 @@ Building a tiny automata containing just 4 strings:
abe

## Step1

![Step1](/doc/images/construction_step1.png)

## Step2

![Step2](/doc/images/construction_step2.png)

## Step3

![Step3](/doc/images/construction_step3.png)

## Step4

![Step4](/doc/images/construction_step4.png)

## Step5

![Step5](/doc/images/construction_step5.png)

## Step6

![Step6](/doc/images/construction_step6.png)

## Step7

![Step7](/doc/images/construction_step7.png)
#### Summary

The FSA is built from "right to left", the root state is written last.
@@ -138,4 +138,3 @@ The FSA is built from "right to left", the root state is written last.
- use sorted data characteristic: compare only the last two words
- no temporary state creation as with replace_or_register, which can be problematic depending on the underlying data structure (e.g. Sparse Array)
- no recursion (as in replace_or_register)

6 changes: 3 additions & 3 deletions doc/algorithm/Extensibility.md
@@ -1,20 +1,20 @@
## Extensibility

The keyvi compiler is implemented in C++11 and uses templates to allow customization, like having a different
persistence layer, different minimization, etc.

The most useful customization are different value types:

### Value types

Keys are always strings, values can be of any type, even nested types. Built-in types at time of writing are no-value,
integer, strings and json.

Value types have to implement a ["duck-type"](http://en.wikipedia.org/wiki/Duck_typing) interface.

Code: [IValue_Store](/keyvi/src/cpp/dictionary/fsa/internal/ivalue_store.h)

In a nutshell, writing a new value store entails: serialization of the value, the interface to the compiler and the
deserialization for the lookup.

Note: The compiler interface expects an ID for each unique value, the ID is used for minimization.
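As an illustration of that contract, here is a hypothetical Python analogue (the real interface is the C++ `IValueStore`; the method names below are invented for this sketch): each value is serialized once, deduplicated, and assigned a stable ID that the compiler can use for minimization.

```python
import json


class JsonValueStore:
    """Hypothetical sketch of a value store: serialize on write,
    deduplicate, hand out one ID per unique value."""

    def __init__(self):
        self._values = []  # id -> serialized bytes
        self._ids = {}     # serialized bytes -> id

    def add_value(self, value):
        """Serialize `value`; return its unique, deduplicated ID."""
        blob = json.dumps(value, sort_keys=True).encode()
        if blob not in self._ids:
            self._ids[blob] = len(self._values)
            self._values.append(blob)
        return self._ids[blob]

    def get_value(self, value_id):
        """Deserialize a stored value for the lookup side."""
        return json.loads(self._values[value_id].decode())
```

Note how `sort_keys=True` canonicalizes JSON objects, so two logically equal values serialize identically and share one ID — exactly what minimization needs.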
20 changes: 10 additions & 10 deletions doc/algorithm/Minimization.md
@@ -5,10 +5,10 @@ new state.

### keyvi Minimization

Minimization is implemented using a hash table. Each state that is written is inserted into the hash table. Before
persisting a new state, we try to find an equal state in the hashtable.

Code:

Entry point of minimization: [sparse_array_builder](/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_builder.h)
Minimization Hashtable: [minimization_hash](/keyvi/src/cpp/dictionary/fsa/internal/minimization_hash.h)
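The core idea fits in a few lines of Python. This is a simplification: the real hashtable stores compact 12-byte fingerprints rather than full state keys, and the function name here is invented.

```python
def write_state(register, storage, transitions, final):
    """Persist a state unless an equal one was written before; return
    its offset. `register` plays the role of the minimization hashtable."""
    key = (final, tuple(sorted(transitions.items())))
    if key in register:
        return register[key]  # equal state found: reuse its offset
    offset = len(storage)
    storage.append(key)       # stand-in for serializing the state
    register[key] = offset
    return offset
```

Two states with the same transitions and final flag collapse to one persisted offset; a state that differs in any transition gets a fresh one.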
@@ -17,7 +17,7 @@ The Hashtable in keyvi has a very small footprint of 12 bytes per entry.

## Getting best Compression ratios

Minimization/Compression is dependent on the data. FSA's are mainly used in computational linguistics; one of the reasons is that
FSA's make use of the high ambiguity in languages.

![Compression](/doc/images/compression.png)
@@ -27,28 +27,28 @@ FSA's make use of high ambiguity in languages.
Therefore having "natural language keys" yields compression, both at prefix as well as suffix side. "Binary keys", e.g.
fingerprints, are pretty bad in terms of compression.

Note that prefix compression is basically the same as in a trie.

### Suffix and Value Compression

In contrast to a trie the FSA compresses suffixes as well. But note: the value is part of the suffix, as it is attached
to the "final state". Therefore sparse values yield the best results, while totally unique values result in
no suffix compression.

### Improving Compression Rate / Reducing Size

Normalize keys to gain better prefix compression.

Think about your values and reduce the value space if possible. For example: if you store integer values, think about their
range. Normalizing the integers reduces the number of unique values and therefore improves the compression ratio.
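For example, quantizing integer values into buckets collapses many distinct values into a few (the `normalize_score` helper below is hypothetical, not part of keyvi):

```python
def normalize_score(score, bucket_size=10):
    """Quantize a value so that near-identical values become equal,
    letting the value store and suffix minimization share them."""
    return (score // bucket_size) * bucket_size


raw = [101, 103, 108, 994, 997]  # 5 unique values
# After normalization only 2 unique values remain: 100 and 990.
```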

Take minimization into account: permutations and repetitions are a strength of the algorithm, e.g. storing tons of almost
identical keys pointing to the same value. In other data structures this can cause huge memory usage, FSA's are good in
minimizing that.

### Check Questions

1. You want to write a Date Extractor which can extract all dates of the format "YYYY-MM-DD". Estimate the size
requirement.

2. Now assign a counter (incremented each time) to each key. What happens? What about your size estimate?
9 changes: 4 additions & 5 deletions doc/algorithm/Persistence-Basics.md
@@ -1,23 +1,23 @@
## Persistence Introduction

The default persistence is implemented as sparse array (sparse table).

Code: [sparse_array_persistence](/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_persistence.h)

### Sparse Array in a nutshell

The underlying data structure consists of 2 simple arrays of the same length (not size): a byte array and a
pointer array (e.g. uint32_t).

![SparseArraySingleState](/doc/images/sparse_array_single_state.png)

A lookup starts at a given offset; it succeeds if the numeric value (e.g. ASCII value) is found in the bucket
defined by the sum of the offset and the numeric value.

![SparseArrayPointer](/doc/images/sparse_array_pointer.png)

This check is required to allow interleaving of state vectors. To save space vectors of states are interleaved:

![SparseArrayInterleaved](/doc/images/sparse_array_mixed.png)

Even a brute-force method that interleaves state vectors yields a very good compression rate.
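A minimal Python sketch of such a lookup (illustrative only; keyvi stores the label and pointer arrays far more compactly):

```python
def follow(labels, pointers, state_offset, char):
    """Follow one transition in the interleaved sparse array. The label
    check proves the bucket belongs to this state and not to an
    interleaved neighbour."""
    bucket = state_offset + ord(char)
    if bucket < len(labels) and labels[bucket] == char:
        return pointers[bucket]  # offset of the target state
    return None                  # no such transition


# A state starting at offset 0 with transitions 'a' -> 100 and 'c' -> 200:
labels = [None] * 128
pointers = [0] * 128
labels[ord("a")], pointers[ord("a")] = "a", 100
labels[ord("c")], pointers[ord("c")] = "c", 200
```

A lookup for `'b'` from offset 0 lands in a bucket whose label is not `'b'`, so it correctly fails even if another state's data occupies that bucket.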
@@ -30,4 +30,3 @@ The algorithm tries to find space in the existing sparse array:
![SparseArrayPacking](/doc/images/sparse_array_packing.png)

Code: [sparse_array_building](/keyvi/src/cpp/dictionary/fsa/internal/sparse_array_builder.h)

20 changes: 10 additions & 10 deletions doc/algorithm/Scaling.md
@@ -5,7 +5,7 @@ This page describes a number of performance and scaling tricks to make it possib
### Sorting

The construction algorithm requires sorted input. To be able to create a dictionary out of millions of keys we apply
external memory sorting. Fortunately "sorting" huge lists is not a problem these days. keyvi uses
[TPIE](http://madalgo.au.dk/tpie/) for external merge sort.
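The idea in miniature, as a toy Python sketch (TPIE keeps the sorted runs on disk and streams them, which this sketch does not):

```python
import heapq
import itertools


def external_sort(keys, run_size=3):
    """Toy external merge sort: sort bounded-size runs independently
    (written to temp files in a real implementation), then k-way merge."""
    it = iter(keys)
    runs = []
    while True:
        run = list(itertools.islice(it, run_size))
        if not run:
            break
        runs.append(sorted(run))  # one sorted 'run'
    return list(heapq.merge(*runs))
```

Because `heapq.merge` consumes the runs lazily, only one element per run needs to be in memory at a time — which is exactly why external sorting scales to inputs far larger than RAM.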

Note: Map-Reduce also sorts data using external memory sort, so using Map-Reduce with 1 Reducer would also give you
@@ -18,27 +18,27 @@ Code: [dictionary_compiler](/keyvi/src/cpp/dictionary/dictionary_compiler.h)

### Minimization

For each state the compiler stores a fingerprint of the state in the hashtable. Although a fingerprint takes only
12 bytes, the hashtable would not fit into main memory if you have lots of keys.

Therefore keyvi uses several hashtables organized by an LRU (Least Recently Used) cache:

The 1st hashtable is filled with a limited number of entries; once full, a new hashtable is created. If the number of
hashtables reaches the limit, the last hashtable is thrown away.

To keep "good hashes": each entry found by a successful lookup in a lower hashtable is moved to the top hashtable. Therefore
states which often minimize stay in memory, while states which do not minimize are thrown away over time.

Code: [LRU Cache](/keyvi/src/cpp/dictionary/fsa/internal/lru_generation_cache.h)
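A hypothetical sketch of that scheme (generation sizes and the promotion policy are simplified relative to keyvi's `lru_generation_cache.h`):

```python
class LRUGenerationCache:
    """Several bounded hashtables ('generations'): the oldest one is
    dropped when too many exist, and a hit in an old generation
    promotes the entry into the newest one."""

    def __init__(self, max_generations=3, generation_size=1000):
        self.max_generations = max_generations
        self.generation_size = generation_size
        self.generations = [{}]  # newest generation is last

    def insert(self, key, value):
        if len(self.generations[-1]) >= self.generation_size:
            self.generations.append({})  # start a fresh generation
            if len(self.generations) > self.max_generations:
                self.generations.pop(0)  # drop the oldest one
        self.generations[-1][key] = value

    def find(self, key):
        for generation in reversed(self.generations):
            if key in generation:
                value = generation.pop(key)
                self.insert(key, value)  # promote: keep 'good hashes'
                return value
        return None
```

Entries that are looked up often keep getting promoted and survive; entries that never produce a hit age out with their generation — bounding memory while keeping the fingerprints that actually help minimization.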

### Compilation/Index Performance

Apart from low-level optimizations like avoiding object copies, pooling, short string optimization, good hash function etc.,
keyvi uses some optimization on the algorithm side.

#### Minimization Stop

As described in Construction, the FSA is built from 'right to left'; minimization only works this way. Once a minimization
fails it is impossible to minimize the parent state. Therefore we stop minimization of the preceding states after the first failure.
Note: we still store the fingerprints in the hashtable for later minimizations.

@@ -48,9 +48,9 @@ Note: The amount of memory is configurable in the compiler. Increasing the limit

#### Packing

Sparse Array Construction is one of the most demanding parts. To speed up compilation we make use of bit vectors,
sliding windows and the [De Bruijn](http://en.wikipedia.org/wiki/De_Bruijn_sequence) sequence to quickly find spots to pack
the data, or - if available - intrinsic compiler/CPU functions.

Code: [BitVector](/keyvi/src/cpp/dictionary/fsa/internal/bit_vector.h)
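One classic use of a De Bruijn sequence is the bit-scan below: finding the index of the lowest set bit of a word — e.g. the first marked slot in a bit-vector word — with one multiplication and a table lookup instead of a loop. This Python sketch shows the 32-bit variant (the intrinsic equivalent would be something like `__builtin_ctz`); it is an illustration of the trick, not keyvi's BitVector code:

```python
DEBRUIJN32 = 0x077CB531  # 32-bit De Bruijn sequence: every 5-bit window is unique
INDEX32 = [0] * 32
for i in range(32):
    # Precompute which table slot each power of two maps to.
    INDEX32[((DEBRUIJN32 << i) & 0xFFFFFFFF) >> 27] = i


def lowest_set_bit(v):
    """Index of the least-significant set bit of a non-zero 32-bit word.
    (v & -v) isolates that bit; multiplying by the De Bruijn constant
    shifts a unique 5-bit pattern into the top bits."""
    return INDEX32[(((v & -v) * DEBRUIJN32) & 0xFFFFFFFF) >> 27]
```

Because every 5-bit window of the constant is distinct, each of the 32 possible isolated bits produces a different top-5-bit pattern, so the table lookup is collision-free.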

9 changes: 4 additions & 5 deletions doc/usage/Building keyvi dictionaries with python.md
@@ -7,15 +7,15 @@ The compiler is also available from python keyvi:

# repeat for every key
compiler.Add("foo")

# finally compile
compiler.Compile()
compiler.WriteToFile("/tmp/test.kv")

Other available compilers in `keyvi.compiler`:

type | details
----------------- | ---------------------------------------------------------------------------------------------
integer | CompletionDictionaryCompiler
key-only | KeyOnlyDictionaryCompiler
string | StringDictionaryCompiler
@@ -24,8 +24,7 @@ json | JsonDictionaryCompiler
For dictionaries with values, Add takes the value as second parameter:

compiler.Add("foo", 42)

To ensure that you do not run out of disk space while compiling, set $TMPDIR to a disk with enough free space.

export TMPDIR=/mnt/tmp