Sterne, take 2

@bmschmidt bmschmidt released this 30 Jul 20:23
· 14 commits to dev since this release

Note: As described below, this version will not work without creating a new database called bookworm_scratch and configuring it properly. You can do this automatically by typing "python OneClick.py doctor" in a new clone of the Presidio repository with a bookworm.cnf file already defined.

It's being left only on dev for the time being for that reason.

0.4-alpha

This version makes a major change to the underlying architecture of queries: instead of using derived tables, all intermediate queries are now stored as temporary tables. This may cost some RAM, but it is dramatically faster for most queries on very large databases. (For instance, with a 6m-document Chronicling America db, some queries that previously took about 5-10 seconds now take 0.5 seconds.)

These gains come primarily from better caching: the component parts of subqueries were not previously being cached, but now they are. (There are also some gains on very large results from indexes.) So the improvements won't show up the first time you redefine the corpus for a query, but should for subsequent ones.
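To make the difference between the two query shapes concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (catalog, counts, corpus, bookid) are invented for illustration; they are not Bookworm's actual schema. It only shows how a corpus definition moves from an inline derived table to a separately stored temporary table, and does not reproduce MySQL's query cache behavior.

```python
import sqlite3

# In-memory database with two hypothetical tables (not Bookworm's real schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog (bookid INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE counts (bookid INTEGER, word TEXT, n INTEGER);
    INSERT INTO catalog VALUES (1, 1900), (2, 1900), (3, 1950);
    INSERT INTO counts VALUES (1, 'whale', 4), (2, 'whale', 1), (3, 'whale', 7);
""")

# Old shape: the corpus definition is a derived table inside the main query.
derived = conn.execute("""
    SELECT SUM(c.n)
    FROM counts AS c
    JOIN (SELECT bookid FROM catalog WHERE year = 1900) AS corpus
      ON c.bookid = corpus.bookid
    WHERE c.word = 'whale'
""").fetchone()[0]

# New shape: the corpus is stored once as a temporary table,
# and later queries join against it by name.
conn.execute("""
    CREATE TEMPORARY TABLE corpus AS
    SELECT bookid FROM catalog WHERE year = 1900
""")
staged = conn.execute("""
    SELECT SUM(c.n)
    FROM counts AS c
    JOIN corpus ON c.bookid = corpus.bookid
    WHERE c.word = 'whale'
""").fetchone()[0]

print(derived, staged)
```

Both queries return the same total; the second form simply separates the corpus-definition step from the counting step, which is the part the notes above describe as now being cacheable.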

To work, existing bookworm installations will need to change two things:

  1. You need to create a new database called bookworm_scratch, with read and write privileges for the non-admin user. This scratch DB is being used instead of the bookworm's own db to keep the edits sandboxed from the main bookworm installation. This can be done in a single command, python OneClick.py doctor, from any bookworm installation with the latest version of Presidio installed.
  2. You need to make sure your query cache is working properly; MySQL 5.6 changed its defaults from 5.5 so that the cache was generally off. The automatic setup script in Presidio in /etc/mysqlSetup will handle this, or you can do it by hand. As always, restarting a server takes some overhead in recreating the memory tables. Some decent values are:
query_cache_limit = 1M
query_cache_size = 32M
query_cache_type = 1
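
If you configure the cache by hand, one way to confirm it took effect is to inspect what the server reports for SHOW VARIABLES LIKE 'query_cache%'. The helper below is a hypothetical sketch, not part of Presidio: it assumes you have already fetched those rows with whatever MySQL client you use and turned them into a dict of strings. MySQL reports query_cache_type as OFF, ON, or DEMAND, and query_cache_size in bytes.

```python
def query_cache_problems(variables):
    """Return a list of problems found in the reported query cache settings.

    `variables` maps variable names to the string values the server reports,
    e.g. the rows of SHOW VARIABLES LIKE 'query_cache%' converted to a dict.
    This function is a hypothetical helper, not part of Presidio.
    """
    problems = []
    # The server reports the type as OFF, ON, or DEMAND (0, 1, 2 in my.cnf).
    if variables.get("query_cache_type", "OFF").upper() not in ("ON", "1"):
        problems.append("query_cache_type is not ON")
    # A size of 0 disables the cache even when the type is ON.
    if int(variables.get("query_cache_size", "0")) == 0:
        problems.append("query_cache_size is 0")
    return problems


# Values matching the settings suggested above (32M = 33554432 bytes).
reported = {
    "query_cache_limit": "1048576",
    "query_cache_size": "33554432",
    "query_cache_type": "ON",
}
print(query_cache_problems(reported))
```

An empty list means both checks passed; anything else names the setting to revisit.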

This shift also allows something we've been discussing for years: a 'hasword' query in a key. It's not fully up to the new spec, but will be in the next release.