A Sleeper instance contains one or more tables. Each table must have a name and schema. A table also has a state store for storing metadata about the table, and it can be taken offline to disable certain background operations.
All resources for the instance, such as the S3 bucket used for storing data in a table, ECS clusters and lambda functions are shared across all the tables.
Each table has metadata associated with it. This metadata is stored in a state store and consists of information about the files that are in the system and the partitions. See the design documentation for more information.
The implementation of this can be chosen in the table property sleeper.table.statestore.classname, but usually this
should be left as the default value.
Sleeper can apply processing to table data such that all data in the table is seen to have that processing pre-applied. For example, this can be used to combine rows with the same values for row keys and sort keys, or to age off old data. See more information on this in the data processing document.
Scripts can be used to add, rename and delete tables in a Sleeper instance. If using the scripts, creating a new table will consist of the following steps:
- Use the estimateSplitPoints.sh script to estimate split points from your data.
- Use the addTable.sh script to create the table.
- Use the reinitialiseTable.sh script with the split points from the first step.
- Use the sendToIngestBatcher.sh script to send your data to the ingest batcher to be added to the table.
All of these scripts will rely on a schema for your table, which should be created first. See creating a schema for how to set up a schema for your table.
We also have scripts to rename and delete a table, and to take it offline / online. You can also edit table properties
with adminClient.sh.
Here's an example of how you might use these together to create and add data to a table:
cat ./scripts/templates/schema.template
{
"rowKeyFields": [
{
"name": "key",
"type": "StringType"
}
],
"valueFields": [
{
"name": "value",
"type": "StringType"
}
]
}
ID=my-instance-id
./scripts/utility/estimateSplitPoints.sh ./scripts/templates/schema.template 128 100000 32768 splits.file s3a://my-bucket/file.parquet
./scripts/utility/addTable.sh $ID table1
./scripts/utility/reinitialiseTable.sh $ID table1 true splits.file
./scripts/utility/sendToIngestBatcher.sh $ID table1 my-bucket/file.parquet
We'll look at the table scripts below. See the ingest batcher documentation for more information on
sendToIngestBatcher.sh.
Before you create a Sleeper table, consider pre-splitting partitions for the table. If you do not do this, your state
store will be initialised with a single root partition. When a bulk import is submitted, the system will pre-split the
partitions automatically to a minimum number, set in the table property sleeper.table.bulk.import.min.leaf.partitions.
This assumes the bulk import job contains a representative
sample of data. If multiple bulk import jobs are submitted simultaneously, they will attempt to pre-split separately,
which can waste compute resources. It is often worthwhile to pre-split the table yourself.
One way to do this is by taking a sample of your data to generate a split points file:
./scripts/utility/estimateSplitPoints.sh <schema-file> <num-partitions> <read-max-rows-per-file> <sketch-size> <output-split-points-file> <parquet-paths-as-separate-args>
The schema file should be the schema.json file you created for your table.
You can calculate the number of partitions by dividing the total number of rows you expect for your table by the average number of rows you want per partition.
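The division above can be sketched in shell. This is only an illustration: the function name estimate_num_partitions and the example row counts are hypothetical, and rounding up is an assumption made here so that the average partition does not exceed the target size.

```shell
# Hypothetical helper (not part of Sleeper): estimate the partition count.
# $1 = total rows expected in the table
# $2 = average rows wanted per partition
# Rounds up, which is an assumption of this sketch.
estimate_num_partitions() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Example: 1 billion rows, aiming for about 10 million rows per partition.
estimate_num_partitions 1000000000 10000000
```

The result can then be passed as the <num-partitions> argument of estimateSplitPoints.sh.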
The estimate will be based on the given number of rows from the start of each input file. If your data is such that the beginning of a file will not be representative of the distribution of row keys, you can either read more rows, or prepare a representative sample first.
The sketch size controls the size and accuracy of the data sketches used to estimate the split points. It should be a power of 2 greater than 2 and less than 65536. See the Apache DataSketches documentation for more information:
https://datasketches.apache.org/docs/Quantiles/ClassicQuantilesSketch.html
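If you are scripting this step, it may help to check the sketch size before running the estimate. The following is a minimal sketch of the rule stated above (a power of 2, greater than 2 and less than 65536). The function name is_valid_sketch_size is hypothetical and is not part of Sleeper.

```shell
# Hypothetical helper (not part of Sleeper): succeeds only if the argument
# is a power of 2 that is greater than 2 and less than 65536.
is_valid_sketch_size() {
  n=$1
  case "$n" in
    ''|0*|*[!0-9]*) return 1 ;;  # reject empty, zero-prefixed or non-numeric input
  esac
  # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
  [ "$n" -gt 2 ] && [ "$n" -lt 65536 ] && [ $(( n & (n - 1) )) -eq 0 ]
}

if is_valid_sketch_size 32768; then
  echo "sketch size ok"
fi
```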
The paths to your sample data can be specified as a path in your local file system, or you can use the s3a:// scheme to
give a path in an S3 bucket like s3a://my-bucket/my-prefix/file.parquet.
You can apply the resulting split points when adding a table by setting an absolute path to the output file in the
table property sleeper.table.splits.file. If you've created a table but haven't added any data yet, you can apply a
change to this by reinitialising the table. In the future it will not be necessary to set this property when using the
instance configuration folder structure (see issue #583).
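As an illustration, the property could be written into a table properties file with a small helper like the one below. This is a sketch under assumptions: the function set_property is hypothetical, and it assumes the file uses plain key=value lines with no spaces around the equals sign.

```shell
# Hypothetical helper (not part of Sleeper): set key=value in a properties
# file, replacing any existing line that starts with "key=".
# Assumes plain key=value lines; other properties syntax is not handled.
set_property() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp)
  # Copy every line except an existing assignment of this key.
  if [ -f "$file" ]; then
    awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the original file permissions are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Possible usage, run from the scripts directory before addTable.sh:
# set_property templates/tableproperties.template sleeper.table.splits.file "$(realpath splits.file)"
```

The usage line is left commented out because it edits the template in place. It uses realpath because the property needs an absolute path.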
The addTable.sh script will create a new table with properties defined in templates/tableproperties.template, and a
schema defined in templates/schema.template. Currently any changes must be done in those templates or in the admin
client. We will add support for declarative deployment in the future.
cd scripts
editor templates/tableproperties.template
editor templates/schema.template
./utility/addTable.sh <instance-id> <table-name>
Reinitialising a table means deleting all its contents. This can sometimes be useful when you are experimenting with Sleeper or if you created a table with the wrong schema.
You can reinitialise the table quickly by running the following command:
./scripts/utility/reinitialiseTable.sh <instance-id> <table-name> <optional-delete-partitions-true-or-false> <optional-split-points-file-location> <optional-split-points-file-base64-encoded-true-or-false>
For example:
./scripts/utility/reinitialiseTable.sh sleeper-my-sleeper-config my-sleeper-table true /tmp/split-points.txt false
If you want to change the table schema you'll need to change it directly in the table properties file in the S3 config bucket, and then reinitialise the table. An alternative is to delete the table and create a new table with the same name.
You can rename or delete a table using the following commands:
./scripts/utility/renameTable.sh <instance-id> <old-table-name> <new-table-name>
./scripts/utility/deleteTable.sh <instance-id> <table-name>
You can also pass --force as an additional argument to deleteTable.sh to skip the prompt to confirm you wish to delete
all the data. This will permanently delete all data held in the table, as well as metadata.
You can take a table offline or put it online with the following commands:
./scripts/utility/takeTableOffline.sh <instance-id> <table-name>
./scripts/utility/putTableOnline.sh <instance-id> <table-name>
These scripts will set the table property sleeper.table.online, and update an index of table status to match.
You are still able to ingest files to offline tables, and perform queries against them. Here are some operations that will not run for offline tables:
- Compaction job creation
- Partition splitting
- State store snapshot creation/deletion
- State store transaction deletion
- Garbage collection, unless the instance property sleeper.run.gc.offline is set to true
- Table metrics computation, unless the instance property sleeper.run.table.metrics.offline is set to true