diff --git a/.github/workflows/actions.yml b/.github/workflows/actions.yml
index f75fb1b..645edba 100644
--- a/.github/workflows/actions.yml
+++ b/.github/workflows/actions.yml
@@ -57,20 +57,6 @@ jobs:
       run: |
         pytest test --run-slow
 
-  docs:
-    needs: [test, check-style]
-    if: ${{ github.ref == 'refs/heads/main' }}
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout 🔖
-        uses: actions/checkout@v3
-      - name: Deploy docs
-        uses: mhausenblas/mkdocs-deploy-gh-pages@master
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          CONFIG_FILE: mkdocs.yml
-          REQUIREMENTS: requirements-docs.txt
-
   pypi-release:
     needs: [test, check-style]
     if: ${{ github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') }}
diff --git a/README.md b/README.md
index 2fec481..83096b7 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,26 @@
-<div align="center">
-    <img src="./docs/img/logo.jpg" alt="drawing"/>
-</div>
+# `dac`: A CLI Helper Tool for Data as Code

-# `dac`: Data as Code

+[Data as Code](https://data-as-code.github.io/docs/) (DaC) is a paradigm of distributing versioned data as code.

-Data-as-Code (DaC) `dac` is a tool that supports the distribution of data as (python) code.
-
-<div align="center">
-    <img src="./docs/img/motto.png" alt="drawing"/>
-</div>
+`dac` is a tool that [supports the Producer](https://data-as-code.github.io/docs/#3-use-the-dac-cli-tool).

 **IMPORTANT**: Currently the project is in the initial development phase; this is why releases are marked as `0.y.z` (following [semantic versioning 2.0.0](https://semver.org/): "Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.").

 While in this phase, we will denote breaking changes with a minor version increase.

-## 📔 [User documentation](https://data-as-code.github.io/dac/)
+## Quickstart
+
+You can install `dac` as a regular python package:
+
+```sh
+python -m pip install dac
+```
+
+Then use the built-in help to explore its functionality:
+```sh
+dac --help
+```

 ## Setup development environment (for contributors only)
diff --git a/docs/examples.md b/docs/examples.md
deleted file mode 100644
index 45b3600..0000000
--- a/docs/examples.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Examples
-
-In [Energy DaC](https://gitlab.com/data-as-code/energy-dac-example) you can pip install some energy-related data as code.
-The README will guide you through a demo.
-You can also inspect the repo to see how the DaC package was built using `dac`.
diff --git a/docs/img/logo.jpg b/docs/img/logo.jpg
deleted file mode 100644
index 9116edb..0000000
Binary files a/docs/img/logo.jpg and /dev/null differ
diff --git a/docs/img/motto.png b/docs/img/motto.png
deleted file mode 100644
index 41d4e7a..0000000
Binary files a/docs/img/motto.png and /dev/null differ
diff --git a/docs/index.md b/docs/index.md
deleted file mode 100644
index 8250364..0000000
--- a/docs/index.md
+++ /dev/null
@@ -1,80 +0,0 @@
-# `dac`: Data as Code
-
-<div align="center">
-    <img src="img/logo.jpg" alt="drawing"/>
-</div>
-
-Data-as-Code (DaC) `dac` is a tool that supports the distribution of data as (python) code.
-
-<div align="center">
-    <img src="img/motto.png" alt="drawing"/>
-</div>
-
-## How will the Data Scientists use a DaC package?
-
-Say that the Data Engineers prepared the `demo-data` as code for you. Then you will install the code in your environment
-```sh
-python -m pip install demo-data
-```
-and then you will be able to access the data simply with
-```python
-from demo_data import load
-
-data = load()
-```
-
-Data can be in any format. There is no constraint of any kind.
-
-Not only will accessing the data be this easy: depending on how the data were prepared, you may also have access to useful metadata. How?
-```python
-from demo_data import Schema
-```
-
-With the schema you could, for example:
-
-* access the column names (e.g. `Schema.my_column`)
-* unit test your functions by getting a data example with `Schema.example()`
-
-## How can a Data Engineer provide a DaC python package?
-
-Install this library
-```sh
-python -m pip install dac
-```
-and use the command `dac pack` (run `dac pack --help` for detailed instructions).
-
-On a high level, the most important elements you must provide are:
-
-* python code to load the data
-* a `Schema` class that at the very least contains a `validate` method, but possibly also
-
-    - data field names (column names, if data is tabular)
-    - an `example` method
-
-* python dependencies
-
-!!! hint "Use `pandera` to define the Schema"
-
-    If the data type you are using is supported by [`pandera`](https://pandera.readthedocs.io/en/stable/index.html), consider using a [`DataFrameModel`](https://pandera.readthedocs.io/en/stable/dataframe_models.html) to define the Schema.
-
-## What are the advantages of distributing data in this way?
-
-* The code needed to load the data, the data source, and the data location are abstracted away from the user.
-  This means that the data engineer can start from local files and later transition to a SQL database, cloud file storage, or a Kafka topic, without the user noticing or needing to adapt their code.
-
-* *If you provide data field names in `Schema`* (e.g. `Schema.column_1`), the user code will not contain hard-coded column names, and changes in data source field names won't impact the user.
-
-* *If you provide the `Schema.example` method*, users will be able to build robust code by writing unit tests for their functions effortlessly.
-
-* Semantic versioning can be used to communicate significant changes:
-
-    * a patch update corresponds to a fix in the data: its intended content is unchanged
-    * a minor update corresponds to a change in the data that does not break the schema
-    * a major update corresponds to a change in the schema, or any other breaking change
-
-    In this way data pipelines can subscribe to the appropriate updates (e.g. `python -m pip install "demo-data~=1.0"` accepts `1.X.Y` updates but not `2.0.0`). Furthermore, it is easy to keep releasing data updates while maintaining backward compatibility (one can keep deploying `1.X.Y` updates even after version `2` has been rolled out).
-
-* Descriptions of the data and columns can be included in the schema, and will therefore reach the user together with the data.
-
-* Users will always know where to look for data: the PyPI index.
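For illustration, here is a minimal sketch of the two sides described in the removed `docs/index.md` above. Everything in it is hypothetical (the `demo_data` package, the `city`/`consumption_kwh` columns, and the file names): it shows the pattern, not the actual `dac` API. As the removed hint suggests, `pandera`'s `DataFrameModel` is used for the Schema; it already provides a `validate` classmethod, and `example` is hand-rolled here so the sketch does not depend on pandera's hypothesis-based strategies extra.

```python
# schema.py -- producer side: a Schema a DaC package might ship (hypothetical).
import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    # Field names double as column names: consumers can write df[Schema.city]
    city: Series[str] = pa.Field()
    consumption_kwh: Series[float] = pa.Field(ge=0)

    @classmethod
    def example(cls) -> pd.DataFrame:
        # A tiny, schema-valid frame for consumers' unit tests
        return cls.validate(
            pd.DataFrame({"city": ["Utrecht"], "consumption_kwh": [12.5]})
        )
```

Under the same assumptions, the "effortless unit testing" claim looks like this on the consumer side:

```python
# test_pipeline.py -- consumer side: no data source or I/O required.
from demo_data import Schema  # hypothetical DaC package


def double_consumption(df):
    # The column name comes from the schema, not a hard-coded string
    return df.assign(**{Schema.consumption_kwh: df[Schema.consumption_kwh] * 2})


def test_double_consumption():
    example = Schema.example()
    doubled = double_consumption(example)
    assert (
        doubled[Schema.consumption_kwh] == example[Schema.consumption_kwh] * 2
    ).all()
```

Because the test consumes `Schema.example()` rather than real data, it stays fast and hermetic, and the schema remains the single place where column names and constraints are defined.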