The Velexi Dataset Cookiecutter is intended to streamline the process of creating a dataset that

* improves the reproducibility of research analysis (including machine learning experiments, data science analysis, and traditional scientific and engineering studies) by applying version control principles to datasets,
* facilitates efficient exploration of datasets by standardizing the directory structure used to organize data,
* increases reuse of datasets across projects by decoupling datasets from data analysis code, and
* simplifies dataset maintenance by keeping dataset management code (e.g., clean up scripts) with the dataset.
* A simple, consistent dataset directory structure
* Quick references for dataset maintenance tools (e.g., FastDS)
* Pre-configured to support development of Python software tools
* Integration with code and data quality tools (e.g., pre-commit)
2.1. License
2.2. Repository Contents
2.4. Setting Up to Develop the Cookiecutter
2.5. Additional Notes
`dataset_name`
: dataset name

`author`
: dataset's primary author (or maintainer)

`email`
: primary author's (or maintainer's) email

`dataset_license`
: type of license to use for the dataset

`software_license`
: type of license to use for supporting software

`python_version`
: Python versions compatible with the project. See the "Dependency specification" section of the Poetry documentation for version specifier semantics.
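If you prefer to supply these parameters on the command line instead of answering the interactive prompts, `cookiecutter` accepts them as `key=value` arguments together with `--no-input`. A minimal sketch (the parameter values below are hypothetical):

```shell
$ cookiecutter https://github.com/velexi-research/VLXI-Cookiecutter-Dataset.git \
    --no-input \
    dataset_name=my-dataset \
    author="Jane Doe" \
    email=jane.doe@example.com \
    python_version="^3.9"
```

Parameters omitted from the command line fall back to the defaults defined in `cookiecutter.json`.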
Prerequisites

* Install Git.
* Install Python 3.9 (or greater).
* Install Poetry 1.2 (or greater). Note. The dataset template uses `poetry` instead of `pip` for management of Python package dependencies.
* Install the Cookiecutter Python package.
* Optional. Install `direnv`.
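One possible way to install the Python-based prerequisites (a sketch, assuming `python3` and `pip` are already available; your preferred installation method may differ):

```shell
# Install Poetry using the official installer
$ curl -sSL https://install.python-poetry.org | python3 -

# Install the Cookiecutter package
$ python3 -m pip install --user cookiecutter
```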
Use `cookiecutter` to create a new Python project.
$ cookiecutter https://github.com/velexi-research/VLXI-Cookiecutter-Dataset.git
Set up a dedicated virtual environment for the project. Any of the common virtual environment options (e.g., `venv`, `direnv`, `conda`) should work. Below are instructions for setting up a `direnv` or `poetry` environment. Note: to avoid conflicts between virtual environments, only one method should be used to manage the virtual environment.
`direnv` Environment. Note: `direnv` manages the environment for both Python and the shell.
Prerequisite. Install `direnv`.
Copy `extras/dot-envrc` to the project root directory, and rename it to `.envrc`.
$ cd $PROJECT_ROOT_DIR
$ cp extras/dot-envrc .envrc
Grant permission to `direnv` to execute the `.envrc` file.
$ direnv allow
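For reference, `extras/dot-envrc` is expected to set up a Python layout for the project. A minimal `.envrc` of that kind might look like the following sketch (the actual file shipped with the cookiecutter may contain additional settings):

```shell
# Create/activate a direnv-managed Python virtual environment
layout python3

# Add project-local scripts to the shell PATH (illustrative)
PATH_add bin
```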
`poetry` Environment. Note: `poetry` only manages the Python environment (it does not manage the shell environment).
Create a `poetry` environment that uses a specific Python executable. For instance, if `python3` is on your `PATH`, the following command creates (or activates if it already exists) a Python virtual environment that uses `python3` for the project.
$ poetry env use python3
For commands to use other Python executables for the virtual environment, see the Poetry Quick Reference.
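For illustration, two other forms that `poetry env use` accepts (the version and path shown are hypothetical):

```shell
# Use a specific Python version that Poetry can find on the PATH
$ poetry env use 3.10

# Or point to an explicit interpreter path
$ poetry env use /usr/local/bin/python3.10
```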
Install the Python package dependencies (e.g., pre-commit, DVC, FastDS).
$ poetry install
Configure Git.
Install the Git pre-commit hooks.
$ pre-commit install
Optional. Set up a remote Git repository (e.g., GitHub repository).
Create a remote Git repository.
Configure the remote Git repository.
$ git remote add origin GIT_REMOTE
where `GIT_REMOTE` is the URL of the remote Git repository.
Push the `main` branch to the remote Git repository.
$ git checkout main
$ git push -u origin main
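For instance, with a hypothetical GitHub repository named `my-dataset` under the `USERNAME` account, these steps would look like:

```shell
$ git remote add origin git@github.com:USERNAME/my-dataset.git
$ git checkout main
$ git push -u origin main
```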
Optional. Configure remote storage for DVC (e.g., an AWS S3 bucket).
Create remote storage for the dataset. Below are instructions for setting up storage on the local file system or AWS S3.
Local File System. Create a directory that DVC can use to store a copy of the dataset (outside of the working dataset directory).
AWS S3. Create an S3 bucket that DVC can use for remote storage.
Configure the remote DVC storage for the dataset.
$ dvc remote add -d origin DVC_REMOTE
$ fds commit "Add DVC remote storage."
where `DVC_REMOTE` is the URL of the remote storage for the dataset (e.g., the path to a directory on the local file system or the URL to the S3 bucket); see the illustrative examples below.
Important Note. The name of the remote storage must be set to “origin” if `fds` is used to push data to remote storage. If the name is not set to “origin”, `dvc` must be used to push data to remote storage.
Configure the credentials that DVC should use to connect to remote storage.
Note: the `--local` option ensures that these DVC configurations are stored in a local configuration file (`.dvc/config.local`) that should not be committed to the Git repository.
AWS S3. Set the AWS profile, AWS access keys, or credentials file. See DVC Documentation: Amazon S3 and Compatible Servers for more options. For example, to set the AWS profile (to use for accessing the “origin” remote storage), use the following command.
$ dvc remote modify --local origin profile AWS_PROFILE
where `AWS_PROFILE` is the AWS profile that should be used to access S3.
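Alternatively, access keys can be configured directly (a sketch; the key values are placeholders and, like the profile setting, are stored only in `.dvc/config.local`):

```shell
$ dvc remote modify --local origin access_key_id 'AKIA...'
$ dvc remote modify --local origin secret_access_key 'SECRET...'
```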
Finish setting up the new dataset.
Verify the copyright year and owner in the copyright notice.
Note. If the software components of the dataset are licensed under Apache License 2.0, the software copyright notice is located in the `NOTICE` file. Otherwise, the software copyright notice is located in the `LICENSE` file.
Update the Python package dependencies to the latest available versions.
$ poetry update
Fill in any empty fields in `pyproject.toml`.
Customize the `README.md` file to reflect the specifics of the dataset.
Commit all updated files (e.g., `poetry.lock`) to the dataset Git repository.
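For example (the file list is illustrative; commit whichever files you actually modified):

```shell
$ git add pyproject.toml poetry.lock README.md
$ git commit -m "Customize dataset project files"
```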
When the dataset includes data from third-party sources, be sure to include a reference to the source and license information (if available) in the `DATASET-NOTICE` file.
The cookiecutter currently only supports two DVC remote storage providers: (1) AWS S3 and (2) the local file system. To use one of the other remote storage providers supported by DVC, use the following steps.
Select `None` when `cookiecutter` prompts you for the `dvc_remote_storage_provider`.
Add the optional dependencies of the `dvc` Python package that are required for the DVC remote storage type. For instance, to install the packages for supporting Microsoft Azure, use
$ poetry add dvc[azure]
Follow Step #6 from Section #1.2 using your choice of DVC remote storage.
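Continuing the Microsoft Azure example, the remote storage configuration in Step #6 might then look like the following sketch (the container name, path, and connection string are hypothetical):

```shell
$ dvc remote add -d origin azure://my-container/my-dataset
$ dvc remote modify --local origin connection_string 'AZURE_STORAGE_CONNECTION_STRING'
```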
The contents of this cookiecutter are covered under the Apache License 2.0 (included in the `LICENSE` file). The copyright for this cookiecutter is contained in the `NOTICE` file.
```
├── README.md          <- this file
├── RELEASE-NOTES.md   <- cookiecutter release notes
├── LICENSE            <- cookiecutter license
├── NOTICE             <- cookiecutter copyright notice
├── cookiecutter.json  <- cookiecutter configuration file
├── pyproject.toml     <- Python project metadata file for cookiecutter
│                         development
├── poetry.lock        <- Poetry lockfile
├── docs/              <- cookiecutter documentation
├── extras/            <- additional files that may be useful for
│                         cookiecutter development
├── hooks/             <- cookiecutter scripts that run before and/or
│                         after project generation
├── spikes/            <- experimental code
└── /                  <- cookiecutter template
```
See the `[tool.poetry.dependencies]` section of `pyproject.toml`.
Set up a dedicated virtual environment for cookiecutter development.
See Step 3 from Section 2.1 for instructions on how to set up `direnv` and `poetry` environments.
Install the Python packages required for development.
```shell
$ poetry install
```
Install the Git pre-commit hooks.
$ pre-commit install
Make the cookiecutter better!
To update the Python dependencies for the template (contained in the `` directory), use the following procedure to ensure that package dependencies for developing the non-template components of the cookiecutter do not interfere with package dependencies for the template.
Create a local clone of the cookiecutter Git repository to use for cookiecutter development.
$ git clone git@github.com:velexi-research/VLXI-Cookiecutter-Dataset.git
Use `cookiecutter` from the local cookiecutter Git repository to create a clean project for template dependency updates.
$ cookiecutter PATH/TO/LOCAL/REPO
In the pristine project, perform the following steps to update the template’s package dependencies.
Set up a virtual environment for developing the template (e.g., a direnv environment).
Use `poetry` or manually edit `pyproject.toml` to (1) make changes to the package dependency list and (2) update the package dependency versions.
Use `poetry` to update the package dependencies and versions recorded in the `poetry.lock` file (see the sketch below).
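A sketch of what these two sub-steps might look like with `poetry` (the package name is illustrative):

```shell
# (1) Example dependency-list change: add an optional DVC extra
$ poetry add "dvc[s3]"

# (2) Refresh poetry.lock with the latest allowed versions
$ poetry update
```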
Update /pyproject.toml.
Copy `pyproject.toml` from the pristine project to /pyproject.toml.
Restore the templated values in the `[tool.poetry]` section to the following:
[tool.poetry]
name = "{{ cookiecutter.__project_name }}"
version = "0.1.0"
description = ""
license = "{% if cookiecutter.license == 'Apache License 2.0' %}Apache-2.0{% elif cookiecutter.license == 'BSD-3-Clause License' %}BSD-3-Clause{% elif cookiecutter.license == 'MIT License' %}MIT{% endif %}"
readme = "README.md"
authors = ["{{ cookiecutter.author }} <{{ cookiecutter.email }}>"]
Update /poetry.lock.
Copy `poetry.lock` from the pristine project to /poetry.lock.
Commit the updated `pyproject.toml` and `poetry.lock` files to the Git repository.
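A rough sketch of these final copy-and-commit steps (all paths are hypothetical: `PRISTINE_PROJECT` stands for the generated project and `TEMPLATE_DIR` for the cookiecutter's template directory; remember to restore the templated `[tool.poetry]` values before committing):

```shell
# Copy the updated files from the pristine project into the template
$ cp PRISTINE_PROJECT/pyproject.toml TEMPLATE_DIR/pyproject.toml
$ cp PRISTINE_PROJECT/poetry.lock TEMPLATE_DIR/poetry.lock

# After restoring the templated [tool.poetry] values, commit the changes
$ git add TEMPLATE_DIR/pyproject.toml TEMPLATE_DIR/poetry.lock
$ git commit -m "Update template dependencies"
```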