VLXI-Cookiecutter-Dataset

Velexi Dataset Cookiecutter

Dataset = Data + Supporting Software Tools

The Velexi Dataset Cookiecutter is intended to streamline the process of creating a dataset.

Features


Table of Contents

  1. Usage

    1.1. Cookiecutter Parameters

    1.2. Setting Up a New Dataset

    1.3. Dataset License Considerations

    1.4. Using an Unsupported DVC Remote Storage Provider

  2. Contributor Notes

    2.1. License

    2.2. Repository Contents

    2.3. Software Requirements

    2.4. Setting Up to Develop the Cookiecutter

    2.5. Additional Notes

  3. References


1. Usage

1.1. Cookiecutter Parameters

1.2. Setting Up a New Dataset

  1. Prerequisites

    • Install Git.

    • Install Python 3.9 (or greater).

    • Install Poetry 1.2 (or greater).

      Note. The dataset template uses poetry instead of pip for management of Python package dependencies.

    • Install the Cookiecutter Python package.

    • Optional. Install direnv.

  2. Use cookiecutter to create a new dataset.

    $ cookiecutter https://github.com/velexi-research/VLXI-Cookiecutter-Dataset.git
    
  3. Set up a dedicated virtual environment for the project. Any of the common virtual environment options (e.g., venv, direnv, conda) should work. Below are instructions for setting up a direnv or poetry environment.

    Note: to avoid conflicts between virtual environments, only one method should be used to manage the virtual environment.

    • direnv Environment. Note: direnv manages the environment for both Python and the shell.

      • Prerequisite. Install direnv.

      • Copy extras/dot-envrc to the project root directory, and rename it to .envrc.

        $ cd $PROJECT_ROOT_DIR
        $ cp extras/dot-envrc .envrc
        
      • Grant permission to direnv to execute the .envrc file.

        $ direnv allow
        
    • poetry Environment. Note: poetry only manages the Python environment (it does not manage the shell environment).

      • Create a poetry environment that uses a specific Python executable. For instance, if python3 is on your PATH, the following command creates (or activates if it already exists) a Python virtual environment that uses python3 for the project.

        $ poetry env use python3
        

        For commands to use other Python executables for the virtual environment, see the Poetry Quick Reference.

  4. Install the Python package dependencies (e.g., pre-commit, DVC, FastDS).

    $ poetry install
    
  5. Configure Git.

    • Install the Git pre-commit hooks.

      $ pre-commit install
      
    • Optional. Set up a remote Git repository (e.g., GitHub repository).

      • Create a remote Git repository.

      • Configure the remote Git repository.

        $ git remote add origin GIT_REMOTE
        

        where GIT_REMOTE is the URL of the remote Git repository.

      • Push the main branch to the remote Git repository.

        $ git checkout main
        $ git push -u origin main
        
  6. Optional. Configure remote storage for DVC (e.g., an AWS S3 bucket).

    • Create remote storage for the dataset. Below are instructions for setting up storage on the local file system or AWS S3.

      • Local File System. Create a directory that DVC can use to store a copy of the dataset (outside of the working dataset directory).

      • AWS S3. Create an S3 bucket that DVC can use for remote storage.

    • Configure the remote DVC storage for the dataset.

      $ dvc remote add -d origin DVC_REMOTE
      $ fds commit "Add DVC remote storage."
      

      where DVC_REMOTE is the URL of the remote storage for the dataset (e.g., the path to a directory on the local file system or the URL to the S3 bucket).

      Important Note. The name of the remote storage must be set to “origin” if fds is used to push data to remote storage. If the name is not set to “origin”, dvc must be used to push data to remote storage.

    • Configure the credentials that DVC should use to connect to remote storage. Note: the --local option ensures that these DVC configurations are stored in a local configuration file (.dvc/config.local) that should not be committed to the Git repository.

      • AWS S3. Set the AWS profile, AWS access keys, or credentials file. See DVC Documentation: Amazon S3 and Compatible Servers for more options. For example, to set the AWS profile (to use for accessing the “origin” remote storage), use the following command.

        $ dvc remote modify --local origin profile AWS_PROFILE
        

        where AWS_PROFILE is the AWS profile that should be used to access S3.

  7. Finish setting up the new dataset.

    • Verify the copyright year and owner in the copyright notice.

      Note. If the software components of the dataset are licensed under Apache License 2.0, the software copyright notice is located in the NOTICE file. Otherwise, the software copyright notice is located in the LICENSE file.

    • Update the Python package dependencies to the latest available versions.

      $ poetry update
      
    • Fill in any empty fields in pyproject.toml.

    • Customize the README.md file to reflect the specifics of the dataset.

    • Commit all updated files (e.g., poetry.lock) to the dataset Git repository.
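
The setup above can be sanity-checked with a few small shell helpers. The following is a minimal sketch, not part of the cookiecutter: the function names `check_tool`, `check_prerequisites`, and `has_origin_remote` are hypothetical, and the tool names passed in the usage comments assume the executables are called `git`, `python3`, `poetry`, and `cookiecutter` on your system. The helpers only test for presence on the PATH (they do not verify the minimum versions from Step 1) and for a DVC remote named "origin", which Step 6 requires when fds is used to push data.

```shell
# Report whether a command is available on PATH (exit status 0 if found).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check several tools and report how many are missing.
# Exit status is 0 only when every tool was found.
check_prerequisites() {
    missing=0
    for tool in "$@"; do
        check_tool "$tool" || missing=$((missing + 1))
    done
    echo "$missing prerequisite(s) missing"
    [ "$missing" -eq 0 ]
}

# Read `dvc remote list` output on stdin and succeed if the first field
# of any line is "origin".
has_origin_remote() {
    awk '$1 == "origin" { found = 1 } END { exit !found }'
}

# Example usage (run from the dataset root directory):
#   check_prerequisites git python3 poetry cookiecutter
#   dvc remote list | has_origin_remote || echo "fds needs a remote named origin"
```

Reading the remote list from standard input keeps `has_origin_remote` independent of how DVC is installed, so it can be exercised without a configured dataset.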

1.3. Dataset License Considerations

When the dataset includes data from third-party sources, be sure to include a reference to the source and license information (if available) in the DATASET-NOTICE file.

1.4. Using an Unsupported DVC Remote Storage Provider

The cookiecutter currently only supports two DVC remote storage providers: (1) AWS S3 and (2) the local file system. To use one of the other remote storage providers supported by DVC, use the following steps.


2. Contributor Notes

2.1. License

The contents of this cookiecutter are covered under the Apache License 2.0 (included in the LICENSE file). The copyright notice for this cookiecutter is contained in the NOTICE file.

2.2. Repository Contents

├── README.md               <- this file
├── RELEASE-NOTES.md        <- cookiecutter release notes
├── LICENSE                 <- cookiecutter license
├── NOTICE                  <- cookiecutter copyright notice
├── cookiecutter.json       <- cookiecutter configuration file
├── pyproject.toml          <- Python project metadata file for cookiecutter
│                              development
├── poetry.lock             <- Poetry lockfile
├── docs/                   <- cookiecutter documentation
├── extras/                 <- additional files that may be useful for
│                              cookiecutter development
├── hooks/                  <- cookiecutter scripts that run before and/or
│                              after project generation
├── spikes/                 <- experimental code
└── /  <- cookiecutter template

2.3. Software Requirements

Base Requirements

Optional Packages

Python Packages

See [tool.poetry.dependencies] section of pyproject.toml.

2.4. Setting Up to Develop the Cookiecutter

  1. Set up a dedicated virtual environment for cookiecutter development. See Step 3 from Section 1.2 for instructions on how to set up direnv and poetry environments.

  2. Install the Python packages required for development.

    $ poetry install

  3. Install the Git pre-commit hooks.

    $ pre-commit install
    
  4. Make the cookiecutter better!
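
To confirm that Step 3 took effect, one option is to check that a hook script now exists in the repository. This is a rough sketch under stated assumptions: the helper name `has_precommit_hook` is hypothetical, it assumes the default Git hooks location (`.git/hooks`), and it only checks that the file exists, not that pre-commit generated it.

```shell
# Succeed if a pre-commit hook script exists in the given repository.
# Assumes the default hooks directory (.git/hooks); a custom
# core.hooksPath setting is not handled.
has_precommit_hook() {
    repo_dir=${1:-.}
    [ -f "$repo_dir/.git/hooks/pre-commit" ]
}

# Example usage from the cookiecutter root directory:
#   has_precommit_hook . && echo "pre-commit hook is installed"
```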

2.5. Additional Notes

Updating Cookiecutter Template Dependencies

To update the Python dependencies for the template (contained in the `` directory), use the following procedure to ensure that package dependencies for developing the non-template components of the cookiecutter do not interfere with package dependencies for the template.


3. References