The Velexi Research Project Cookiecutter is intended to streamline the process of setting up a Jupyter-based research project involving computational work (but not necessarily centered on data science or machine learning models). The structure of this research project template is inspired by Cookiecutter Data Science, Kuygen Tran’s Data Science Template, and the blog article “Jupyter Notebook Best Practices for Data Science” by Jonathan Whitmore.
Support for common research workflows (for both individuals and teams)
A directory structure that organizes and separates different components and stages of research: data, exploration/experimentation (e.g., Jupyter notebooks), documentation (e.g., reports, references), and software (e.g., custom functions and test code)
Integration with tools that encourage code, data, and scientific quality while promoting research efficiency.
Quick references for software tools (e.g., FastDS, MLflow, Poetry)
Support for the Julia programming language
Python package and dependency management using Poetry
Directory-based development environment isolation with direnv
1.2. Setting Up a New Research Project
1.3. Publishing Project Documentation to GitHub Pages
1.4. Known Issues
2.1. License
2.2. Repository Contents
2.4. Setting Up to Develop the Cookiecutter
2.5. Additional Notes
project_name
: project name
author
: project’s primary author
email
: primary author’s email
license
: type of license to use for the project
python_version
: Python versions compatible with the project. See the “Dependency specification” section of the Poetry documentation for version specifier semantics.
enable_julia
: flag indicating whether Julia should be enabled for the project
Prerequisites
Install Git.
Install Python 3.9 (or greater).
Install Poetry 1.2 (or greater).
Note. The project template uses `poetry` instead of `pip` to manage Python package dependencies.
Install the Cookiecutter Python package.
Optional. Install direnv.
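Before creating a project, it can help to confirm that the prerequisite tools are on the `PATH`. The following is a minimal sketch, not part of the cookiecutter itself. The helper name `check_tool` is hypothetical, and the script only checks that each command exists, not that its version meets the minimums listed above.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# Prints "found: NAME" or "missing: NAME"; returns non-zero when missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Required tools from the prerequisites list.
missing=0
for tool in git python3 poetry cookiecutter; do
    check_tool "$tool" || missing=$((missing + 1))
done

# direnv is optional, so a missing direnv is not counted.
check_tool direnv || echo "(direnv is optional)"

echo "required tools missing: $missing"
```

Version checks (Python 3.9 or greater, Poetry 1.2 or greater) still need to be done separately, for example with `python3 --version` and `poetry --version`.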
Use `cookiecutter` to create a new research project.
$ cookiecutter https://github.com/velexi-research/VLXI-Cookiecutter-Research.git
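For scripted setups, the template parameters can also be supplied on the command line instead of answering the interactive prompts. The sketch below relies on the `--no-input` option and `key=value` extra-context arguments of the `cookiecutter` CLI. The function name `new_project` is hypothetical, the `license` value is one of the strings that appear in the template's license expression, and `^3.9` is the default Python constraint mentioned under Known Issues. Any parameter that is not passed keeps the template's default.

```shell
#!/bin/sh
# Create a project without interactive prompts.
#   $1: project name   $2: primary author   $3: primary author's email
new_project() {
    cookiecutter --no-input \
        https://github.com/velexi-research/VLXI-Cookiecutter-Research.git \
        project_name="$1" \
        author="$2" \
        email="$3" \
        license="MIT License" \
        python_version="^3.9"
}

# Example call (illustrative values only):
#   new_project "My Research Project" "Ada Lovelace" "ada@example.com"
```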
Set up a dedicated virtual environment for the project. Any of the common virtual environment options (e.g., `venv`, `direnv`, `conda`) should work.
Below are instructions for setting up a `direnv` or `poetry` environment.
Note: to avoid conflicts between virtual environments, only one method should be used to manage the virtual environment.
`direnv` Environment. Note: `direnv` manages the environment for both Python and the shell.
Prerequisite. Install `direnv`.
Copy `extras/dot-envrc` to the project root directory, and rename it to `.envrc`.
$ cd $PROJECT_ROOT_DIR
$ cp extras/dot-envrc .envrc
Grant permission to direnv to execute the .envrc file.
$ direnv allow
`poetry` Environment. Note: `poetry` only manages the Python environment (it does not manage the shell environment).
Create a `poetry` environment that uses a specific Python executable. For instance, if `python3` is on your `PATH`, the following command creates (or activates if it already exists) a Python virtual environment that uses `python3` for the project.
$ poetry env use python3
For commands to use other Python executables for the virtual environment, see the Poetry Quick Reference.
Install the base Python package dependencies.
$ poetry install
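For a `poetry`-managed environment, the two steps above can be combined and followed by a quick check of where the virtual environment lives. This is a sketch only: the function name `setup_poetry_env` is hypothetical, and the interpreter defaults to `python3` when no argument is given.

```shell
#!/bin/sh
# Create (or reuse) the Poetry environment, install the base dependencies,
# then print the path of the virtual environment that Poetry selected.
#   $1: Python executable to use (defaults to python3)
setup_poetry_env() {
    poetry env use "${1:-python3}" &&
    poetry install &&
    poetry env info --path
}

# Example call:
#   setup_poetry_env python3
```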
Configure Git.
Install the Git pre-commit hooks.
$ pre-commit install
Optional. Set up a remote Git repository (e.g., GitHub repository).
Create a remote Git repository.
Configure the remote Git repository.
$ git remote add origin GIT_REMOTE
where `GIT_REMOTE` is the URL of the remote Git repository.
Push the `main` branch to the remote Git repository.
$ git checkout main
$ git push -u origin main
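If the remote configuration step may be run more than once, `git remote add` fails when `origin` already exists. The following sketch handles both cases. The function name `set_origin` is hypothetical and the URL is whatever `GIT_REMOTE` is for the project.

```shell
#!/bin/sh
# Point "origin" at the given URL, adding the remote if it does not exist yet
# and updating it otherwise, then list the configured remotes.
#   $1: URL of the remote Git repository (GIT_REMOTE)
set_origin() {
    if git remote get-url origin >/dev/null 2>&1; then
        git remote set-url origin "$1"
    else
        git remote add origin "$1"
    fi
    git remote -v
}

# Example call (run from the project root; illustrative URL):
#   set_origin "git@github.com:example-user/example-project.git"
```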
Configure DVC.
Initialize DVC (Data Version Control). In the following commands, `PROJECT_DIR` should be replaced by the path to the newly created research project.
Using `fds`.
$ cd PROJECT_DIR
$ fds init
$ fds commit -m "Initialize DVC"
Using `dvc` + `git`.
$ cd PROJECT_DIR
$ dvc init
$ git commit -m "Initialize DVC"
Add a remote DVC repository.
Set up a remote DVC repository (e.g., S3 bucket).
Configure the remote DVC repository.
$ dvc remote add -d storage DVC_REMOTE
where `storage` is the name for the remote repository and `DVC_REMOTE` is the URL to the remote DVC repository. Note: the `-d` option indicates that `storage` should be used as the default remote DVC repository.
Configure DVC to automatically stage changes to `*.dvc` files with Git.
$ dvc config core.autostage true
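The DVC steps above (using the `dvc` + `git` variant) can be collected into one function. This is a sketch under the assumption that the remote DVC repository already exists. The function name `init_dvc` is hypothetical, and the remote name `storage` follows the example above.

```shell
#!/bin/sh
# Initialize DVC in a project, register a default remote, and enable
# automatic staging of *.dvc files.
# The body runs in a subshell so that `cd` does not affect the caller.
#   $1: path to the research project (PROJECT_DIR)
#   $2: URL of the remote DVC repository (DVC_REMOTE)
init_dvc() (
    project_dir=$1
    dvc_remote=$2

    cd "$project_dir" || exit 1
    dvc init &&
    git commit -m "Initialize DVC" &&
    dvc remote add -d storage "$dvc_remote" &&
    dvc config core.autostage true
)

# Example call (illustrative values):
#   init_dvc "$HOME/projects/my-research-project" "s3://example-bucket/dvc"
```

The `dvc remote add` and `dvc config` commands modify `.dvc/config`, so that file will show up as changed in Git afterwards.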
Finish setting up the new research project.
Verify the copyright year and owner in the copyright notice. If the project is licensed under Apache License 2.0, the copyright notice is located in the `NOTICE` file. Otherwise, the copyright notice is located in the `LICENSE` file.
Update the base Python package dependencies to the latest available versions.
$ poetry update
Review the Python package dependencies for the project, and modify them as needed using the `poetry` CLI tool. For a quick reference of `poetry` commands, see the Poetry Quick Reference.
Packages that may be useful (but are not included by default):
For instance, to add numpy to the project dependencies, use the command:
$ poetry add numpy
Fill in any empty fields in `pyproject.toml`.
Customize the `README.md` file to reflect the specifics of the project.
If the project was created with Julia support enabled, configure the Julia package dependencies for the project.
julia> ]
(...) pkg> instantiate
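The same instantiation can be run non-interactively from the shell, which is convenient in setup scripts. This sketch assumes that the Julia `Project.toml` sits in the directory passed to the function (the project root by default). The function name `instantiate_julia_deps` is hypothetical.

```shell
#!/bin/sh
# Install the Julia packages recorded for a project without opening the REPL.
#   $1: directory containing Project.toml (defaults to the current directory)
instantiate_julia_deps() {
    # --project selects the Julia environment defined in that directory.
    julia --project="${1:-.}" -e 'using Pkg; Pkg.instantiate()'
}

# Example call (run from the project root):
#   instantiate_julia_deps .
```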
Commit all updated files (e.g., `poetry.lock`, `Project.toml`) to the project Git repository.
From the project GitHub repository, navigate to “Settings” > “Pages” (in the “Code and automation” section of the side menu) and configure GitHub Pages to deploy from the `main` branch.
In the “About” section of the project GitHub repository, set “Website” to the URL for the project GitHub Pages.
That’s it! Every time the `main` branch is updated, GitHub will automatically build project documentation from the `README.md` file (and any linked Markdown files) and publish them to the project GitHub Pages.
When including `numba` as a project dependency, the Python version constraint in `pyproject.toml` needs to be more restrictive than the default `^3.9`. For numba 0.55, the Python version constraint in the `[tool.poetry.dependencies]` section of `pyproject.toml` should be set to:
python = ">=3.9,<3.11"
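To see whether a given interpreter falls inside this narrower range, a small check on the version string is enough. The sketch below hard-codes the `>=3.9,<3.11` constraint from the example above. The function name `python_ok_for_numba` is hypothetical, and the bounds would need to change for other numba releases.

```shell
#!/bin/sh
# Succeed if a Python version string satisfies ">=3.9,<3.11".
#   $1: version string such as "3.10.4"
python_ok_for_numba() {
    version=$1
    major=${version%%.*}   # text before the first dot
    rest=${version#*.}     # text after the first dot
    minor=${rest%%.*}      # minor version number
    [ "$major" -eq 3 ] && [ "$minor" -ge 9 ] && [ "$minor" -lt 11 ]
}

# Example call against the interpreter on PATH:
#   v=$(python3 -c 'import platform; print(platform.python_version())')
#   python_ok_for_numba "$v" && echo "compatible" || echo "not compatible"
```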
The contents of this cookiecutter are covered under the Apache License 2.0 (included in the `LICENSE` file). The copyright for this cookiecutter is contained in the `NOTICE` file.
├── README.md <- this file
├── RELEASE-NOTES.md <- cookiecutter release notes
├── LICENSE <- cookiecutter license
├── NOTICE <- cookiecutter copyright notice
├── cookiecutter.json <- cookiecutter configuration file
├── pyproject.toml <- Python project metadata file for
│ cookiecutter development
├── poetry.lock <- Poetry lockfile for cookiecutter
│ development
├── docs/ <- cookiecutter documentation
├── extras/ <- additional files that may be useful for
│ cookiecutter development
├── hooks/ <- cookiecutter scripts that run before
│ and/or after project generation
├── spikes/ <- experimental code
└── / <- cookiecutter template
See the `[tool.poetry.dependencies]` section of `pyproject.toml`.
Set up a dedicated virtual environment for cookiecutter development. See Step 3 from Section 1.2 for instructions on how to set up `direnv` and `poetry` environments.
Install the Python packages required for development.
$ poetry install
Install the Git pre-commit hooks.
$ pre-commit install
Make the cookiecutter better!
To update the Python dependencies for the template (contained in the `` directory), use the following procedure to ensure that Python package dependencies for developing the non-template components of the cookiecutter (e.g., `hooks/pre_gen_project.py`) do not interfere with Python package dependencies for the template.
Create a local clone of the cookiecutter Git repository to use for cookiecutter development.
$ git clone git@github.com:velexi-research/VLXI-Cookiecutter-Research.git
Use `cookiecutter` from the local cookiecutter Git repository to create an instance of the template to use for updating Python package dependencies.
$ cookiecutter PATH/TO/LOCAL/REPO
In the instance of the template, perform the following steps to update the template’s Python package dependencies.
Set up a virtual environment for developing the template (e.g., a direnv environment).
Use `poetry` or manually edit `pyproject.toml` to (1) make changes to the Python package dependency list and (2) update the versions of Python package dependencies.
Use `poetry` to update the Python package dependencies and versions recorded in the `poetry.lock` file.
Update `/pyproject.toml`.
Copy `pyproject.toml` from the instance of the template to `/pyproject.toml`.
Restore the templated values in the `[tool.poetry]` section to the following:
[tool.poetry]
name = "{{ cookiecutter.__project_name }}"
version = "0.0.0"
description = ""
license = "{% if cookiecutter.license == 'Apache License 2.0' %}Apache-2.0{% elif cookiecutter.license == 'BSD-3-Clause License' %}BSD-3-Clause{% elif cookiecutter.license == 'MIT License' %}MIT{% endif %}"
readme = "README.md"
authors = ["{{ cookiecutter.author }} <{{ cookiecutter.email }}>"]
Update `/poetry.lock`.
Copy `poetry.lock` from the instance of the template to `/poetry.lock`.
Commit the updated `pyproject.toml` and `poetry.lock` files to the Git repository.
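The two copy steps in this procedure can be scripted. In the sketch below, the function name `sync_template_deps` and both directory arguments are hypothetical: the first is the generated instance of the template, the second is the template directory inside the local cookiecutter repository. Restoring the templated `[tool.poetry]` values remains a manual step.

```shell
#!/bin/sh
# Copy the updated dependency files from a template instance back into the
# cookiecutter template directory.
#   $1: directory of the generated template instance
#   $2: template directory in the local cookiecutter repository
sync_template_deps() {
    instance_dir=$1
    template_dir=$2

    cp "$instance_dir/pyproject.toml" "$template_dir/pyproject.toml" &&
    cp "$instance_dir/poetry.lock" "$template_dir/poetry.lock" &&
    echo "Reminder: restore the templated [tool.poetry] values in $template_dir/pyproject.toml"
}

# Example call (illustrative paths):
#   sync_template_deps PATH/TO/INSTANCE PATH/TO/LOCAL/REPO/TEMPLATE_DIR
```

After the copy, `git diff` in the cookiecutter repository shows which lines in the `[tool.poetry]` section need to be reverted to their templated form before committing.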