Added further dvc notes
matthewcoole committed Sep 2, 2024
1 parent bd185c2 commit 75d0117
Showing 1 changed file with 29 additions and 1 deletion.
30 changes: 29 additions & 1 deletion dvc.md
@@ -1,7 +1,18 @@
# DVC Notes
Notes on how to accomplish various operations using DVC. Most of these are just distilled from the [DVC documentation](https://dvc.org/doc).

## Setting up the repository
## Installation
DVC can be installed using `pip`, which provides the basic CLI needed to run DVC commands:
```shell
$ pip install dvc
```
As well as the main package, you will also need to install the extra package for the particular remote that stores the data files tracked by DVC, e.g. to work with DVC data stored in an S3 bucket:
```shell
$ pip install "dvc[s3]"
```
When working with an existing repo, the `requirements.txt` will hopefully already list the appropriate package for its DVC-tracked data.
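One way to check which DVC package an existing repo expects is to search its `requirements.txt` for a `dvc` entry. The sketch below is only illustrative: it writes a throwaway requirements file with made-up contents so that it can run on its own. In a real repo you would point `grep` at the actual file.

```shell
# Throwaway requirements file with made-up contents, so the sketch runs on its own.
req=$(mktemp)
printf 'pandas\ndvc[s3]\n' > "$req"

# Find the dvc requirement, including any remote extra such as [s3].
dvc_req=$(grep -E '^dvc(\[[^]]+\])?' "$req")
echo "$dvc_req"

rm -f "$req"
```

If nothing is printed, the repo does not pin DVC in that file and the remote extra has to be chosen by hand.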

## Setting up a new repository
To set up a git repository to track data files with DVC, run `dvc init` in the repository's directory, then `dvc remote add` to add a remote store for your DVC-tracked data, e.g.
```shell
$ dvc init
```
@@ -11,13 +22,30 @@ This will initialise the repository for use with DVC and setup a `remote` called

You will also see a `.dvc/` directory and a `.dvcignore` file, which you should add to your version-controlled files.

## Cloning a repository
When cloning a repository that is set up with data tracked by dvc, use `dvc pull` to download the data files tracked by dvc from the remote:
```shell
$ git clone git@github.com:NERC-CEH/llm-eval.git
$ cd llm-eval
$ dvc pull
```

## Making changes
To make changes to your data, use `dvc add` on the local file and then `dvc push` to push it to the remote store. It is then important to also commit to git the `.dvc` file that tracks this data file, e.g.
```shell
$ dvc add data.csv
$ dvc push
$ git add data.csv.dvc
$ git commit -m 'Updated data file'
```
`data.csv.dvc` is a placeholder that DVC creates to record which file or folder is being tracked. This placeholder is tracked by git, while the actual data is tracked by DVC.
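To make the placeholder less abstract, here is a rough sketch of what such a file contains and how the tracked path can be read from it. The file is written by hand here, with a made-up hash and size, purely for illustration. Real files are generated by `dvc add`, and the exact fields vary between DVC versions.

```shell
# Hand-written stand-in for data.csv.dvc; the md5 and size values are made up.
placeholder=$(mktemp)
cat > "$placeholder" <<'EOF'
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 1024
  path: data.csv
EOF

# The placeholder is plain YAML, so the tracked path can be read with standard tools.
tracked_path=$(sed -n 's/^ *path: *//p' "$placeholder")
echo "$tracked_path"

rm -f "$placeholder"
```

Because the placeholder only holds a hash and a path, it stays small enough to commit to git however large the data itself is.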

### Moving data from git to dvc
Any files or folders that you add to dvc must not be tracked by git. To switch from tracking a file with git to dvc, first untrack it with git:
```shell
$ git rm --cached data.csv
```
then follow the steps above to add the file(s) to be tracked by dvc.
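The effect of `git rm --cached` can be checked in a scratch repository. This is a minimal sketch, assuming only that `git` is installed. The temporary repo, the file contents and the identity passed to `git commit` are all made up for the demonstration.

```shell
# Scratch repository with one committed file (everything here is throwaway).
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo 'a,b' > data.csv
git add data.csv
git -c user.name='Example' -c user.email='example@example.com' commit --quiet -m 'Add data'

# Stop tracking the file in git without deleting it from disk.
git rm --cached --quiet data.csv

# The file is still present locally but no longer in git's index.
tracked=$(git ls-files data.csv)
[ -f data.csv ] && [ -z "$tracked" ] && echo 'data.csv is untracked but still on disk'
```

From here `dvc add data.csv` can take over tracking. The removal from git still needs to be committed, alongside the new `data.csv.dvc` file.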

> Note: Whilst `dvc` commands seem to somewhat mirror `git` commands, there doesn't seem to be quite the same concept of a staging area. I would suggest that `dvc add` is more like an amalgamation (in DVC) of `git add` + `git commit`.
## Checking out versions