Bigmaze data

Michael Pereira d85f4f4aa2 added line to upgrade repository from v7 to v8 to instructions 2 weeks ago


Directory structure

This repository attempts to follow the relevant parts of the DAGsHub flavor of the directory structure convention proposed by cookiecutter-data-science (https://drivendata.github.io/cookiecutter-data-science/#directory-structure and https://dagshub.com/DAGsHub-Official/Cookiecutter-DVC). The standard will be incrementally implemented.

Directory tree

├── LICENSE
├── Makefile           <- Makefile with commands like `make dirs` or `make clean`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   ├── raw            <- The original, immutable data dump.
│   └── discarded      <- The data that can't be used because of acquisition issues.
│
├── eval.dvc           <- The end of the data pipeline - evaluates the trained model on the test dataset.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── process_data.dvc   <- Process the raw data and prepare it for training.
├── raw_data.dvc       <- Keeps the raw data versioned.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   ├── figures        <- Generated graphics and figures to be used in reporting
│   ├── metrics.txt    <- Relevant metrics after evaluating the model.
│   └── training_metrics.txt    <- Relevant metrics from training the model.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
├── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
└── train.dvc          <- Train a model on the processed data.

What is git-annex and why do we want to use it?

Git is great for versioning and distributing code, but it does not cope well with large files: repositories become slow once a lot of data is checked into them (anything above a couple of GB). In addition, the central hosting services commonly used with git (GitHub, GitLab, etc.) have strict storage limits that large data collections exceed.

Git-annex is an add-on for git that addresses these problems. It offers the following advantages:

  1. Lets you check in very large amounts of data without compromising repository efficiency.
  2. Lets you store the repository metadata (file names, versions, modification dates, etc.) and the actual data separately. That way you can store very large amounts of data in one of the many available data backends.
  3. Lets you download the repository and browse all the available folders and data files without actually downloading the data. The data can then be downloaded on demand, and space can be freed later by deleting unnecessary data files. Before deleting anything, git-annex checks that you will be able to re-download it, so no data gets lost.
  4. All data files are verified against checksums, so data corruption is detected.
  5. Data files can be automatically distributed and backed up across multiple locations by specifying rules for the minimum number of copies necessary for each file, etc.
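
Points 3 to 5 map onto a handful of git-annex commands. The following is a minimal sketch, not a description of how this repository is configured: the function names `set_min_copies`, `free_space` and `verify_checksums` are hypothetical wrappers introduced only for illustration, while `git annex numcopies`, `git annex drop` and `git annex fsck` are the underlying git-annex commands.

```shell
# Hypothetical wrappers around standard git-annex commands.

# Point 5: require at least N copies of every file across all clones.
set_min_copies() {
    git annex numcopies "$1"
}

# Point 3: free local disk space. git-annex refuses to drop a file unless
# enough other copies are known to exist.
free_space() {
    git annex drop "$@"
}

# Point 4: check local file contents against their checksums.
verify_checksums() {
    git annex fsck "$@"
}
```

For example, `set_min_copies 2` followed by `free_space data/raw` would only remove local files that still have two copies elsewhere (the path is a placeholder).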

Install git-annex

The easiest way to install git-annex is with conda. There is also an official package that can be installed with sudo apt-get install git-annex, but it is too outdated.

Using conda

  1. conda create -n git-annex
  2. conda activate git-annex
  3. conda install git git-annex -c conda-forge (installs newest version of git-annex and also git (important) from the conda-forge repository)
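
The three steps can also be collapsed into one command and followed by a version check. This is a sketch under assumptions: `install_git_annex` is a hypothetical helper name, the `-y` flag skips conda's confirmation prompt, and `conda run` requires a reasonably recent conda.

```shell
# Hypothetical helper: create the environment with git and git-annex from
# conda-forge, then print the installed git-annex version as a sanity check.
# $1 is the environment name (defaults to "git-annex").
install_git_annex() {
    env_name="${1:-git-annex}"
    conda create -y -n "$env_name" -c conda-forge git git-annex &&
        conda run -n "$env_name" git annex version
}
```

Calling `install_git_annex` without arguments reproduces the environment name used in the steps above.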

Connecting to the git-annex data repository

Option 1: Install VPN

Ubuntu

Install Cisco Openconnect

sudo apt-get install network-manager-openconnect-gnome

Configure Champalimaud VPN

  1. Go to Settings -> Network and click the "+" button next to VPN
  2. Choose "Cisco AnyConnect Compatible VPN (openconnect)"
  3. Choose a name for the network (e.g. 'FC VPN')
  4. Fill in '62.28.250.194' in the field 'Gateway'
  5. Click 'Add'
  6. In the network dropdown menu in the upper right corner (where you choose the wifi network, etc.), click on the newly created VPN connection
  7. Enter your username (e.g. 'michael.pereira') and the password you use for the FChampalimaud wifi network, then connect

Cloning the git-annex repository

Open a terminal and change to the directory where you want to download the data. The clone command does not download all the data, only an index of the available data files.

  1. git clone https://dagshub.com/michaelfsp/bigmaze.git
  2. cd bigmaze
  3. git config --local include.path ../.gitconfig
  4. git annex init "alice" (alice is an example description for the local repository; its only effect is to help you tell the clones of the repository apart, so feel free to choose one that makes sense to you)
  5. git annex sync
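
The five steps can be chained so that a failure in one step stops the rest. A minimal sketch, assuming the VPN connection is already up; `clone_bigmaze` is a hypothetical function name, and the description argument plays the role of "alice" above.

```shell
# Hypothetical helper: clone the repository and initialise git-annex.
# $1 is the description for this clone (e.g. "alice").
# Note: on success the caller's working directory is the new clone.
clone_bigmaze() {
    description="${1:?usage: clone_bigmaze DESCRIPTION}"
    git clone https://dagshub.com/michaelfsp/bigmaze.git &&
        cd bigmaze &&
        git config --local include.path ../.gitconfig &&
        git annex init "$description" &&
        git annex sync
}
```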

Option 2: Use zerotier instead of VPN

Zerotier is a freemium service and open source application that lets you connect to a group of computers by making them available under local network IP addresses. After installing and setting up zerotier, the computers behave in practice as if they were physically on the same network.

Ubuntu

  1. Execute the following command in the terminal: curl -s https://install.zerotier.com | sudo bash
  2. Ask Michael for zerotier hexadecimal network address ;-)
  3. sudo zerotier-cli join hexadecimalnetworkaddress where hexadecimalnetworkaddress is the hexadecimal code you received in the previous step
  4. Ask Michael to authorize your computer in the zerotier network
  5. Update git-annex to the latest version. If it is installed in a conda environment, activate the environment and run conda update --all
  6. Upgrade git-annex repository: git annex upgrade --backend=BLAKE2B256E
  7. In the repository directory, run git pull
  8. Apply the new configurations with git config --local include.path ../.gitconfig
  9. Perform sync with git annex sync --no-commit --content
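
Before syncing, it can help to confirm that the machine has joined the network and has been authorized (steps 3 and 4). The sketch below assumes that `zerotier-cli listnetworks` prints one line per joined network containing the network address and a status column that reads `OK` once access is granted; `zerotier_network_ok` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if this machine has joined the given
# zerotier network and its status is reported as OK.
# $1 is the hexadecimal network address.
zerotier_network_ok() {
    network_id="$1"
    sudo zerotier-cli listnetworks | grep -F "$network_id" | grep -q ' OK '
}
```

A possible use is `zerotier_network_ok hexadecimalnetworkaddress && git annex sync --no-commit --content`, which only starts the sync once the network is reachable.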

Option 3: Connect using tor (currently not working)

git-annex lets two clients connect to each other over the tor network. This is handy because it lets clients connect even when they have changed IP address or can't get access to the target machine's local network.

Install tor

Ubuntu

Up-to-date information on how to install tor on Ubuntu can be found at https://support.torproject.org/apt/

Ubuntu 18.04 (Bionic Beaver)
  1. sudo apt install apt-transport-https curl
  2. sudo sh -c 'echo "deb https://deb.torproject.org/torproject.org/ bionic main" >> /etc/apt/sources.list.d/tor.list'
  3. sudo sh -c 'echo "deb-src https://deb.torproject.org/torproject.org/ bionic main" >> /etc/apt/sources.list.d/tor.list'
  4. curl https://deb.torproject.org/torproject.org/A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89.asc | gpg --import
  5. gpg --export A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89 | sudo apt-key add -
  6. sudo apt update
  7. sudo apt install tor deb.torproject.org-keyring

Working with the git-annex data repository

  • to update the repository to reflect the latest changes: git annex sync
  • to get a data file: git annex get /path/to/the/datafile/
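
A few related commands are useful in day-to-day work. This is a sketch with hypothetical wrapper names (`fetch_data`, `where_is`, `list_local`); the git-annex subcommands `get`, `whereis` and `find` are standard, and any paths passed in are placeholders.

```shell
# Hypothetical wrappers around standard git-annex commands.

# Download the content of one or more files, or of a whole directory.
fetch_data() {
    git annex get "$@"
}

# Show which repositories hold a copy of the given files.
where_is() {
    git annex whereis "$@"
}

# List the files whose content is present in this clone.
list_local() {
    git annex find --in=here "$@"
}
```

Passing a directory to `fetch_data` downloads every annexed file below it, which is usually more convenient than fetching files one by one.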

Data Analysis

Requirements

The paths to MazeX and pyControl/tools must be added to the $PYTHONPATH environment variable. The most appropriate way is to add small bash scripts to the directories whose scripts are executed automatically when a conda environment is activated and deactivated. These directories are $ENVIRONMENT_PATH/etc/conda/activate.d and $ENVIRONMENT_PATH/etc/conda/deactivate.d respectively.

One can create a file with name pythonpath.sh in the activate.d directory with the following content:

export PREV_PYTHONPATH="$PYTHONPATH"
export PYTHONPATH="/home/michael/code/mousemaze:/home/michael/src/pyControl/tools:$PYTHONPATH"

and a file with name pythonpath.sh in the deactivate.d directory with the following content:

export PYTHONPATH="$PREV_PYTHONPATH"
unset PREV_PYTHONPATH
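
Both files can be written in one go. The following is a minimal sketch, not part of the repository: `install_pythonpath_hooks` is a hypothetical helper, its first argument stands in for $ENVIRONMENT_PATH, and its second argument is the colon-separated list of directories to prepend.

```shell
# Hypothetical helper: write the activate/deactivate hooks for one environment.
# $1 is the environment prefix, $2 the colon-separated list of directories
# to prepend to PYTHONPATH.
install_pythonpath_hooks() {
    env_prefix="${1:?environment prefix required}"
    extra_paths="${2:?path list required}"
    mkdir -p "$env_prefix/etc/conda/activate.d" \
             "$env_prefix/etc/conda/deactivate.d" || return 1

    # Sourced on activation: remember the old value, then prepend the paths.
    cat > "$env_prefix/etc/conda/activate.d/pythonpath.sh" <<EOF
export PREV_PYTHONPATH="\$PYTHONPATH"
export PYTHONPATH="$extra_paths:\$PYTHONPATH"
EOF

    # Sourced on deactivation: restore the old value.
    cat > "$env_prefix/etc/conda/deactivate.d/pythonpath.sh" <<'EOF'
export PYTHONPATH="$PREV_PYTHONPATH"
unset PREV_PYTHONPATH
EOF
}
```

With the environment active, a call such as `install_pythonpath_hooks "$CONDA_PREFIX" "/home/michael/code/mousemaze:/home/michael/src/pyControl/tools"` would recreate the two files described above; the hooks take effect the next time the environment is activated.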