Bigmaze data


Directory structure

This repository follows, where relevant, the DAGsHub flavor (https://dagshub.com/DAGsHub-Official/Cookiecutter-DVC) of the directory structure convention proposed by cookiecutter-data-science (https://drivendata.github.io/cookiecutter-data-science/#directory-structure). The convention is being implemented incrementally.

Directory tree

├── LICENSE
├── Makefile           <- Makefile with commands like `make dirs` or `make clean`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   ├── raw            <- The original, immutable data dump.
│   └── discarded      <- The data that can't be used because of acquisition issues.
│
├── eval.dvc           <- The end of the data pipeline - evaluates the trained model on the test dataset.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── process_data.dvc   <- Process the raw data and prepare it for training.
├── raw_data.dvc       <- Keeps the raw data versioned.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   ├── figures        <- Generated graphics and figures to be used in reporting
│   ├── metrics.txt    <- Relevant metrics after evaluating the model.
│   └── training_metrics.txt    <- Relevant metrics from training the model.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
├── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
└── train.dvc          <- Trains a model on the processed data.
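
The tree mentions `make dirs` without showing what that target does. As a minimal sketch, assuming the target only needs to create the empty folders listed above, the skeleton could be created as follows. `PROJECT_ROOT` is a hypothetical variable introduced for this example; it defaults to a temporary directory so that running the snippet does not touch an existing checkout.

```shell
#!/bin/sh
# Sketch only: create the folder skeleton from the directory tree.
# PROJECT_ROOT is a hypothetical variable, not something the repository defines.
PROJECT_ROOT="${PROJECT_ROOT:-$(mktemp -d)}"

for dir in \
    data/raw data/interim data/processed data/discarded \
    models notebooks references reports/figures \
    src/data src/models src/visualization
do
    # -p creates missing parents and does nothing for existing directories
    mkdir -p "$PROJECT_ROOT/$dir"
done

echo "Created project skeleton in $PROJECT_ROOT"
```

To apply it to a real checkout, set `PROJECT_ROOT` to the repository root before running the snippet.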

What is git-annex and why do we want to use it?

Git is great for versioning and distributing code, but it does not handle large files well: repositories become slow once a lot of data (more than a couple of GB) is checked into them. In addition, the hosting services commonly used for git (GitHub, GitLab, etc.) enforce storage limits that are too strict for large data collections.

Git-annex is an add-on for git that addresses these problems. It offers the following advantages:

  1. Lets you check in very large amounts of data without compromising repository efficiency.
  2. Lets you store the repository metadata (file names, versions, modification dates, etc.) and the actual data separately. That way you can store very large amounts of data in one of the many available data backends.
  3. Lets you download the repository and browse all the available folders and data files without actually downloading the data. The data can then be downloaded on demand, and space can be freed later by deleting unnecessary data files. Before deleting anything, git-annex checks that you will be able to re-download it, so no data gets lost.
  4. All data files are verified against checksums, so data corruption is detected.
  5. Data files can be automatically distributed and backed up across multiple locations by specifying rules such as the minimum number of copies required for each file.
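
Points 4 and 5 correspond to two git-annex commands, `git annex fsck` and `git annex numcopies`. The sketch below wraps them in small shell functions. The function names and the value 2 in the usage comment are illustrative assumptions, not settings taken from this repository.

```shell
#!/bin/sh
# Sketch only: the helper names are hypothetical.

# Set the minimum number of copies git-annex tries to keep of each file.
require_copies() {
    git annex numcopies "$1"
}

# Re-verify the checksums of the locally present content under a path.
verify_checksums() {
    git annex fsck "$1"
}

# Example usage inside an initialised git-annex repository:
#   require_copies 2
#   verify_checksums data/raw
```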

Install git-annex

The easiest way to install git-annex is with conda. There is also an official package that can be installed with sudo apt-get install git-annex, but it is too outdated.

Using conda

  1. conda create -n git-annex
  2. conda activate git-annex
  3. conda install git git-annex -c conda-forge (installs the newest version of git-annex and, importantly, also of git from the conda-forge channel)

Connecting to the git-annex data repository

Option 1: Install VPN

Ubuntu

Install OpenConnect (a VPN client compatible with Cisco AnyConnect)

sudo apt-get install network-manager-openconnect-gnome

Configure Champalimaud VPN

  1. Go to Settings -> Network and click the "+" button next to VPN
  2. Choose "Cisco AnyConnect Compatible VPN (openconnect)"
  3. Choose a name for the network (e.g. 'FC VPN')
  4. Fill in '62.28.250.194' in the field 'Gateway'
  5. Click 'Add'
  6. In the dropdown menu for Network configuration on the upper right corner (where you can choose wifi network, etc.) click on the newly created VPN connection
  7. Fill in the username and password with your username (e.g. 'michael.pereira') and the same password you use for the FChampalimaud wifi network and connect

Cloning the git-annex repository

Open a terminal and change to the directory where you want to download the data. The commands below do not download all the data, only an index of the available data files.

  1. git clone https://dagshub.com/michaelfsp/bigmaze.git
  2. cd bigmaze
  3. git config --local include.path ../.gitconfig
  4. git annex init "alice" ("alice" is an example description for the local repository. Its only effect is to help you tell the clones of the repository apart, so choose any description that makes sense to you.)
  5. git annex sync
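
Step 3 makes git read the `.gitconfig` file that is tracked in the repository. A relative `include.path` is resolved against the file that contains it (here `.git/config`), which is why `../.gitconfig` points at the repository root. The sketch below shows this in a throwaway repository. The `demo.greeting` key is made up for the illustration and is not a setting used by bigmaze.

```shell
#!/bin/sh
# Sketch only: demonstrate include.path in a scratch repository.
demo_repo="$(mktemp -d)"
git init -q "$demo_repo"

# A config file at the repository root (hypothetical content).
printf '[demo]\n\tgreeting = hello\n' > "$demo_repo/.gitconfig"

# Same command as step 3; the path is relative to .git/config.
git -C "$demo_repo" config --local include.path ../.gitconfig

# The included setting is now visible when git reads its configuration.
git -C "$demo_repo" config --get demo.greeting
```

Note that the last command is run without `--local`: when a specific configuration file is selected, git does not follow include directives unless `--includes` is passed.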

Option 2: Connect using tor (currently not working)

git-annex offers the possibility of two clients connecting to each other using the tor network. This is handy because it lets clients connect even when they have changed IP address or can't get access to the target machine's local network.

Install tor

Ubuntu

Up-to-date information on how to install tor on Ubuntu can be found at https://support.torproject.org/apt/

Ubuntu 18.04 (Bionic Beaver)
  1. sudo apt install apt-transport-https curl
  2. sudo sh -c 'echo "deb https://deb.torproject.org/torproject.org/ bionic main" >> /etc/apt/sources.list.d/tor.list'
  3. sudo sh -c 'echo "deb-src https://deb.torproject.org/torproject.org/ bionic main" >> /etc/apt/sources.list.d/tor.list'
  4. curl https://deb.torproject.org/torproject.org/A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89.asc | gpg --import
  5. gpg --export A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89 | sudo apt-key add -
  6. sudo apt update
  7. sudo apt install tor deb.torproject.org-keyring

Working with the git-annex data repository

  • to update the repository to reflect latest changes: git annex sync
  • to get a data file: git annex get /path/to/the/datafile/
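
A typical session fetches a file, works with it, and later frees the disk space again. The sketch below wraps the relevant git-annex commands in shell functions. The function names are hypothetical, and the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch only: the helper names are hypothetical.

# Download the content of a file (or of every file under a directory).
fetch_data() {
    git annex get "$1"
}

# List the repositories that are known to hold a copy of the file.
locate_data() {
    git annex whereis "$1"
}

# Remove the local copy; git-annex refuses if it cannot confirm
# that enough other copies exist.
release_data() {
    git annex drop "$1"
}

# Example usage inside the cloned repository:
#   fetch_data /path/to/the/datafile
#   locate_data /path/to/the/datafile
#   release_data /path/to/the/datafile
```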

Data Analysis

Requirements

The paths to MazeX and pyControl/tools must be added to the $PYTHONPATH environment variable. The most appropriate way is to add small bash scripts to the directories whose scripts conda runs automatically when an environment is activated or deactivated: $ENVIRONMENT_PATH/etc/conda/activate.d and $ENVIRONMENT_PATH/etc/conda/deactivate.d, respectively.

One can create a file with name pythonpath.sh in the activate.d directory with the following content:

export PREV_PYTHONPATH="$PYTHONPATH"
export PYTHONPATH="/home/michael/code/mousemaze:/home/michael/src/pyControl/tools:$PYTHONPATH"

and a file with name pythonpath.sh in the deactivate.d directory with the following content:

export PYTHONPATH="$PREV_PYTHONPATH"
unset PREV_PYTHONPATH
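
Both hook files can also be written by a short script. The following is a sketch under a few assumptions: `ENVIRONMENT_PATH` should be set to the prefix of the conda environment (conda exports it as `$CONDA_PREFIX` while the environment is active) and falls back to a temporary directory here so the snippet can be tried safely. `MAZEX_PATH` and `PYCONTROL_TOOLS_PATH` are hypothetical variables whose defaults are placeholders for your own checkouts. The deactivate hook also clears the helper variable `PREV_PYTHONPATH`.

```shell
#!/bin/sh
# Sketch only: write conda activate/deactivate hooks that extend PYTHONPATH.
# Falls back to a temporary directory when ENVIRONMENT_PATH is not set.
ENVIRONMENT_PATH="${ENVIRONMENT_PATH:-$(mktemp -d)}"

# Hypothetical locations; replace them with your own checkouts.
MAZEX_PATH="${MAZEX_PATH:-$HOME/code/mousemaze}"
PYCONTROL_TOOLS_PATH="${PYCONTROL_TOOLS_PATH:-$HOME/src/pyControl/tools}"

mkdir -p "$ENVIRONMENT_PATH/etc/conda/activate.d" \
         "$ENVIRONMENT_PATH/etc/conda/deactivate.d"

# Unquoted heredoc: the two paths are expanded now, while the escaped
# \$ references are written literally and expanded at activation time.
cat > "$ENVIRONMENT_PATH/etc/conda/activate.d/pythonpath.sh" <<EOF
export PREV_PYTHONPATH="\${PYTHONPATH:-}"
export PYTHONPATH="$MAZEX_PATH:$PYCONTROL_TOOLS_PATH\${PYTHONPATH:+:\$PYTHONPATH}"
EOF

# Quoted heredoc: everything is written literally.
cat > "$ENVIRONMENT_PATH/etc/conda/deactivate.d/pythonpath.sh" <<'EOF'
export PYTHONPATH="$PREV_PYTHONPATH"
unset PREV_PYTHONPATH
EOF
```

The `${PYTHONPATH:+:$PYTHONPATH}` form appends the previous value only when it is non-empty, which avoids a trailing colon in `PYTHONPATH`.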