Jakub Narębski 410c8b7733
README.md

How I Learned to Stop Worrying and Love the ChatGPT

Replication package for MSR'24 Mining Challenge

https://2024.msrconf.org/track/msr-2024-mining-challenge

First time setup

You can set up the environment for this project, following the recommended practices (described later in this document), by running the init.bash Bash script and following its instructions.

Note that this script assumes it is run on Linux or a Linux-like system. On other operating systems, you are probably better off following the steps described in this document manually.

Virtual environment

To avoid dependency conflicts, it is strongly recommended to create a virtual environment, for example with:

python3 -m venv venv

This needs to be done only once, from the top directory of the project. For each session, you should activate the environment:

source venv/bin/activate

This makes the command-line prompt include "(venv) " as a prefix, though this depends on the shell used.

Using a virtual environment, either directly as shown above or via pipx, might be required if you cannot install system packages and Python is configured as an externally managed environment. In that case, installing packages with pip outside a virtual environment fails with:

error: externally-managed-environment

× This environment is externally managed

Installing dependencies

You can install the dependencies defined in the requirements.txt file with pip using the following command:

python -m pip install -r requirements.txt

Note: the above assumes that you have activated the virtual environment (venv).
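
The activation requirement can be checked before installing anything. The following is a minimal sketch, not part of the project: in_venv and install_requirements are hypothetical helper names, and the check relies on the VIRTUAL_ENV variable that the venv activate script sets.

```shell
# Sketch: refuse to install dependencies unless a virtual environment is active.
# in_venv and install_requirements are hypothetical helpers, not project code.

# Succeeds (exit status 0) when VIRTUAL_ENV is set and non-empty,
# which is what "source venv/bin/activate" arranges.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Installs from the given requirements file, but only inside a venv.
install_requirements() {
    local req_file="${1:-requirements.txt}"
    if ! in_venv; then
        echo "error: no active virtual environment; run 'source venv/bin/activate' first" >&2
        return 1
    fi
    python -m pip install -r "$req_file"
}

# Usage, from the top directory of the project:
#   install_requirements requirements.txt
```

With this guard, running the install step in a fresh shell fails early with a hint instead of attempting a system-wide installation.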

Running with DVC

You can re-run the whole computation pipeline with dvc repro, or at least those parts that were made to use the DVC (Data Version Control) tool.

You can also run experiments with dvc exp run.
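
As an illustration of the dvc repro step, the sketch below first checks that the dvc command is available, then shows which stages are out of date before reproducing them. The name reproduce_pipeline is a hypothetical helper, not something defined by this project.

```shell
# Sketch: reproduce the pipeline only when the dvc command is available.
# reproduce_pipeline is a hypothetical helper, not project code.
reproduce_pipeline() {
    if ! command -v dvc >/dev/null 2>&1; then
        echo "error: dvc command not found" >&2
        return 127
    fi
    dvc status    # report which stages are out of date
    dvc repro     # re-run the stages whose dependencies changed
}

# Usage, from the top directory of the project:
#   reproduce_pipeline
```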

Configuring local DVC cache (optional)

Because the initial external DevGPT dataset is quite large (650 MB as a *.zip file, and 3.9 GB when uncompressed into a directory), you might want to store the DVC cache outside the repository (by default it is kept in .dvc/cache).

You can do that with the dvc cache dir command:

dvc cache dir --local /mnt/data/username/.dvc/cache

where you need to replace username with your login (on Linux you can find it with the whoami command).
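
The username substitution can also be done by the shell. This is a small sketch under the assumption that the /mnt/data/username/.dvc/cache layout from the example above is the one you want; dvc_cache_path is a hypothetical helper name.

```shell
# Sketch: build the cache path from a login name instead of typing it by hand.
# dvc_cache_path is a hypothetical helper; /mnt/data is the example location
# used above and may differ on your system.
dvc_cache_path() {
    # Use the first argument if given, otherwise the current login.
    local user="${1:-$(whoami)}"
    printf '/mnt/data/%s/.dvc/cache\n' "$user"
}

# Usage, from the top directory of the project:
#   dvc cache dir --local "$(dvc_cache_path)"
```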

Configuring local DVC storage

To avoid recomputing results, which takes time, you can configure a local DVC remote storage, for example:

cat <<EOF >>.dvc/config.local
[core]
    remote = local
['remote "local"']
    url = /mnt/data/dvcstore
EOF

You will then be able to download computed data with dvc pull, and upload your results for others in the team with dvc push. This assumes that everyone has access to /mnt/data/dvcstore, either because you all work on the same host (perhaps remotely), or because it is network storage available to everyone in the team.
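
Because the cat command above appends to .dvc/config.local, running it twice leaves duplicate sections in the file. A possible guard is sketched below; add_local_remote is a hypothetical helper name, and the default path and URL are simply the example values from above.

```shell
# Sketch: append the "local" remote section only if it is not present yet.
# add_local_remote is a hypothetical helper, not project code.
add_local_remote() {
    local config_file="${1:-.dvc/config.local}"
    local url="${2:-/mnt/data/dvcstore}"
    # Skip the append when the remote section already exists in the file.
    if [ -f "$config_file" ] && grep -qF "['remote \"local\"']" "$config_file"; then
        echo "remote 'local' already configured in $config_file" >&2
        return 0
    fi
    cat <<EOF >>"$config_file"
[core]
    remote = local
['remote "local"']
    url = $url
EOF
}

# Usage, from the top directory of the project:
#   add_local_remote .dvc/config.local /mnt/data/dvcstore
```

Like the original command, this sketch does not merge with a [core] section that is already in the file; it only avoids adding the remote twice.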

About

Code for the MSR'24 Mining Challenge (https://2024.msrconf.org/track/msr-2024-mining-challenge) paper "How I Learned to Stop Worrying and Love ChatGPT"

https://2024.msrconf.org/details/msr-2024-mining-challenge/6