BOHR
----------------------------------

Big Old Heuristic Repository

.. contents:: **Contents**
   :backlinks: none

Getting started with BOHR
===========================================

Python >= 3.8 is required; using a virtual environment is recommended.

#. Run ``git clone https://github.com/giganticode/bohr && cd bohr``
#. Install the BOHR framework library: ``chmod +x bin/setup-bohr.sh && bin/setup-bohr.sh``. This will install `bohr-framework <https://github.com/giganticode/bohr-framework>`_ together with the dependencies and tools needed to run heuristics.
Downloading datasets and models
===============================

#. Run ``bohr repro``

BOHR extensively uses `DVC (Data Version Control) <https://dvc.org/>`_ to ensure versioning and reproducibility of the datasets and models.
Contributing to BOHR:
=====================

1. Heuristics:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Heuristics can be found in ``.py`` files in the ``bohr/heuristics`` directory and are marked with the ``@Heuristic`` decorator. Example:

.. code-block:: python

   @Heuristic(Commit)
   def bugless_if_many_files_changes(commit: Commit) -> Optional[Labels]:
       if len(commit.files) > 6:
           return CommitLabel.NonBugFix
       else:
           return None
Important things to note:

#. Any function becomes a heuristic once it is marked with the ``@Heuristic`` decorator;
#. The artifact type is passed to the heuristic decorator as a parameter; the function accepts an object of that artifact type;
#. The function name can be arbitrary as long as it is unique and descriptive;
#. The function should return ``label`` if a datapoint should be labeled with ``label``, and ``None`` if the labeling function should abstain on the datapoint.
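The label-or-abstain contract from the last point can be illustrated with a plain function, stripped of the ``@Heuristic`` machinery. The keyword and the string labels below are illustrative, not taken from the repository:

```python
from typing import Optional

# Sketch only: in BOHR the function would be wrapped with @Heuristic(Commit)
# and would receive a Commit artifact rather than a raw string.
def bugfix_if_fix_keyword(message: str) -> Optional[str]:
    if "fix" in message.lower():
        return "BugFix"  # label the datapoint
    return None          # abstain on the datapoint
```

A heuristic that abstains often is still useful: the label model combines many such weak signals, so returning ``None`` is preferable to guessing.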
Please refer to the `documentation <https://giganticode.github.io/bohr/Heuristics.html>`_ for more information on heuristics and special heuristic types.
2. New tasks:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tasks are defined in the ``bohr.json`` file. Below you can see an example of the "bugginess" task:

.. code-block:: json

   "bugginess": {
       "top_artifact": "bohr.artifacts.commit.Commit",
       "label_categories": [
           "CommitLabel.NonBugFix",
           "CommitLabel.BugFix"
       ],
       "test_datasets": [
           "datasets.1151-commits",
           "datasets.berger",
           "datasets.herzig"
       ],
       "train_datasets": [
           "datasets.bugginess-train"
       ],
       "label_column_name": "bug"
   }
The name of the task is the key in the dictionary. The value is an object with the following fields:

#. **Top artifact** - the artifact to be categorized. In the case of the "bugginess" task, commits are classified, therefore the top artifact is ``bohr.artifacts.commit.Commit``;
#. **Label categories** - the categories the artifact is to be classified as; for the "bugginess" task these are *CommitLabel.BugFix* and *CommitLabel.NonBugFix*. The values have to be taken from the ``labels.py`` file. See section `3. Labels:`_ for more information about labels in BOHR and how to extend the label hierarchy;
#. **Training sets** - datasets used to train a label model;
#. **Test sets** - datasets to calculate metrics on.
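As a cross-check of the schema above, a task entry can be validated before committing it. The helper below is a hypothetical sketch (not part of BOHR); only the field names are taken from the example:

```python
# Required fields of a task definition, as listed in the example above.
REQUIRED_FIELDS = {
    "top_artifact",
    "label_categories",
    "test_datasets",
    "train_datasets",
    "label_column_name",
}

def missing_fields(task: dict) -> set:
    """Return the required fields that are absent from a task definition."""
    return REQUIRED_FIELDS - task.keys()
```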
3. Labels:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Labels that are used to label artifacts in BOHR are pre-defined and can be reused across multiple tasks. E.g., the ``Commit.Refactoring`` label can be used in heuristics for the task of detecting refactorings, but also in the task of detecting bug-fixing commits. Moreover, labels are organized in a hierarchy, e.g. ``Commit.FileRenaming`` can be a child of ``Commit.Refactoring``. Formally speaking, there is a binary relation IS-A defined on the set of labels, which defines their partial order, e.g. ``IS-A(Commit.FileRenaming, Commit.Refactoring)``.

Labels are defined in text files in the ``bohr/labels`` dir. Each row has the format ``<parent>: <list of children>``. Running ``bohr parse-labels`` will generate the ``labels.py`` file in the root of the repository. Thus, to extend the hierarchy of labels, it is sufficient to make a change to a text file; ``labels.py`` will be regenerated once the PR is received.
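The row format can be sketched with a small parser. This is a simplified stand-in for what ``bohr parse-labels`` does, not its actual implementation:

```python
from typing import Dict, List

def parse_label_hierarchy(text: str) -> Dict[str, List[str]]:
    """Map each parent label to its list of child labels."""
    hierarchy = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank rows
        parent, children = line.split(":", 1)
        hierarchy[parent.strip()] = [c.strip() for c in children.split(",")]
    return hierarchy

# A row in the <parent>: <list of children> format, using labels from the text above
hierarchy = parse_label_hierarchy("Commit.Refactoring: Commit.FileRenaming")
```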
4. Datasets:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A dataset is added by creating a dataset file in the ``datasets`` folder. The name of the file corresponds to the name of the dataset, e.g.:

*datasets/1151-commits.py*:

.. code-block:: python

   from pathlib import Path

   from bohr.templates.dataloaders.from_csv import CsvDatasetLoader
   from bohr.templates.datamappers.commit import CommitMapper

   dataset_loader = CsvDatasetLoader(
       path_to_file="data/bugginess/test/1151-commits.csv",
       mapper=CommitMapper(Path(__file__).parent.parent),
       test_set=True,
   )

   __all__ = [dataset_loader]

In this file, an instance of the ``CsvDatasetLoader`` class is created and added to the ``__all__`` list (important!).
A dataloader can also be an instance of a custom ``DatasetLoader`` implementing the following interface:

.. code-block:: python

   @dataclass
   class DatasetLoader(ABC):
       test_set: bool
       mapper: ArtifactMapper

       @abstractmethod
       def load(self, project_root: Path) -> DataFrame:
           pass

       @abstractmethod
       def get_paths(self, project_root: Path) -> List[Path]:
           pass
The *ArtifactMapper* object that has to be passed to the ``DatasetLoader`` defines how each datapoint is mapped to an artifact object and has to implement the following interface:

.. code-block:: python

   class ArtifactMapper(BaseMapper, ABC):
       @abstractmethod
       def __call__(self, x: DataPoint) -> Artifact:
           pass

       @abstractmethod
       def get_artifact(self) -> Type[Artifact]:
           pass

``bohr.templates.datamappers`` in the bohr-framework library provides some predefined mappers.
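To make the mapper contract concrete, here is a self-contained sketch. The ``Commit`` and ``ArtifactMapper`` classes below are simplified stand-ins for the real ones in bohr-framework, and the datapoint is modeled as a plain dict:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Type


@dataclass
class Commit:
    """Simplified stand-in for the Commit artifact."""
    message: str


class ArtifactMapper(ABC):
    """Simplified stand-in for the bohr-framework base class."""

    @abstractmethod
    def __call__(self, x: dict) -> Commit: ...

    @abstractmethod
    def get_artifact(self) -> Type: ...


class SimpleCommitMapper(ArtifactMapper):
    """Maps a raw datapoint (here: a dict row) to a Commit artifact."""

    def __call__(self, x: dict) -> Commit:
        return Commit(message=x["message"])

    def get_artifact(self) -> Type:
        return Commit
```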
5. Artifact definitions:
~~~~~~~~~~~~~~~~~~~~~~~~

``bohr.templates.artifacts`` also provides some pre-defined artifacts.

Contribute to the framework:
=============================

To contribute to the framework, please refer to the documentation in the `bohr-framework <https://github.com/giganticode/bohr-framework>`_ repo.
Pre-prints and publications
===========================================

.. code-block::

   @misc{babii2021mining,
       title={Mining Software Repositories with a Collaborative Heuristic Repository},
       author={Hlib Babii and Julian Aron Prenner and Laurin Stricker and Anjan Karmakar and Andrea Janes and Romain Robbes},
       year={2021},
       eprint={2103.01722},
       archivePrefix={arXiv},
       primaryClass={cs.SE}
   }