dvc.yaml 3.8 KB

stages:
  make_dataset:
    desc: Download data from Kaggle, create data dictionary and summary table
    cmd: python3 src/data/make_dataset.py -c titanic -tr train.csv -te test.csv -o
      ./data/raw
    deps:
    - src/data/make_dataset.py
    params:
    - dtypes
    outs:
    - data/raw/test.csv
    - data/raw/train.csv
    - reports/figures/data_dictionary.tex
    - reports/figures/table_one.tex
  encode_labels:
    desc: Convert categorical labels to integer values and save mapping
    cmd: python3 src/data/encode_labels.py -tr data/raw/train.csv -te data/raw/test.csv
      -o data/interim
    deps:
    - data/raw/test.csv
    - data/raw/train.csv
    - src/data/encode_labels.py
    params:
    - dtypes
    outs:
    - data/interim/label_encoding.yaml
    - data/interim/test_categorized.csv
    - data/interim/train_categorized.csv
  impute_nan:
    desc: Replace missing values for age with mean values from training dataset.
    cmd: python3 src/data/replace_nan.py -tr data/interim/train_categorized.csv -te
      data/interim/test_categorized.csv -o data/interim
    deps:
    - data/interim/test_categorized.csv
    - data/interim/train_categorized.csv
    - src/data/replace_nan.py
    params:
    - imputation
    outs:
    - data/interim/test_nan_imputed.csv
    - data/interim/train_nan_imputed.csv
  build_features:
    desc: Optional feature engineering and dimensionality reduction
    cmd: python3 src/features/build_features.py -tr data/interim/train_nan_imputed.csv
      -te data/interim/test_nan_imputed.csv -o data/interim/
    deps:
    - data/interim/test_nan_imputed.csv
    - data/interim/train_nan_imputed.csv
    - src/features/build_features.py
    params:
    - feature_eng
    - random_seed
    outs:
    - data/interim/test_featurized.csv
    - data/interim/train_featurized.csv
  normalize_data:
    desc: Optionally normalize features by fitting transforms on the training dataset.
    cmd: python3 src/features/normalize.py -tr data/interim/train_featurized.csv -te
      data/interim/test_featurized.csv -o data/processed/
    deps:
    - data/interim/test_featurized.csv
    - data/interim/train_featurized.csv
    - src/features/normalize.py
    params:
    - normalize
    outs:
    - data/processed/test_processed.csv
    - data/processed/train_processed.csv
  split_train_dev:
    desc: Split training data into the train and dev sets using stratified K-fold
      cross validation.
    cmd: python3 src/data/split_train_dev.py -tr data/processed/train_processed.csv
      -o data/processed/
    deps:
    - data/processed/train_processed.csv
    - src/data/split_train_dev.py
    params:
    - random_seed
    - train_test_split
    outs:
    - data/processed/split_train_dev.csv
  train_model:
    desc: Train the specified classifier using the pre-allocated stratified K-fold
      cross validation splits and the current params.yaml settings. Track metrics
      with Git
    cmd: python3 src/models/train_model.py -tr data/processed/train_processed.csv
      -cv data/processed/split_train_dev.csv
    deps:
    - data/processed/split_train_dev.csv
    - data/processed/train_processed.csv
    - src/models/train_model.py
    params:
    - classifier
    - model_params
    - random_seed
    - train_test_split.target_class
    outs:
    - models/estimator.pkl
    metrics:
    - results/metrics.json:
        cache: false
  predict_output:
    desc: Predict output on held-out test set for submission to Kaggle.
    cmd: python3 src/models/predict.py -te data/processed/test_processed.csv -rd results/
      -md models/
    deps:
    - data/processed/test_processed.csv
    - models/estimator.pkl
    - src/models/metrics.py
    - src/models/predict.py
    params:
    - predict
    - train_test_split.target_class
    outs:
    - results/test_predict_binary.csv
    - results/test_predict_proba.csv
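
The stages above do not declare an explicit order. DVC derives one by matching each stage's deps against the outs of the other stages. The following is a minimal sketch of that matching, not DVC's own implementation. The names STAGES and stage_order are hypothetical, the deps and outs are copied by hand from the file above (script dependencies and report outputs are left out because no other stage consumes them), and it assumes Python 3.9+ for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical hand-copied view of the pipeline: stage -> (data deps, data outs).
# Script deps (src/...) and report outputs are omitted; nothing downstream reads them.
STAGES = {
    "make_dataset": ([], ["data/raw/test.csv", "data/raw/train.csv"]),
    "encode_labels": (
        ["data/raw/test.csv", "data/raw/train.csv"],
        ["data/interim/test_categorized.csv", "data/interim/train_categorized.csv"],
    ),
    "impute_nan": (
        ["data/interim/test_categorized.csv", "data/interim/train_categorized.csv"],
        ["data/interim/test_nan_imputed.csv", "data/interim/train_nan_imputed.csv"],
    ),
    "build_features": (
        ["data/interim/test_nan_imputed.csv", "data/interim/train_nan_imputed.csv"],
        ["data/interim/test_featurized.csv", "data/interim/train_featurized.csv"],
    ),
    "normalize_data": (
        ["data/interim/test_featurized.csv", "data/interim/train_featurized.csv"],
        ["data/processed/test_processed.csv", "data/processed/train_processed.csv"],
    ),
    "split_train_dev": (
        ["data/processed/train_processed.csv"],
        ["data/processed/split_train_dev.csv"],
    ),
    "train_model": (
        ["data/processed/split_train_dev.csv", "data/processed/train_processed.csv"],
        ["models/estimator.pkl"],
    ),
    "predict_output": (
        ["data/processed/test_processed.csv", "models/estimator.pkl"],
        ["results/test_predict_binary.csv", "results/test_predict_proba.csv"],
    ),
}


def stage_order(stages):
    """Return stage names in an order where every producer precedes its consumers."""
    # Map each output path to the stage that writes it.
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    # A stage depends on whichever stages produce the files it reads.
    graph = {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for position, name in enumerate(stage_order(STAGES), start=1):
        print(position, name)
```

Because each stage here reads at least one file written by the stage before it, only one valid order exists, and it matches the order in which the stages are listed in the file.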