# skope_rules.py

'''
Skope-rules aims at learning logical, interpretable rules for "scoping" a target
class, i.e. detecting with high precision instances of this class.

Skope-rules is a trade-off between the interpretability of a Decision Tree
and the modeling power of a Random Forest. Code adapted with only minor changes
from [here](https://github.com/scikit-learn-contrib/skope-rules). Full credit to
the authors. You can access the original project and docs `here <http://skope-rules.readthedocs.io/en/latest/>`_

Example
-------
from sklearn.datasets import load_boston
from sklearn.metrics import precision_recall_curve
from matplotlib import pyplot as plt
from skrules import SkopeRulesClassifier

# note: load_boston was removed in scikit-learn 1.2, so this example requires an older version
dataset = load_boston()
clf = SkopeRulesClassifier(max_depth_duplication=None,
                           n_estimators=30,
                           precision_min=0.2,
                           recall_min=0.01,
                           feature_names=dataset.feature_names)
X, y = dataset.data, dataset.target > 25
X_train, y_train = X[:len(y)//2], y[:len(y)//2]
X_test, y_test = X[len(y)//2:], y[len(y)//2:]
clf.fit(X_train, y_train)
y_score = clf._score_top_rules(X_test)  # Get a risk score for each test example
precision, recall, _ = precision_recall_curve(y_test, y_score)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall curve')
plt.show()

Links with existing literature
-------------------------------
The main advantage of decision rules is that they offer interpretable models. The problem of generating such
rules has been widely considered in machine learning, see e.g. RuleFit [1], Slipper [2], LRI [3], MLRules [4].

A decision rule is a logical expression of the form "IF conditions THEN response". In a binary classification setting,
if an instance satisfies the conditions of the rule, then it is assigned to one of the two classes. If this instance does
not satisfy the conditions, it remains unassigned.

1) In [2, 3, 4], rule induction is done by considering each single decision rule as a base classifier in an ensemble,
which is built by greedily minimizing some loss function.

2) In [1], rules are extracted from an ensemble of trees; a weighted combination of these rules is then built by solving
an L1-regularized optimization problem over the weights as described in [5].

In this package, we use the second approach. Rules are extracted from a tree ensemble, which allows us to take advantage of
existing fast algorithms (such as bagged decision trees or gradient boosting) to produce such an ensemble. Rules that are too
similar or duplicated are then removed, based on a similarity threshold of their supports.

The main goal of this package is to provide rules verifying precision and recall conditions. It still implements a score
(`decision_function`) method, but one which does not solve the L1-regularized optimization problem as in [1]. Instead,
weights are simply proportional to the OOB associated precision of the rule.

This package also offers convenient methods to compute predictions with the k most precise rules (cf. the _score_top_rules()
and _predict_top_rules() functions).

[1] Friedman and Popescu, Predictive learning via rule ensembles, Technical Report, 2005.

[2] Cohen and Singer, A simple, fast, and effective rule learner, National Conference on Artificial Intelligence, 1999.

[3] Weiss and Indurkhya, Lightweight rule induction, ICML, 2000.

[4] Dembczyński, Kotłowski and Słowiński, Maximum Likelihood Rule Ensembles, ICML, 2008.

[5] Friedman and Popescu, Gradient directed regularization, Technical Report, 2004.
'''
import numbers
from warnings import warn
from typing import List, Tuple

import numpy as np
import pandas
import six
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import check_classification_targets, unique_labels
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

from imodels.rule_set.rule_set import RuleSet
from imodels.util.arguments import check_fit_arguments
from imodels.util.rule import replace_feature_name, get_feature_dict, Rule
from imodels.util.extract import extract_skope
from imodels.util.score import score_precision_recall
from imodels.util.prune import prune_mins, deduplicate

INTEGER_TYPES = (numbers.Integral, np.integer)
BASE_FEATURE_NAME = "feature_"
class SkopeRulesClassifier(BaseEstimator, RuleSet, ClassifierMixin):
    """An easily interpretable classifier optimizing simple logical rules.

    Parameters
    ----------
    feature_names : list of str, optional
        The names of each feature to be used for returning rules in string
        format.
    precision_min : float, optional (default=0.5)
        The minimal precision of a rule to be selected.
    recall_min : float, optional (default=0.01)
        The minimal recall of a rule to be selected.
    n_estimators : int, optional (default=10)
        The number of base estimators (rules) to use for prediction. More are
        built before selection. All are available in the estimators_ attribute.
    max_samples : int or float, optional (default=.8)
        The number of samples to draw from X to train each decision tree, from
        which rules are generated and selected.
            - If int, then draw `max_samples` samples.
            - If float, then draw `max_samples * X.shape[0]` samples.
        If max_samples is larger than the number of samples provided,
        all samples will be used for all trees (no sampling).
    max_samples_features : int or float, optional (default=.8)
        The number of features to draw from X to train each decision tree, from
        which rules are generated and selected.
            - If int, then draw `max_features` features.
            - If float, then draw `max_features * X.shape[1]` features.
    bootstrap : boolean, optional (default=False)
        Whether samples are drawn with replacement.
    bootstrap_features : boolean, optional (default=False)
        Whether features are drawn with replacement.
    max_depth : integer or List or None, optional (default=3)
        The maximum depth of the decision trees. If None, then nodes are
        expanded until all leaves are pure or until all leaves contain less
        than min_samples_split samples.
        If an iterable is passed, n_estimators trees are trained
        for each tree depth. This allows you to create and compare
        rules of different lengths.
    max_depth_duplication : integer, optional (default=None)
        The maximum depth of the decision tree for rule deduplication;
        if None then no deduplication occurs.
    max_features : int, float, string or None, optional (default="auto")
        The number of features considered (by each decision tree) when looking
        for the best split:
            - If int, then consider `max_features` features at each split.
            - If float, then `max_features` is a percentage and
              `int(max_features * n_features)` features are considered at each
              split.
            - If "auto", then `max_features=sqrt(n_features)`.
            - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
            - If "log2", then `max_features=log2(n_features)`.
            - If None, then `max_features=n_features`.
        Note: the search for a split does not stop until at least one
        valid partition of the node samples is found, even if it requires
        effectively inspecting more than ``max_features`` features.
    min_samples_split : int or float, optional (default=2)
        The minimum number of samples required to split an internal node for
        each decision tree.
            - If int, then consider `min_samples_split` as the minimum number.
            - If float, then `min_samples_split` is a percentage and
              `ceil(min_samples_split * n_samples)` is the minimum
              number of samples for each split.
    n_jobs : integer, optional (default=1)
        The number of jobs to run in parallel for both `fit` and `predict`.
        If -1, then the number of jobs is set to the number of cores.
    random_state : int, RandomState instance or None, optional
        - If int, random_state is the seed used by the random number generator.
        - If RandomState instance, random_state is the random number generator.
        - If None, the random number generator is the RandomState instance used
          by `np.random`.
    verbose : int, optional (default=0)
        Controls the verbosity of the tree building process.

    Attributes
    ----------
    rules_ : dict of tuples (rule, precision, recall, nb).
        The collection of `n_estimators` rules used in the ``predict`` method.
        The rules are generated by fitted sub-estimators (decision trees). Each
        rule satisfies the recall_min and precision_min conditions. The selection
        is done according to OOB precisions.
    estimators_ : list of DecisionTreeClassifier
        The collection of fitted sub-estimators used to generate candidate
        rules.
    estimators_samples_ : list of arrays
        The subset of drawn samples (i.e., the in-bag samples) for each base
        estimator.
    estimators_features_ : list of arrays
        The subset of drawn features for each base estimator.
    max_samples_ : integer
        The actual number of samples.
    n_features_ : integer
        The number of features when ``fit`` is performed.
    classes_ : array, shape (n_classes,)
        The class labels.
    """
    def __init__(self,
                 precision_min=0.5,
                 recall_min=0.01,
                 n_estimators=10,
                 max_samples=.8,
                 max_samples_features=.8,
                 bootstrap=False,
                 bootstrap_features=False,
                 max_depth=3,
                 max_depth_duplication=None,
                 max_features=1.,
                 min_samples_split=2,
                 n_jobs=1,
                 random_state=None,
                 verbose=0):
        self.precision_min = precision_min
        self.recall_min = recall_min
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.max_samples_features = max_samples_features
        self.bootstrap = bootstrap
        self.bootstrap_features = bootstrap_features
        self.max_depth = max_depth
        self.max_depth_duplication = max_depth_duplication
        self.max_features = max_features
        self.min_samples_split = min_samples_split
        self.n_jobs = n_jobs
        self.random_state = random_state
        self.verbose = verbose
    def fit(self, X, y, feature_names=None, sample_weight=None):
        """Fit the model according to the given training data.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape (n_samples,)
            Target vector relative to X. Has to follow the convention 0 for
            normal data, 1 for anomalies.
        sample_weight : array-like, shape (n_samples,), optional
            Array of weights that are assigned to individual samples, typically
            the amount in case of transactions data. Used to grow regression
            trees producing further rules to be tested.
            If not provided, then each sample is given unit weight.

        Returns
        -------
        self : object
            Returns self.
        """
        X, y, feature_names = check_fit_arguments(self, X, y, feature_names)
        check_classification_targets(y)
        self.n_features_ = X.shape[1]
        self.sample_weight = sample_weight
        self.classes_ = unique_labels(y)
        n_classes = len(self.classes_)

        if n_classes < 2:
            raise ValueError(
                "This method needs samples of at least 2 classes in the data, but the data contains only one class: %r"
                % self.classes_[0]
            )

        if not isinstance(self.max_depth_duplication, int) and self.max_depth_duplication is not None:
            raise ValueError("max_depth_duplication should be an integer")

        if not set(self.classes_) == {0, 1}:
            warn(
                "Found labels %s. This method assumes target class to be labeled as 1 and normal data to be labeled as "
                "0. Any label different from 0 will be considered as being from the target class."
                % set(self.classes_)
            )
            y = (y > 0)

        # ensure that max_samples is in [1, n_samples]:
        n_samples = X.shape[0]
        if isinstance(self.max_samples, six.string_types):
            raise ValueError(
                'max_samples (%s) is not supported. Valid choices are: "auto", int or float'
                % self.max_samples
            )
        elif isinstance(self.max_samples, INTEGER_TYPES):
            if self.max_samples > n_samples:
                warn(
                    "max_samples (%s) is greater than the total number of samples (%s). max_samples will be set "
                    "to n_samples for estimation."
                    % (self.max_samples, n_samples)
                )
                max_samples = n_samples
            else:
                max_samples = self.max_samples
        else:  # float
            if not (0. < self.max_samples <= 1.):
                raise ValueError("max_samples must be in (0, 1], got %r" % self.max_samples)
            max_samples = int(self.max_samples * X.shape[0])

        self.max_samples_ = max_samples
        self.feature_dict_ = get_feature_dict(X.shape[1], feature_names)
        self.feature_placeholders = np.array(list(self.feature_dict_.keys()))
        self.feature_names = np.array(list(self.feature_dict_.values()))

        # extract candidate rules from the tree ensemble, score them,
        # then prune by precision/recall and deduplicate
        extracted_rules, self.estimators_samples_, self.estimators_features_ = self._extract_rules(X, y)
        scored_rules = self._score_rules(X, y, extracted_rules)
        self.rules_ = self._prune_rules(scored_rules)

        self.rules_without_feature_names_ = self.rules_
        self.rules_ = [
            replace_feature_name(rule, self.feature_dict_) for rule in self.rules_
        ]
        self.complexity_ = self._get_complexity()
        return self
    def predict(self, X) -> np.ndarray:
        """Predict if a particular sample is an outlier or not.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples. Internally, it will be converted to
            ``dtype=np.float32``.

        Returns
        -------
        is_outlier : array, shape (n_samples,)
            For each observation, tells whether or not (1 or 0) it should
            be considered as an outlier according to the selected rules.
        """
        X = check_array(X)
        return np.argmax(self.predict_proba(X), axis=1)

    def predict_proba(self, X) -> np.ndarray:
        '''Predict the probability of a particular sample being an outlier or not.
        '''
        X = check_array(X)
        # rule weights are proportional to OOB precision (see module docstring);
        # normalize the weighted vote so the positive-class score lies in [0, 1]
        weight_sum = np.sum([w[0] for (r, w) in self.rules_without_feature_names_])
        if weight_sum == 0:
            return np.vstack((np.ones(X.shape[0]), np.zeros(X.shape[0]))).transpose()
        y = self._eval_weighted_rule_sum(X) / weight_sum
        return np.vstack((1 - y, y)).transpose()
    def _rules_vote(self, X) -> np.ndarray:
        """Score representing a vote of the base classifiers (rules).

        The score of an input sample is computed as the sum of the binary
        rule outputs: a score of k means that k rules have voted positively.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The training input samples.

        Returns
        -------
        scores : array, shape (n_samples,)
            The score of the input samples.
            The higher, the more abnormal. Positive scores represent outliers,
            null scores represent inliers.
        """
        # Check if fit had been called
        check_is_fitted(self, ['rules_', 'estimators_samples_', 'max_samples_'])

        # Input validation
        X = check_array(X)
        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] = %d should be equal to %d, "
                             "the number of features at training time."
                             " Please reshape your data."
                             % (X.shape[1], self.n_features_))

        df = pandas.DataFrame(X, columns=self.feature_placeholders)
        selected_rules = self.rules_without_feature_names_

        scores = np.zeros(X.shape[0])
        for (r, _) in selected_rules:
            scores[list(df.query(r).index)] += 1

        return scores
    def _score_top_rules(self, X) -> np.ndarray:
        """Score representing an ordering between the base classifiers (rules).

        The score is high when the instance is detected by a performing rule.
        If there are n rules, ordered by increasing OOB precision, a score of k
        means that the kth rule has voted positively, but not the (k-1) first
        rules.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The training input samples.

        Returns
        -------
        scores : array, shape (n_samples,)
            The score of the input samples.
            Positive scores represent outliers, null scores represent inliers.
        """
        # Check if fit had been called
        check_is_fitted(self, ['rules_', 'estimators_samples_', 'max_samples_'])

        # Input validation
        X = check_array(X)
        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] = %d should be equal to %d, "
                             "the number of features at training time."
                             " Please reshape your data."
                             % (X.shape[1], self.n_features_))

        df = pandas.DataFrame(X, columns=self.feature_placeholders)
        selected_rules = self.rules_without_feature_names_

        scores = np.zeros(X.shape[0])
        for (k, r) in enumerate(list(selected_rules)):
            scores[list(df.query(r.rule).index)] = np.maximum(
                len(selected_rules) - k,
                scores[list(df.query(r.rule).index)])

        return scores
    def _predict_top_rules(self, X, n_rules) -> np.ndarray:
        """Predict if a particular sample is an outlier or not,
        using the n_rules best performing rules.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples. Internally, it will be converted to
            ``dtype=np.float32``.
        n_rules : int
            The number of rules used for the prediction. If one of the
            n_rules best performing rules is activated, the prediction
            is equal to 1.

        Returns
        -------
        is_outlier : array, shape (n_samples,)
            For each observation, tells whether or not (1 or 0) it should
            be considered as an outlier according to the selected rules.
        """
        return np.array((self._score_top_rules(X) > len(self.rules_) - n_rules),
                        dtype=int)
    def _extract_rules(self, X, y) -> Tuple[List[str], List[np.array], List[np.array]]:
        return extract_skope(X, y,
                             feature_names=self.feature_placeholders,
                             sample_weight=self.sample_weight,
                             n_estimators=self.n_estimators,
                             max_samples=self.max_samples_,
                             max_samples_features=self.max_samples_features,
                             bootstrap=self.bootstrap,
                             bootstrap_features=self.bootstrap_features,
                             max_depths=self.max_depth,
                             max_features=self.max_features,
                             min_samples_split=self.min_samples_split,
                             n_jobs=self.n_jobs,
                             random_state=self.random_state,
                             verbose=self.verbose)

    def _score_rules(self, X, y, rules) -> List[Rule]:
        return score_precision_recall(X, y, rules, self.estimators_samples_,
                                      self.estimators_features_, self.feature_placeholders)

    def _prune_rules(self, rules) -> List[Rule]:
        return deduplicate(
            prune_mins(rules, self.precision_min, self.recall_min),
            self.max_depth_duplication
        )
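

# ---------------------------------------------------------------------------
# Minimal usage sketch (illustrative only, not part of the original skope-rules
# code): fits the classifier on a synthetic binary dataset and prints the
# learned rules and a few predictions. Assumes the imodels utilities imported
# above are installed; make_classification and train_test_split are standard
# scikit-learn.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # synthetic data: class 1 plays the role of the "target" class to scope
    X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SkopeRulesClassifier(n_estimators=30,
                               precision_min=0.3,
                               recall_min=0.05,
                               random_state=0)
    clf.fit(X_train, y_train,
            feature_names=[f"feat_{i}" for i in range(X.shape[1])])

    print("learned rules:", clf.rules_)
    print("predicted labels (first 10):", clf.predict(X_test)[:10])
    print("positive-class probabilities (first 5):", clf.predict_proba(X_test)[:5, 1])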