'''
Skope-rules aims at learning logical, interpretable rules for "scoping" a target
class, i.e. detecting with high precision instances of this class.
Skope-rules is a trade-off between the interpretability of a decision tree
and the modeling power of a random forest. Code adapted with only minor changes
from the `skope-rules project <https://github.com/scikit-learn-contrib/skope-rules>`_.
Full credit to the authors. The original project and docs are available
`here <http://skope-rules.readthedocs.io/en/latest/>`_.

Example
-------
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import precision_recall_curve
from matplotlib import pyplot as plt
from imodels import SkopeRulesClassifier

dataset = fetch_california_housing()
clf = SkopeRulesClassifier(max_depth_duplication=None,
                           n_estimators=30,
                           precision_min=0.2,
                           recall_min=0.01)

X, y = dataset.data, dataset.target > 2.5  # binarize the regression target
X_train, y_train = X[:len(y) // 2], y[:len(y) // 2]
X_test, y_test = X[len(y) // 2:], y[len(y) // 2:]
clf.fit(X_train, y_train, feature_names=dataset.feature_names)
y_score = clf._score_top_rules(X_test)  # get a risk score for each test example
precision, recall, _ = precision_recall_curve(y_test, y_score)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-recall curve')
plt.show()

Links with existing literature
------------------------------
The main advantage of decision rules is that they offer interpretable models.
The problem of generating such rules has been widely studied in machine
learning; see e.g. RuleFit [1], SLIPPER [2], LRI [3] and MLRules [4].
A decision rule is a logical expression of the form "IF conditions THEN
response". In a binary classification setting, if an instance satisfies the
conditions of the rule, it is assigned to one of the two classes. If it does
not satisfy the conditions, it remains unassigned.

1) In [2, 3, 4], rule induction is done by considering each single decision
   rule as a base classifier in an ensemble, which is built by greedily
   minimizing some loss function.
2) In [1], rules are extracted from an ensemble of trees; a weighted
   combination of these rules is then built by solving an L1-regularized
   optimization problem over the weights, as described in [5].

This package uses the second approach. Rules are extracted from a tree
ensemble, which allows us to take advantage of existing fast algorithms (such
as bagged decision trees or gradient boosting) to produce the ensemble. Rules
that are too similar or duplicated are then removed, based on a similarity
threshold of their supports.
The main goal of this package is to provide rules satisfying precision and
recall conditions. It still implements a scoring (`decision_function`) method,
but one that does not solve the L1-regularized optimization problem of [1].
Instead, weights are simply proportional to the out-of-bag (OOB) precision of
each rule.
This package also offers convenient methods to compute predictions with the k
most precise rules (cf. the _score_top_rules() and _predict_top_rules()
functions).
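
For intuition, the precision-weighted scoring can be illustrated schematically
(this is a sketch of the idea, not the exact implementation): with two selected
rules of OOB precision 0.9 and 0.6, a sample satisfying both rules scores
(0.9 + 0.6) / (0.9 + 0.6) = 1.0, a sample satisfying only the second rule
scores 0.6 / 1.5 = 0.4, and a sample satisfying neither scores 0.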

[1] Friedman and Popescu, "Predictive learning via rule ensembles", Technical Report, 2005.
[2] Cohen and Singer, "A simple, fast, and effective rule learner", National Conference on Artificial Intelligence, 1999.
[3] Weiss and Indurkhya, "Lightweight rule induction", ICML, 2000.
[4] Dembczyński, Kotłowski and Słowiński, "Maximum likelihood rule ensembles", ICML, 2008.
[5] Friedman and Popescu, "Gradient directed regularization", Technical Report, 2004.
'''
import numbers
from typing import List, Tuple
from warnings import warn

import numpy as np
import pandas
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import check_classification_targets, unique_labels
from sklearn.utils.validation import check_array, check_is_fitted

from imodels.rule_set.rule_set import RuleSet
from imodels.util.arguments import check_fit_arguments
from imodels.util.extract import extract_skope
from imodels.util.prune import prune_mins, deduplicate
from imodels.util.rule import replace_feature_name, get_feature_dict, Rule
from imodels.util.score import score_precision_recall

INTEGER_TYPES = (numbers.Integral, np.integer)
BASE_FEATURE_NAME = "feature_"


class SkopeRulesClassifier(BaseEstimator, RuleSet, ClassifierMixin):
    """An easily interpretable classifier optimizing simple logical rules.

    Parameters
    ----------
    feature_names : list of str, optional
        The names of each feature, used for returning rules in string
        format (passed to ``fit``).

    precision_min : float, optional (default=0.5)
        The minimal precision of a rule to be selected.

    recall_min : float, optional (default=0.01)
        The minimal recall of a rule to be selected.

    n_estimators : int, optional (default=10)
        The number of base estimators (rules) to use for prediction. More are
        built before selection. All are available in the estimators_ attribute.

    max_samples : int or float, optional (default=.8)
        The number of samples to draw from X to train each decision tree, from
        which rules are generated and selected.
            - If int, then draw `max_samples` samples.
            - If float, then draw `max_samples * X.shape[0]` samples.
        If max_samples is larger than the number of samples provided,
        all samples will be used for all trees (no sampling).

    max_samples_features : int or float, optional (default=.8)
        The number of features to draw from X to train each decision tree, from
        which rules are generated and selected.
            - If int, then draw `max_samples_features` features.
            - If float, then draw `max_samples_features * X.shape[1]` features.

    bootstrap : boolean, optional (default=False)
        Whether samples are drawn with replacement.

    bootstrap_features : boolean, optional (default=False)
        Whether features are drawn with replacement.

    max_depth : integer or List or None, optional (default=3)
        The maximum depth of the decision trees. If None, then nodes are
        expanded until all leaves are pure or until all leaves contain less
        than min_samples_split samples.
        If an iterable is passed, n_estimators trees are trained for each
        depth, which allows you to create and compare rules of different
        lengths.

    max_depth_duplication : integer, optional (default=None)
        The maximum depth of the decision tree used for rule deduplication.
        If None, no deduplication occurs.

    max_features : int, float, string or None, optional (default=1.)
        The number of features considered (by each decision tree) when looking
        for the best split:
            - If int, then consider `max_features` features at each split.
            - If float, then `max_features` is a percentage and
              `int(max_features * n_features)` features are considered at each
              split.
            - If "auto", then `max_features=sqrt(n_features)`.
            - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
            - If "log2", then `max_features=log2(n_features)`.
            - If None, then `max_features=n_features`.
        Note: the search for a split does not stop until at least one
        valid partition of the node samples is found, even if it requires to
        effectively inspect more than ``max_features`` features.

    min_samples_split : int or float, optional (default=2)
        The minimum number of samples required to split an internal node for
        each decision tree.
            - If int, then consider `min_samples_split` as the minimum number.
            - If float, then `min_samples_split` is a percentage and
              `ceil(min_samples_split * n_samples)` is the minimum number of
              samples for each split.

    n_jobs : integer, optional (default=1)
        The number of jobs to run in parallel for both `fit` and `predict`.
        If -1, then the number of jobs is set to the number of cores.

    random_state : int, RandomState instance or None, optional
        - If int, random_state is the seed used by the random number generator.
        - If RandomState instance, random_state is the random number generator.
        - If None, the random number generator is the RandomState instance
          used by `np.random`.

    verbose : int, optional (default=0)
        Controls the verbosity of the tree building process.

    Attributes
    ----------
    rules_ : dict of tuples (rule, precision, recall, nb).
        The collection of `n_estimators` rules used in the ``predict`` method.
        The rules are generated by fitted sub-estimators (decision trees). Each
        rule satisfies the recall_min and precision_min conditions. The
        selection is done according to OOB precisions.

    estimators_ : list of DecisionTreeClassifier
        The collection of fitted sub-estimators used to generate candidate
        rules.

    estimators_samples_ : list of arrays
        The subset of drawn samples (i.e., the in-bag samples) for each base
        estimator.

    estimators_features_ : list of arrays
        The subset of drawn features for each base estimator.

    max_samples_ : integer
        The actual number of samples used to train each estimator.

    n_features_ : integer
        The number of features when ``fit`` is performed.

    classes_ : array, shape (n_classes,)
        The class labels.
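
    Example
    -------
    A minimal usage sketch (it assumes the class is importable from the
    ``imodels`` package and uses synthetic data)::

        import numpy as np
        from imodels import SkopeRulesClassifier

        X = np.random.RandomState(0).randn(200, 4)
        y = (X[:, 0] > 0).astype(int)   # positive class depends on feature 0
        clf = SkopeRulesClassifier(precision_min=0.5, recall_min=0.01)
        clf.fit(X, y, feature_names=['a', 'b', 'c', 'd'])
        print(clf.rules_)       # learned rules with readable feature names
        print(clf.predict(X[:5]))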
- """

    def __init__(self,
                 precision_min=0.5,
                 recall_min=0.01,
                 n_estimators=10,
                 max_samples=.8,
                 max_samples_features=.8,
                 bootstrap=False,
                 bootstrap_features=False,
                 max_depth=3,
                 max_depth_duplication=None,
                 max_features=1.,
                 min_samples_split=2,
                 n_jobs=1,
                 random_state=None,
                 verbose=0):
        self.precision_min = precision_min
        self.recall_min = recall_min
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.max_samples_features = max_samples_features
        self.bootstrap = bootstrap
        self.bootstrap_features = bootstrap_features
        self.max_depth = max_depth
        self.max_depth_duplication = max_depth_duplication
        self.max_features = max_features
        self.min_samples_split = min_samples_split
        self.n_jobs = n_jobs
        self.random_state = random_state
        self.verbose = verbose

    def fit(self, X, y, feature_names=None, sample_weight=None):
        """Fit the model according to the given training data.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like, shape (n_samples,)
            Target vector relative to X. Has to follow the convention 0 for
            normal data, 1 for anomalies.

        feature_names : list of str, optional
            The names of each feature, used for returning rules in string
            format.

        sample_weight : array-like, shape (n_samples,), optional
            Array of weights that are assigned to individual samples, typically
            the amount in case of transaction data. Used to grow regression
            trees producing further rules to be tested.
            If not provided, each sample is given unit weight.

        Returns
        -------
        self : object
            Returns self.
        """
        X, y, feature_names = check_fit_arguments(self, X, y, feature_names)
        check_classification_targets(y)
        self.n_features_ = X.shape[1]
        self.sample_weight = sample_weight
        self.classes_ = unique_labels(y)
        n_classes = len(self.classes_)

        if n_classes < 2:
            raise ValueError(
                "This method needs samples of at least 2 classes in the data, but the data contains only one class: %r"
                % self.classes_[0]
            )

        if not isinstance(self.max_depth_duplication, int) and self.max_depth_duplication is not None:
            raise ValueError("max_depth_duplication should be an integer")

        if not set(self.classes_) == {0, 1}:
            warn(
                "Found labels %s. This method assumes target class to be labeled as 1 and normal data to be labeled as "
                "0. Any label different from 0 will be considered as being from the target class."
                % set(self.classes_)
            )
            y = (y > 0)

        # ensure that max_samples is in [1, n_samples]:
        n_samples = X.shape[0]
        if isinstance(self.max_samples, str):
            raise ValueError(
                'max_samples (%s) is not supported. Valid choices are int or float'
                % self.max_samples
            )
        elif isinstance(self.max_samples, INTEGER_TYPES):
            if self.max_samples > n_samples:
                warn(
                    "max_samples (%s) is greater than the total number of samples (%s). max_samples will be set "
                    "to n_samples for estimation."
                    % (self.max_samples, n_samples)
                )
                max_samples = n_samples
            else:
                max_samples = self.max_samples
        else:  # float
            if not (0. < self.max_samples <= 1.):
                raise ValueError("max_samples must be in (0, 1], got %r" % self.max_samples)
            max_samples = int(self.max_samples * X.shape[0])
        self.max_samples_ = max_samples

        self.feature_dict_ = get_feature_dict(X.shape[1], feature_names)
        self.feature_placeholders = np.array(list(self.feature_dict_.keys()))
        self.feature_names = np.array(list(self.feature_dict_.values()))
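
        # Rule-learning pipeline: (1) extract candidate rules from a tree
        # ensemble, (2) score each rule's out-of-bag precision and recall,
        # (3) prune rules below the thresholds and deduplicate similar ones.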
        extracted_rules, self.estimators_samples_, self.estimators_features_ = self._extract_rules(X, y)
        scored_rules = self._score_rules(X, y, extracted_rules)
        self.rules_ = self._prune_rules(scored_rules)

        self.rules_without_feature_names_ = self.rules_
        self.rules_ = [
            replace_feature_name(rule, self.feature_dict_) for rule in self.rules_
        ]
        self.complexity_ = self._get_complexity()
        return self

    def predict(self, X) -> np.ndarray:
        """Predict if a particular sample is an outlier or not.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples. Internally, it will be converted to
            ``dtype=np.float32``.

        Returns
        -------
        is_outlier : array, shape (n_samples,)
            For each observation, tells whether or not (1 or 0) it should
            be considered as an outlier according to the selected rules.
        """
        X = check_array(X)
        return np.argmax(self.predict_proba(X), axis=1)

    def predict_proba(self, X) -> np.ndarray:
        '''Predict the probability of a particular sample being an outlier or not.
        '''
        X = check_array(X)
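        # Each selected rule is weighted by its OOB precision (w[0]); the
        # positive-class probability is the weighted sum of the rules a sample
        # satisfies, normalized by the total weight. With no selected rules,
        # every sample gets positive-class probability 0.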
        weight_sum = np.sum([w[0] for (r, w) in self.rules_without_feature_names_])
        if weight_sum == 0:
            return np.vstack((np.ones(X.shape[0]), np.zeros(X.shape[0]))).transpose()
        y = self._eval_weighted_rule_sum(X) / weight_sum
        return np.vstack((1 - y, y)).transpose()

    def _rules_vote(self, X) -> np.ndarray:
        """Score representing a vote of the base classifiers (rules).

        The score of an input sample is computed as the sum of the binary
        rules' outputs: a score of k means that k rules have voted positively.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        scores : array, shape (n_samples,)
            The score of the input samples.
            The higher, the more abnormal. Positive scores represent outliers,
            null scores represent inliers.
        """
        # Check if fit had been called
        check_is_fitted(self, ['rules_', 'estimators_samples_', 'max_samples_'])

        # Input validation
        X = check_array(X)
        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] = %d should be equal to %d, "
                             "the number of features at training time."
                             " Please reshape your data."
                             % (X.shape[1], self.n_features_))

        df = pandas.DataFrame(X, columns=self.feature_placeholders)
        selected_rules = self.rules_without_feature_names_
        scores = np.zeros(X.shape[0])
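        # Each rule casts one vote for every sample satisfying it; rules are
        # evaluated as boolean queries against the dataframe.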
        for (r, _) in selected_rules:
            scores[list(df.query(r).index)] += 1
        return scores

    def _score_top_rules(self, X) -> np.ndarray:
        """Score representing an ordering between the base classifiers (rules).

        The score is high when the instance is detected by a well-performing
        rule. If there are n rules, ordered by increasing OOB precision, a
        score of k means that the kth rule has voted positively, but none of
        the more precise rules (k+1 through n) have.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        scores : array, shape (n_samples,)
            The score of the input samples.
            Positive scores represent outliers, null scores represent inliers.
        """
        # Check if fit had been called
        check_is_fitted(self, ['rules_', 'estimators_samples_', 'max_samples_'])

        # Input validation
        X = check_array(X)
        if X.shape[1] != self.n_features_:
            raise ValueError("X.shape[1] = %d should be equal to %d, "
                             "the number of features at training time."
                             " Please reshape your data."
                             % (X.shape[1], self.n_features_))

        df = pandas.DataFrame(X, columns=self.feature_placeholders)
        selected_rules = self.rules_without_feature_names_
        scores = np.zeros(X.shape[0])
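        # The rule at index k contributes score len(selected_rules) - k;
        # assuming rules are ordered from most to least precise, each sample
        # keeps the score of the most precise rule that fires on it (via the
        # running maximum).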
        for (k, r) in enumerate(selected_rules):
            scores[list(df.query(r.rule).index)] = np.maximum(
                len(selected_rules) - k,
                scores[list(df.query(r.rule).index)])
        return scores

    def _predict_top_rules(self, X, n_rules) -> np.ndarray:
        """Predict if a particular sample is an outlier or not,
        using the n_rules best-performing rules.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples. Internally, it will be converted to
            ``dtype=np.float32``.

        n_rules : int
            The number of rules used for the prediction. If one of the
            n_rules best-performing rules is activated, the prediction
            is equal to 1.

        Returns
        -------
        is_outlier : array, shape (n_samples,)
            For each observation, tells whether or not (1 or 0) it should
            be considered as an outlier according to the selected rules.
        """
        return np.array((self._score_top_rules(X) > len(self.rules_) - n_rules),
                        dtype=int)

    def _extract_rules(self, X, y) -> Tuple[List[str], List[np.ndarray], List[np.ndarray]]:
        return extract_skope(X, y,
                             feature_names=self.feature_placeholders,
                             sample_weight=self.sample_weight,
                             n_estimators=self.n_estimators,
                             max_samples=self.max_samples_,
                             max_samples_features=self.max_samples_features,
                             bootstrap=self.bootstrap,
                             bootstrap_features=self.bootstrap_features,
                             max_depths=self.max_depth,
                             max_features=self.max_features,
                             min_samples_split=self.min_samples_split,
                             n_jobs=self.n_jobs,
                             random_state=self.random_state,
                             verbose=self.verbose)

    def _score_rules(self, X, y, rules) -> List[Rule]:
        return score_precision_recall(X, y, rules,
                                      self.estimators_samples_,
                                      self.estimators_features_,
                                      self.feature_placeholders)

    def _prune_rules(self, rules) -> List[Rule]:
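        # Drop rules below the precision/recall thresholds, then merge
        # near-duplicate rules (deduplication depth is controlled by
        # max_depth_duplication).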
        return deduplicate(
            prune_mins(rules, self.precision_min, self.recall_min),
            self.max_depth_duplication
        )