
featurization.py 1.4 KB

  1. """
  2. Create feature CSVs for train and test datasets
  3. """
  4. import json
  5. import numpy as np
  6. import pandas as pd
  7. from sklearn.decomposition import PCA
  8. import pickle
  9. import base64
  10. def featurization():
  11. # Load data-sets
  12. print("Loading data sets...")
  13. train_data = pd.read_csv('./data/train_data.csv', header=None, dtype=float).values
  14. test_data = pd.read_csv('./data/test_data.csv', header=None, dtype=float).values
  15. print("done.")
  16. # Create PCA object of the 15 most important components
  17. print("Creating PCA object...")
  18. pca = PCA(n_components=15, whiten=True)
  19. pca.fit(train_data[:, 1:])
  20. train_labels = train_data[:, 0].reshape([train_data.shape[0], 1])
  21. test_labels = test_data[:, 0].reshape([test_data.shape[0], 1])
  22. train_data = np.concatenate([train_labels, pca.transform(train_data[:, 1:])], axis=1)
  23. test_data = np.concatenate([test_labels, pca.transform(test_data[:, 1:])], axis=1)
  24. print("done.")
  25. # END NEW CODE
  26. print("Saving processed datasets and normalization parameters...")
  27. # Save normalized data-sets
  28. np.save('./data/processed_train_data', train_data)
  29. np.save('./data/processed_test_data', test_data)
  30. # Save learned PCA for future inference
  31. with open('./data/norm_params.json', 'w') as f:
  32. pca_as_string = base64.encodebytes(pickle.dumps(pca)).decode("utf-8")
  33. json.dump({ 'pca': pca_as_string }, f)
  34. print("done.")
  35. if __name__ == '__main__':
  36. featurization()