data-preprocessing.py 1.1 KB

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# const supplies the paths, column names, class labels and M_PRO_* log messages used below
from const import *

print(M_PRO_INIT, '\n' + M_PRO_LOAD_DATA)
data = pd.read_csv(RAW_DATA_PATH)

print(M_PRO_RMV_PUNC)
# Lowercase and replace newlines with a space so words on adjacent lines are not fused;
# punctuation is dropped later by CountVectorizer's default token pattern
clean_text = data[TEXT_COL_NAME].map(lambda x: x.lower().replace('\n', ' '))

print(M_PRO_LE)
# Encode the two target classes as 0 and 1
y = data[TARGET_COL].map({CLASS_0: 0, CLASS_1: 1})

print(M_PRO_VEC)
# Each column is a 1- or 2-word phrase; each value is how many times it appears in the email
email_text_list = clean_text.tolist()
vectorizer = CountVectorizer(encoding='utf-8', decode_error='ignore', stop_words='english',
                             analyzer='word', ngram_range=(1, 2), max_features=500)
X_sparse = vectorizer.fit_transform(email_text_list)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X = pd.DataFrame(X_sparse.toarray(), columns=vectorizer.get_feature_names_out())

print(M_PRO_SPLIT_DATA)
# Hold out 20% of the rows for testing, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

print(M_PRO_SAVE_DATA)
X_train.to_csv(X_TRAIN_PATH, index=False)
X_test.to_csv(X_TEST_PATH, index=False)
y_train.to_csv(Y_TRAIN_PATH, index=False)
y_test.to_csv(Y_TEST_PATH, index=False)
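
To illustrate what the vectorization step in the script counts, here is a minimal pure-Python sketch of unigram and bigram counting. It only approximates `CountVectorizer` with `ngram_range=(1, 2)`: it splits on whitespace, and it skips stop-word removal and the `max_features` cap. The helper name `count_ngrams` and the sample text are hypothetical and not part of the original script.

```python
from collections import Counter


def count_ngrams(text, max_n=2):
    """Count every 1..max_n word n-gram in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample email text
counts = count_ngrams('free prize claim free prize')
print(counts['free prize'])  # 2
print(counts['claim'])       # 1
```

In the script, each such n-gram becomes one column of `X`, and the count for a given email becomes that row's value in the column.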