
data-preprocessing.py 1.2 KB

import string

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from src.const import *

print(M_PRO_INIT, '\n' + M_PRO_LOAD_DATA)
data = pd.read_csv(RAW_DATA_PATH)

# Lowercase the text, drop newlines, and strip punctuation
print(M_PRO_RMV_PUNC)
punctuation_table = str.maketrans('', '', string.punctuation)
clean_text = data[TEXT_COL_NAME].map(
    lambda x: x.lower().replace('\n', '').translate(punctuation_table)
)

# Encode the two target classes as 0 and 1
print(M_PRO_LE)
y = data[TARGET_COL].map({CLASS_0: 0, CLASS_1: 1})

# Each column is a 1- or 2-word n-gram; each value is the number of times
# that n-gram appears in the email
print(M_PRO_VEC)
email_text_list = clean_text.tolist()
vectorizer = CountVectorizer(encoding='utf-8', decode_error='ignore', stop_words='english',
                             analyzer='word', ngram_range=(1, 2), max_features=500)
X_sparse = vectorizer.fit_transform(email_text_list)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X = pd.DataFrame(X_sparse.toarray(), columns=vectorizer.get_feature_names_out())

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
print(M_PRO_SPLIT_DATA)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Write the four splits to disk without the index column
print(M_PRO_SAVE_DATA)
X_train.to_csv(X_TRAIN_PATH, index=False)
X_test.to_csv(X_TEST_PATH, index=False)
y_train.to_csv(Y_TRAIN_PATH, index=False)
y_test.to_csv(Y_TEST_PATH, index=False)
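
The text-cleaning step in the script above can be tried in isolation. The following is a minimal sketch using only the standard library; the helper name clean_email_text and the sample string are illustrative assumptions and do not appear in the original script.

```python
import string

# Translation table that deletes every ASCII punctuation character
# (hypothetical module-level name, not part of the original script)
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def clean_email_text(text):
    """Lowercase the text, delete newlines, and strip punctuation.

    Mirrors the lambda applied to the text column in the script above.
    """
    return text.lower().replace('\n', '').translate(PUNCTUATION_TABLE)


if __name__ == "__main__":
    sample = "Hello, World!\nBye."  # made-up example input
    print(clean_email_text(sample))
```

Because newlines are replaced with an empty string rather than a space, the last word of one line is joined to the first word of the next. If that is not intended, replacing '\n' with ' ' may be a better fit before vectorizing.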