split-train-test.py 1.6 KB

import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Constants
DATASET_COLUMNS = ["sentiment", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
TARGET_COL = 'sentiment'
CSV_PATH = 'raw-data/twitter-sentiment-analysis-sentiment140dataset.csv'
NEW_DIR = 'split-data'
# Each constant name now matches the file it points to
# (the X_TEST / Y_TRAIN names were previously swapped).
X_TRAIN_PATH = 'split-data/X_train.csv'
Y_TRAIN_PATH = 'split-data/y_train.csv'
X_TEST_PATH = 'split-data/X_test.csv'
Y_TEST_PATH = '../y_test/y_test.csv'
TEST_SIZE = 0.03

print("Read raw data")
df = pd.read_csv(CSV_PATH, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
print(f'data set shape {df.shape}')

print("Replace target col values")
# Order matters: map 0 -> 1 first, so the zeros created by 4 -> 0 are not remapped.
df[TARGET_COL] = df[TARGET_COL].replace(0, 1)  # Negative: 0 -> 1
df[TARGET_COL] = df[TARGET_COL].replace(4, 0)  # Positive: 4 -> 0

# Create the output directories; to_csv does not create missing parent directories.
os.makedirs(NEW_DIR, exist_ok=True)
os.makedirs(os.path.dirname(Y_TEST_PATH), exist_ok=True)

print("Split dataset to train and test")
# Hold out TEST_SIZE of the rows as the test set, keeping the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(TARGET_COL, axis=1), df[TARGET_COL],
    test_size=TEST_SIZE, random_state=42,
    stratify=df[TARGET_COL])
# Keep only a stratified 10% sample of the remaining rows as the training set;
# the other 90% is discarded.
_, X_train, _, y_train = train_test_split(
    X_train, y_train,
    test_size=0.1, random_state=42,
    stratify=y_train)

print("Save data sets to csv")
X_train.to_csv(X_TRAIN_PATH, index=False)
y_train.to_csv(Y_TRAIN_PATH, index=False)
X_test.to_csv(X_TEST_PATH, index=False)
# y_test will be saved outside of the repo - to prevent cheating.
y_test.to_csv(Y_TEST_PATH, index=False)
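
The script passes stratify= to both train_test_split calls, so each split keeps the same share of positive and negative labels. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's implementation, and the stratified_split helper and the toy labels are hypothetical names made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) so each label keeps about the same share in both parts.

    Illustrative helper only (an assumption for this sketch), not sklearn's algorithm.
    """
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every label group for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy balanced labels: 500 positive (0) and 500 negative (1), as after the remapping step.
labels = [0] * 500 + [1] * 500
train_idx, test_idx = stratified_split(labels, test_size=0.03)

test_negatives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_negatives)
```

With a 3% test fraction, every label group contributes 3% of its rows to the test set, so a balanced input yields a balanced test set. A plain random split only matches the class ratio approximately.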