process_data.py 583 B

import pandas as pd
import yaml


def process_data(frac=0.1, split="train"):
    """Sample a fraction of a raw split and write it to data/processed/."""
    df = pd.read_csv("data/raw/{}.csv".format(split))
    # Name the three columns explicitly.
    df.columns = ["Unnamed: 0", "input_text", "output_text"]
    # Sample with replacement; the fixed seed keeps the subset reproducible.
    df_new = df.sample(frac=frac, replace=True, random_state=1)
    df_new.to_csv("data/processed/{}.csv".format(split))


if __name__ == "__main__":
    with open("data_params.yml") as f:
        params = yaml.safe_load(f)
    # The "split" key in data_params.yml holds the sampling fraction.
    for split in ("train", "test", "validation"):
        process_data(frac=params["split"], split=split)