preprocess.py

import sys

from transformers import AutoTokenizer

# Split a token-per-line (CoNLL-style) dataset into chunks so that no block
# between blank lines exceeds the model's maximum subword length.
dataset = sys.argv[1]
model_name_or_path = sys.argv[2]
max_len = int(sys.argv[3])

subword_len_counter = 0

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# Reserve room for the special tokens the tokenizer adds (e.g. [CLS], [SEP]).
max_len -= tokenizer.num_special_tokens_to_add()

with open(dataset, "rt") as f_p:
    for line in f_p:
        line = line.rstrip()

        # An empty line marks a sentence boundary: emit it and reset the counter.
        if not line:
            print(line)
            subword_len_counter = 0
            continue

        token = line.split()[0]

        current_subwords_len = len(tokenizer.tokenize(token))

        # Token contains strange control characters like \x96 or \x95
        # Just filter out the complete line
        if current_subwords_len == 0:
            continue

        # Adding this token would exceed the limit: start a new chunk by
        # printing a blank line before the current token line.
        if (subword_len_counter + current_subwords_len) > max_len:
            print("")
            print(line)
            subword_len_counter = current_subwords_len
            continue

        subword_len_counter += current_subwords_len

        print(line)
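
The script reads the dataset path, the tokenizer name, and the maximum sequence length from the command line and writes the re-chunked dataset to stdout. A typical invocation looks like the following (the file names and model name here are illustrative, not taken from the page itself):

python preprocess.py train.txt.tmp bert-base-multilingual-cased 128 > train.txt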