
test_memmap.py 966 B

import pickle

import numpy as np
import torch
import tqdm
from torch.nn import functional as F

MAX_TOKENS = 20

# Raw float32 buffer on disk. np.memmap writes no .npy header, so read it back
# with np.memmap using the same dtype and shape, not np.load.
data = np.memmap("text_data.npy", dtype=np.float32, mode="w+", shape=(470_000, MAX_TOKENS, MAX_TOKENS, 256))
data_index = []  # (key, strings, data_index)
index = 0
for i in range(47):
    print(f"Loading data {i}...")
    with open(f"./train_data/data_text_saves{i}.pk", "rb") as f:
        text_data = pickle.load(f)
    for seg in tqdm.tqdm(text_data):
        data_index.append((seg[0], seg[1], index))
        text_segs = torch.from_numpy(seg[2])
        # Zero-pad the first two dims up to MAX_TOKENS.
        # F.pad lists (left, right) pairs starting from the last dimension.
        pad_rows = max(0, MAX_TOKENS - text_segs.shape[0])
        pad_cols = max(0, MAX_TOKENS - text_segs.shape[1])
        text_segs = F.pad(text_segs, pad=(0, 0, 0, pad_cols, 0, pad_rows), mode="constant", value=0)
        # Crop anything longer than MAX_TOKENS.
        text_segs = text_segs[:MAX_TOKENS, :MAX_TOKENS, :]
        data[index] = text_segs.numpy()
        index += 1
with open("./train_data/data_index.pk", "wb") as f:
    pickle.dump(data_index, f)
data.flush()
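
The script above only writes the buffer. As a minimal sketch of reading it back, the snippet below reopens a memmap read-only and looks up a row through the pickled index. The helper name `open_text_memmap`, the temporary file names, and the toy sizes (3 rows, feature width 4) are assumptions for illustration only; for the real file you would pass the shape used by the writer.

```python
import os
import pickle
import tempfile

import numpy as np


def open_text_memmap(path, num_rows, max_tokens=20, dim=256):
    """Open the raw float32 buffer read-only; the shape must match the writer's."""
    return np.memmap(path, dtype=np.float32, mode="r",
                     shape=(num_rows, max_tokens, max_tokens, dim))


# Toy round trip (hypothetical sizes and paths, not the real dataset).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_text_data.npy")
demo_index_path = os.path.join(tmp_dir, "demo_index.pk")

writer = np.memmap(demo_path, dtype=np.float32, mode="w+", shape=(3, 20, 20, 4))
writer[1] = 1.0  # fill the second row with ones
writer.flush()
del writer  # release the write handle before reopening

with open(demo_index_path, "wb") as f:
    pickle.dump([("key0", "a", 0), ("key1", "b", 1), ("key2", "c", 2)], f)

with open(demo_index_path, "rb") as f:
    demo_index = pickle.load(f)

reader = open_text_memmap(demo_path, num_rows=len(demo_index), dim=4)
key, strings, row = demo_index[1]
sample = np.array(reader[row])  # copy a single row into memory
print(key, sample.shape)
```

Because the file has no header, a mismatched shape or dtype will not raise an error for a smaller shape; it silently reinterprets the bytes, so it may be worth storing the shape alongside the index.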