text_utils.py 955 B

# IPA Phonemizer: https://github.com/bootphon/phonemizer

_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"

# Export all symbols:
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)

# Map each symbol to its position in `symbols`.
dicts = {symbol: i for i, symbol in enumerate(symbols)}


class TextCleaner:
    """Convert a string into a list of symbol indices."""

    def __init__(self, dummy=None):
        # `dummy` is unused.
        self.word_index_dictionary = dicts
        print(len(dicts))

    def __call__(self, text):
        indexes = []
        for char in text:
            try:
                indexes.append(self.word_index_dictionary[char])
            except KeyError:
                # Unknown characters are skipped; the full input is printed for debugging.
                print(text)
        return indexes
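
The lookup above can be illustrated with a small, self-contained sketch. It builds a reduced symbol table in the same order (pad, punctuation, letters, IPA) and encodes a string character by character. The names `symbol_to_id` and `encode`, the shortened punctuation and IPA sets, and the choice to return unknown characters instead of printing them are assumptions made for illustration; they are not part of text_utils.py.

```python
# Reduced symbol table, built in the same order as in text_utils.py.
# The punctuation and IPA sets are shortened here for illustration only.
_pad = "$"
_punctuation = ';:,.!? '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ə"

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)
symbol_to_id = {symbol: i for i, symbol in enumerate(symbols)}


def encode(text):
    """Hypothetical helper: return (indices, unknown characters) for `text`."""
    ids, unknown = [], []
    for char in text:
        if char in symbol_to_id:
            ids.append(symbol_to_id[char])
        else:
            # Collect unknown characters instead of printing the whole input.
            unknown.append(char)
    return ids, unknown


ids, unknown = encode("Hi, ə!")
print(ids, unknown)      # [15, 42, 3, 7, 60, 5] []

ids2, unknown2 = encode("a~b")
print(ids2, unknown2)    # [34, 35] ['~']
```

Index 0 is reserved for the pad symbol `$`, so every real character maps to an index of 1 or higher. Returning the unknown characters makes it easier to see which symbols are missing from the table than printing the whole input would.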