
README.md


https://dagshub.com/syedzubeen/all-scam-spam

Description

This is a large corpus of 42,619 preprocessed text messages and emails sent by humans in 43 languages. Each row is labeled with is_spam: is_spam=1 means spam and is_spam=0 means ham (a legitimate, non-spam message).

1,040 rows of balanced data, consisting of casual conversations and scam emails in about 10 languages, were manually collected and annotated by me, with some help from ChatGPT.
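
As a quick sanity check on the labels described above, the following is a minimal sketch that counts spam and ham rows in a CSV file. It assumes the data is available as CSV with a header row; only the is_spam column comes from this README, while the text column name, the count_labels helper, and the inline sample rows are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count spam (is_spam=1) and ham (is_spam=0) rows in an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # is_spam is the label column named in the README; 1 = spam, 0 = ham.
        label = "spam" if int(row["is_spam"]) == 1 else "ham"
        counts[label] += 1
    return counts


# Tiny inline sample standing in for the real file (hypothetical rows;
# the "text" column name is an assumption, not taken from the dataset).
sample = io.StringIO(
    "text,is_spam\n"
    "see you at lunch?,0\n"
    "you won a prize click here,1\n"
    "call me when you land,0\n"
)
counts = count_labels(sample)
print(counts["ham"], counts["spam"])
```

To run this on the actual data, pass a file opened with `open(path, newline="", encoding="utf-8")` in place of the sample, adjusting the path and column names to match the downloaded file.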

License

The All-Scam-Spam dataset is licensed under Apache 2.0.

Additional information

Sources

https://huggingface.co/datasets/sms_spam
https://github.com/MWiechmann/enron_spam_data
https://github.com/stdlib-js/datasets-spam-assassin
https://repository.ortolang.fr/api/content/comere/v3.3/cmr-simuligne.html
