Sophos/ReversingLabs 20 Million malware detection dataset

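Stream data with DagsHub Direct Data Access (DDA):

from dagshub.streaming import DagsHubFilesystem

# Mount the DagsHub repository as a virtual filesystem rooted at the current directory
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/sorel-20m-dataset")

# List the top level of the connected S3 bucket
fs.listdir("s3://sorel-20m/")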

Description:

A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. Every sample is labeled using Sophos in-house labeling methods and comes with features extracted using the EMBER-v2 feature set, metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the sample.

Update Frequency:

At most annually

Managed By:

Sophos AI

Resources:

  1. resource:
    • Description: Sophos/ReversingLabs 20 million sample dataset
    • ARN: arn:aws:s3:::sorel-20m/
    • Region: us-west-2
    • Type: S3 Bucket
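
The resource entry above can also be read directly from S3. The following is a minimal sketch, not an official access method: it assumes the bucket allows anonymous (unsigned) reads, as is common for Registry of Open Data buckets, and that `boto3` is installed. The helper names `parse_s3_arn` and `list_top_level` are hypothetical and introduced only for illustration.

```python
from typing import List, NamedTuple


class S3Location(NamedTuple):
    bucket: str
    prefix: str


def parse_s3_arn(arn: str) -> S3Location:
    """Split an S3 ARN such as 'arn:aws:s3:::bucket/prefix' into bucket and prefix."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    bucket, _, prefix = parts[5].partition("/")
    if not bucket:
        raise ValueError(f"S3 ARN has no bucket name: {arn!r}")
    return S3Location(bucket, prefix)


def list_top_level(arn: str, region: str, max_keys: int = 20) -> List[str]:
    """List top-level prefixes and keys under the ARN.

    Assumption: the bucket permits unsigned requests. Requires network access.
    """
    # Imported here so the ARN parsing above works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    loc = parse_s3_arn(arn)
    s3 = boto3.client(
        "s3",
        region_name=region,
        config=Config(signature_version=UNSIGNED),  # anonymous access
    )
    resp = s3.list_objects_v2(
        Bucket=loc.bucket, Prefix=loc.prefix, Delimiter="/", MaxKeys=max_keys
    )
    prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    keys = [o["Key"] for o in resp.get("Contents", [])]
    return prefixes + keys


# Example call (performs a network request):
# list_top_level("arn:aws:s3:::sorel-20m/", region="us-west-2")
```

Passing `Delimiter="/"` keeps the listing to one directory level, which matters for a bucket of this size; a full recursive listing would need pagination.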

Tags:

aws-pds, cyber security, deep learning, labeled, machine learning

Publication:

  1. publication:
    • Title: SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection
    • URL: https://arxiv.org/abs/2012.07634
    • AuthorName: Richard Harang and Ethan M Rudd

About

sorel-20m-dataset originates from the Registry of Open Data on AWS.
