Sophos/ReversingLabs 20 Million Malware Detection Dataset for Machine Learning

Install DagsHub:

pip install dagshub

To stream this data directly from DagsHub:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/sorel-20m-dataset")

fs.listdir("s3://sorel-20m/")
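
As a minimal sketch of how such a listing might be used, the helper below filters directory entries by file extension. The name `filter_entries`, the `SupportsListdir` protocol, and the `StubFilesystem` class are hypothetical and exist only so the example runs offline. The sketch assumes that `listdir` returns entry names as strings, and it makes no claim about the actual contents of the bucket.

```python
from typing import Iterable, List, Protocol


class SupportsListdir(Protocol):
    """Anything exposing listdir(path), such as the `fs` object above."""

    def listdir(self, path: str) -> Iterable[str]: ...


def filter_entries(fs: SupportsListdir, path: str, suffix: str = "") -> List[str]:
    """Return the sorted entry names under `path` that end with `suffix`."""
    return sorted(name for name in fs.listdir(path) if name.endswith(suffix))


class StubFilesystem:
    """Offline stand-in for a streaming filesystem (hypothetical contents)."""

    def __init__(self, tree):
        self._tree = tree

    def listdir(self, path):
        return list(self._tree[path])


# Hypothetical entries, not the actual bucket listing.
stub = StubFilesystem({"data/": ["b.txt", "a.db", "c.txt"]})
txt_files = filter_entries(stub, "data/", suffix=".txt")
print(txt_files)
```

With a real connection, the same helper could presumably be called with the `fs` object and the bucket path from the snippet above in place of the stub.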

Description

A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable (PE) files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods and have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the samples.

Additional information

Update frequency

At most annually

Managed by

Sophos AI

License

See the Terms of Use

Related datasets

New York City Taxi and Limousine Commission (TLC) Trip Record Data

Open Observatory of Network Interference (OONI)

SUCHO Ukrainian Cultural Heritage Web Archives

The MIT Supercloud Dataset
