SIEVE: Cybersecurity Log Dataset Collection for SIEM Event Classification

SIEVE (SIem Ingesting EVEnts) is a collection of six synthetic log datasets designed for training machine learning models on the log classification tasks typically performed by SIEMs. It was created with the SPICE (Semantic Perturbation and Instantiation for Content Enrichment) technique and addresses the shortage, in the available literature, of datasets containing diverse and labeled security events. SIEVE contains multiple instantiations with varying levels of synthetic perturbation, which makes it well suited to training NLP classification models that categorize security events produced by different systems and applications. The datasets were constructed from publicly available log samples and transformed with our text enrichment methodology to produce realistic, diverse log entries that retain the semantic characteristics of authentic security logs.
For a detailed description of the dataset, refer to:

P. Artioli, G. Pellegrini, A. Magrì, V. Dentamaro, S. Galantucci, G. Semeraro: “SIEVE: Generating a Cybersecurity Log Dataset Collection for SIEM Event Classification,” Computer Networks [LINK TO DOI].

Data set format

The datasets are in CSV format to facilitate immediate use in machine learning pipelines with the following header columns:

  • category: the categorization field that captures the action taken as it was described by the source (e.g., authentication-success, http-request-success, process-started, user-deletion)
  • log: the raw log entry
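
As a minimal sketch of reading this two-column layout, the snippet below parses the CSV with Python's standard csv module and checks that both header columns are present. The sample rows, the SAMPLE_CSV name, and the load_sieve_rows helper are hypothetical and only illustrate the format; they are not taken from the dataset. The sketch also assumes that log entries containing commas are quoted in the usual CSV way.

```python
import csv
import io

# Hypothetical sample rows (not from SIEVE); only the header names
# `category` and `log` come from the dataset description.
SAMPLE_CSV = '''category,log
authentication-success,"Accepted password for alice from 10.0.0.5 port 22 ssh2"
process-started,"Started Session 42 of user bob."
'''


def load_sieve_rows(handle):
    """Return a list of (category, log) tuples read from an open CSV handle."""
    reader = csv.DictReader(handle)
    # Fail early if the file does not expose the two documented columns.
    missing = {"category", "log"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return [(row["category"], row["log"]) for row in reader]


rows = load_sieve_rows(io.StringIO(SAMPLE_CSV))
for category, log in rows:
    print(category, "|", log)
```

For a file on disk, the same helper can be called on a handle opened with open(path, newline="", encoding="utf-8"), where the path and the encoding are assumptions to adapt to the files you receive.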

Data set classes

The datasets include 30 balanced event classes, manually assigned by a panel of cybersecurity experts following the Elastic Common Schema event categorization guidelines. To reach consensus and resolve conflicts, the experts performed two rounds of blind re-evaluation on a random sample of 20 percent of the patterns, resulting in a Krippendorff's alpha of 0.82 (substantial agreement).
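
One way to sanity-check the class balance after loading a dataset is to count the labels in the category column and compare the largest and smallest class sizes. The sketch below is an illustration only: the class_balance helper and the toy labels are hypothetical, and the ratio is just one simple balance indicator, not the procedure used by the authors.

```python
from collections import Counter


def class_balance(labels):
    """Return (per-class counts, ratio of largest to smallest class size)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels supplied")
    # A ratio of 1.0 means every class has the same number of examples.
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical toy labels standing in for the `category` column.
toy_labels = ["authentication-success", "process-started"] * 3
counts, ratio = class_balance(toy_labels)
print(len(counts), "classes, largest/smallest ratio:", ratio)
```

On a real SIEVE file, len(counts) can be compared with the 30 classes stated above, and a ratio close to 1.0 would be consistent with the classes being balanced.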

Data set request procedure

To access the SIEVE dataset, send an email request to sieve.requests@bvtech.com containing:

  • Your name and contact information
  • Your affiliation (university, research institution, or company)
  • A brief description of the intended use of the dataset
  • A confirmation that you will cite the SIEVE dataset in any resulting publications or applications as follows:

P. Artioli, G. Pellegrini, A. Magrì, V. Dentamaro, S. Galantucci, G. Semeraro: “SIEVE: Generating a Cybersecurity Log Dataset Collection for SIEM Event Classification,” Computer Networks [LINK TO DOI].

Project funded by the European Regional Development Fund, POR Puglia 2014-2020, Axis I, Specific Objective 1a, Action 1.1 (R&D), with the support of the University of Bari and the Massachusetts Institute of Technology (MIT).
