A data model and file format to represent and store high frequency energy monitoring and disaggregation datasets | Scientific Reports –

Data model

The data model that supports EMD-DF and EMD-DF64 is comprised of several data entities that should be present in a dataset to make it suitable for NILM research. Figure 1 shows an illustration of the proposed data model using the Unified Modeling Language (UML) (see notation. Overall, there are three main data entities: (1) consumption, (2) ground-truth, and (3) annotations. These are described next.

Figure 1
figure 1

EMD-DF64: Data model overview.

The consumption entity represents all the data elements that refer to energy consumption time-series. Consumption data can be of two different types: (i) raw waveforms, i.e., current and voltage; or (ii) processed waveforms, i.e., different metrics like real and reactive power. Furthermore, consumption data can represent both aggregated and disaggregated consumption. Finally, the latter can represent the consumption of individual appliances (e.g., a kettle) or an individual circuit (e.g., kitchen outlets that may or not contain the kettle). It is important to remark that all the entities that refer to consumption data are optional (cardinality of 0 or more) to cover as many variations of NILM datasets as possible. For example, BLUED contains only aggregated consumption data, while SustDataED13 contains aggregated and individual appliance consumption. Likewise, RAE23 contains data for aggregated, individual appliances and individual circuit consumption. Finally, it is also possible to find datasets that do not contain any form of aggregated consumption, e.g., PLAID24 , Tracebase25 and GREEND26.

On the right-hand side of Fig. 1 is the ground-truth entity. This entity is mandatory on a NILM dataset, and can be of four different types: (i) individual appliance consumption, (ii) individual circuit consumption, (iii) appliance activity, and (iv) user activity. Individual appliance and individual circuit consumption are a special type of consumption data used to train, test, and validate event-less approaches. On the other hand, appliance activities provide information about the power events that exist on a dataset and are required to train, validate, and test event-based approaches.

We have also introduced the concept of user activities, which is straightforward terms refer to actions that people perform involving the use of electric appliances, e.g., doing the laundry (involves clothes washer, clothes dryer, and iron) or preparing a meal (oven, stove, microwave, aid choppers, and blenders). Such user activities are important to evaluate different NILM application domains, e.g., Non-Intrusive Activity Detection (NIAD)27, and the detection of abnormal consumption behaviors28. It is important to note that one individual appliance activity can only be associated with one user activity (cardinality of 0 or 1). Otherwise, the total consumption of the user activities will be larger than the total consumption of the individual appliances, thus introducing inconsistency to the data model.

Lastly, we have the annotations entity. There are three types of optional annotations: (1) RIFF Meta, which are the default metadata chunks defined by RIFF, (2) comments, consisting of free text, and 3) user-defined metadata. The latter can be either which can be either local metadata, which refers to specific samples in the consumption data, or custom metadata annotations defined by the dataset creator and can serve multiple purposes. For example, in the current implementation, it is possible to provide a list with details for individual appliances and user activities and embed the annotations from the NILM metadata project.

Data structure

In EMD-DF64, the different data entities are represented by extending the 64-bit Sony WAVE64 (SW64) file format. SW64 is an application of the RIFF in which the file contents are grouped and stored in separate chunks. Each chunk consists of three components: (1) chunk identifier—128-bit globally unique identifier (GUID), e.g., “fmt” and “data”; (2) an unsigned, little-endian 64-bit integer representing the length of the chunk; (3) the chunk data. Finally, like all other RIFF-based formats, if a chunk’s data size is not even, it is padded by 1 byte to make it so.

A WAVE64 file is composed of several chunks, four of which are mandatory. Furthermore, in a correctly formatted WAVE64 file, the first four bytes (GUID) must always spell out “riff” (lower-case). Some of the W64 chunks are briefly described in Table 1, with a particular focus on those reused in the EMD-DF data model. For more details about the RIFF standard, please refer to the original project documentation in29.

Table 1 List of chunks that compose the RIFF-WAVE64 file format.

The EMD-DF64 file format is based on 20 chunks. One directly inherited from the RIFF standard (Info), nine from the WAVE64 format (Format, Data, Cue, List, Associated Data List, Label, Note and Labeled Text Chunk), and the remaining 10 are custom chunks. The chunk structure of EMD-DF64 is illustrated in Fig. 2. Table 2 provides a description of the 10 custom chunks.

Figure 2
figure 2

Chunk structure of the EMD-DF 64 file format.

Table 2 List of EMD-DF64 specific chunks.

EMD-DF64 file format definition

Next, we present how the different chunks that compose the EMD-DF64 data structure are combined to create a dataset. We first describe how the data format is defined in the Format and Config chunks. Then we show how the power measurements are stored and supplemented with the different embedded annotations.

Waveform data format

The waveform data (i.e., consumption data) must be defined in the Format chunk. This is inherited from the W64 format and consists of the following fields: (i) sample size in bits (8, 16, 24, 32 or 64 bits); and (ii) number of individual channels (greater or equal to 1).

Additionally, all the sub-chunks defined in the Config list chunk are mandatory. More precisely: (i) timezone (the time zone of the location where the data was collected), (ii) timestamp (the Unix timestamp of the first sample in the waveform data), (iii) sampling rate (the number of samples per second in the waveform data), and (iv) calibration constants (zero or one for each waveform channel). The calibration constant chunks are associated to each channel in ascending order. For the model to be valid, the number of calibration chunks must be zero (i.e., no calibration is needed) or equal to the number of individual channels.

Consumption data

The consumption data are represented using the Data, Missing Data List, and Missing Data chunks. The Missing Data List wraps sequences of Missing Data chunks to support datasets with missing data.

Waveform data are stored uncompressed in Data chunks. If only one metric needs to be represented (this is the case in most individual appliance and circuit ground-truth data), the samples are stored consecutively; otherwise the samples are stored interleaved. Each sample S is represented by a value between − 1 and 1. Samples are stored in little-endian format (i.e., the least significant byte is stored first). The bits that represent the sample amplitude are stored in the most significant bits of S, and the remaining bits are set to zero.

Handling missing data

Intervals with missing data are represented using the Missing Data chunk. Each of these chunks contains a JSON string with information about the timestamp when data is again available, and the number of the sample where this happens.

figure a
Timestamp conversion

Since we do not store the timestamp of each individual sample, this has to be calculated in run-time. This is done using Eq. (1), which returns a Unix timestamp in milliseconds:

$$\begin{aligned} unix\_timestamp = 1000 \times \frac{current\_sample – initial\_sample}{f} + initial\_unix\_timestamp \end{aligned}$$


Where current_sample is the position of the sample of interest, initial_sample is the position of the first sample (if there are no missing data chunks the first sample is 1, otherwise it is the initial sample of the corresponding missing data chunk). initial_unix_timestamp is the unix timestamp of the first sample (if there are no missing data it is given by the Timezone chunk, otherwise it is calculated from the corresponding missing data chunk). Finally, f is the sampling rate of the waveform data.

Conversely, it is possible to convert a unix timestamp to a sample position. This is done using Eq. (2):

$$\begin{aligned} position = \frac{actual\_timestamp – initial\_timestamp}{\frac{1}{f} \times 1000} \end{aligned}$$


Where actual_timestamp is the timestamp in milliseconds to be mapped to an audio position, initial_timestamp is the timestamp in milliseconds of the first sample in the dataset, and f is the sampling rate of the waveform data.

Ground-truth: individual appliance activity

Individual appliance activities correspond to the changes in the power consumption that are triggered by different appliance turning ON, OFF, or changing their working mode (e.g., low to high).

Individual appliance activity (i.e., power events), are embedded in the file using the Cue, Associated Data List and Label chunks. This is done as follows: (1) For each power event, an entry is added in the Cue chunk, (2) for each Cue chunk entry, a Label chunk is created and added to the Associated Data List chunk.

Each label chunk consists of a sample position in the waveform data and a JSON formatted string with the details of the respective activity. The sample position is calculated from the power event timestamp using Eq. (2): For example, the JSON in Listing 2 corresponds to a refrigerator activity that was mapped to position 19394633:

figure b

Ground-truth: user activity

User activities refer to actions that are performed involving the use of electric appliances, e.g., doing the laundry (involves clothes washer, clothes dryer and iron) or preparing a meal (oven, stove, microwave, aid choppers, blenders, etc.).

User activities are supplemented in the consumption data using the Cue, Associated Data List and Labeled Text chunks. This is done as follows: (1) For each user activity, an entry is added in the Cue chunk, (2) for each Cue chunk entry, a Labeled text chunk is created and added to the Associated Data List chunk.

Each labeled text chunk consists of a sample position in the waveform data, a duration in samples, and a JSON formatted string with the details of the respective activity. Listing 3 shows a JSON representing the “working on the computer” activity that involves using the desktop computer (App_ID: 1101), one monitor (App_ID: 1109) and a printer (App_ID: 1203). The duration in samples is obtained by subtracting the start from the end position of the activity. The sample positions are calculated from the timestamps using Eq. (2).

figure c

RIFF meta

Since EMD-DF64 is a direct application of the RIFF standard, it fully supports all the default RIFF metadata sub-chunks that are defined in the Info chunk. These are used to supplement general annotations, and include among others, Creator, Commissioner, Copyright, and Keywords. The RIFF metadata chunks supported in EMD-DF64 are listed in Table 3.

Table 3 List of RIFF metadata chunks supported by EMD-DF64.

Local metadata

Local metadata annotations can be used for instance to supplement datasets with details like the instant when a new appliance is added or removed from the electric circuit. Local metadata are used to supplement specific consumption samples with custom annotations. For instance, when a new appliance is added or removed from the electric circuit as shown in JSON Listing 4.

A note chunk consists of a sample position and a JSON formatted string. Notes are created using the Cue, Additional Data List and Note chunks as follows: (1) For each local annotation, an entry is added to the Cue chunk, (2) for each entry in the Cue chunk, a Note chunk is added to the Associated Data List chunk.

figure d

General metadata

These custom chunks can be used to enrich datasets with custom metadata. They are added using the Annotation List and Metadata chunks.

The content of such chunks do not follow any specific rule, yet it must be encoded in JSON and always include the ID and Label fields. EMD-DF64 fully supports three different custom metadata types: (i) appliances; (ii) user activities; and (iii) NILM metadata project annotations.

Appliances metadata

Keeps a list of the appliances that co-exist in the dataset, including the appliance characteristics like brand, model, energy consumption and energy efficiency rating. Listing 5 shows a possible JSON representation of the appliances metadata chunk.

figure e
User activities metadata

Keeps a list of the user activities that are present in the dataset, including a list of the appliances that can be associated with each activity. An example is provided in Listing 6.

figure f
NILM metadata

It is also possible to supplement datasets with annotations from the NILM metadata project. To this end we have defined the NILM Metadata Project annotation that can be used to embed the content of the different YAML files that compose the NILM Metadata project, in a metadata chunk. An example annotation is provided in Listing 7.

figure g


Custom comment chunks consist of free form text and are created using the Annotation List and Comment chunks. These can be used to add any kind of comments, for example, add a comment containing the historic of previous performance evaluations results on that particular file or dataset. Another example would be, adding a comment regarding some external event that could have affected the data.

JSON schemas

Since most of the annotation data will be done using the JSON format we have decided to use JSON schemas (see to describe the JSON data elements presented in the previous sub-section. Listing 8 shows a snippet of the JSON-Schema for the appliance activity labels.

figure h


This website uses cookies. By continuing to use this site, you accept our use of cookies.