Data Quality model extension #174

Open · hylkevds opened this issue Jan 25, 2024 · 15 comments
Labels
data model · sensing v2.0 (This change should be discussed for v2.0 of the sensing document.)

Comments

@hylkevds
Contributor

Since no one really knows what to do with the resultQuality DQ_Element property on Observation, we need something better.
Some proposals are currently being worked on.

@hylkevds added the sensing v2.0 and data model labels Jan 25, 2024
@securedimensions
Collaborator

One proposal / idea results from the work in the EU CitiObs project. It is understood that using the DQ_Element or the Observation parameters puts a heavy burden on the database when filtering Observations WHERE data-quality equals xyz.

One possible way to improve filtering performance is to introduce a separate DataQuality entity that keeps the relevant information. Any observation matching the details expressed in a DataQuality instance is then simply linked to it. This allows filtering on the DataQuality entities and then fetching the associated observations.
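To make the filtering argument concrete, here is a minimal sketch in Python with the requests library; the service root, the DataQuality entity, and the accuracy property used in the filter are assumptions of this proposal, not standardized names:

```python
# Illustrative only: hypothetical service root, proposed DataQuality entity and
# an assumed "accuracy" property; none of these names are standardized yet.
import requests

BASE = "https://example.org/sta/v1.1"

# Filtering on a per-Observation JSON field makes the server inspect every Observation:
slow = requests.get(
    f"{BASE}/Observations",
    params={"$filter": "parameters/accuracy lt 0.5"},
)

# With a separate DataQuality entity, the filter runs over the few DataQuality
# entities and the matching Observations are fetched through the link:
fast = requests.get(
    f"{BASE}/DataQualities",
    params={"$filter": "accuracy lt 0.5", "$expand": "Observations"},
)
print(fast.status_code)
```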

The other aspect that was identified is that the DataQuality of a sensor may depend on time. So, the extension proposal for the data model includes another entity, HistoricalDataQuality. This allows producing a time series of data quality.

The following figure illustrates the proposal. The details of the DataQuality entity and the naming used are potentially misleading. But, because we want to contribute to a standardized solution and stimulate a good discussion, we share the following diagram:

Data Quality Extension Diagram

The DataQuality entity is linked to Sensor, allowing the expression of information that applies to observations created by the sensor. The DataQuality entity also links to Observation and the STAplus ObservationGroup, which allows separating data quality for single observations from data quality for a collection of observations. To express the time series of data quality related to a sensor, the HistoricalDataQuality is linked to Sensor.

The diagram and all illustrations above represent the current snapshot of our thinking. It will for sure evolve over the next months...

@humaidkidwai
Collaborator

It's interesting to see DataQuality linked to Sensor; it makes it easier to manage the maintenance of faulty Sensors.

Would it not be more appropriate to have the DataQuality linked to Datastream and store the DataQuality attribute in Datastream?

  • One possible reason to do so would be to avoid redundancy, since all Observations in a Datastream are likely to share certain attributes such as accuracy, precision, etc.
  • Another reason is that the quality of the data would be more viable to judge at an aggregate level rather than per individual Observation, some of which might be outliers.

The attributes for the DQ entity do require more information (see the sketch below).
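For illustration only (the property names here are assumptions), a Datastream-level DataQuality carrying the shared attributes might look like this:

```python
# Hypothetical payload: a DataQuality record attached to a Datastream, so that
# attributes shared by all its Observations are stored once.
datastream_quality = {
    "Datastream": {"@iot.id": 12},
    "accuracy": 0.3,       # shared by all Observations of the Datastream
    "precision": 0.1,
    "completeness": 0.98,  # more meaningful at the aggregate level
}
```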

@schaefed

schaefed commented Feb 7, 2024

The following proposal to represent quality information in STA 2.0 is focused on the requirements of dedicated tools for anomaly detection, like, for example, saqc.

Without going into too much detail here, the basic functionality of such tools can be described as follows: software for anomaly detection can usually be understood as a collection of algorithms, implemented as functions or methods, each tailored towards a certain kind of anomaly (e.g., algorithms to detect outliers, scatter, constant values, ...). In the following, those functions/methods will be called 'quality measures'. Applied to a given datastream, those tools output some kind of quality information, in the following called 'quality features' and often also called 'flags'.

When used, these tools are configured to apply certain quality measures to an input datastream. The measures themselves are usually parameterized towards the characteristics of the input data, and in most cases several quality measures will be used to detect different kinds of anomalies within the same datastream. The output will be another datastream of quality features, usually, but not necessarily, of the same length as the input data.

The exact representation of the quality features is heavily dependent on the use case, and to my knowledge no widely accepted standards exist (although I'd love to be proven wrong on this). In practice, I have seen quite different quality feature schemes (i.e., definitions of quality features and their interrelations), ranging from a set of concrete labels (like "OK", "SUSPICIOUS", "BAD") over integers within given bounds (1-10, 0-255) to 'continuous' representations as real numbers, e.g. in the interval [0, 1]. The main point here is that basically everything could be a quality feature and therefore part of a quality feature scheme.
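A minimal, purely illustrative sketch of this idea in Python (the function names, the categorical scheme, and the example data are assumptions, not saqc's actual API):

```python
# Two toy "quality measures" producing categorical "quality features" (flags).
from typing import Sequence

QUALITY_SCHEME = ["OK", "SUSPICIOUS", "BAD"]  # one possible categorical feature scheme

def flag_range(values: Sequence[float], lower: float, upper: float) -> list[str]:
    """Quality measure: flag values outside [lower, upper] as BAD."""
    return ["OK" if lower <= v <= upper else "BAD" for v in values]

def flag_constant(values: Sequence[float], window: int = 3) -> list[str]:
    """Quality measure: flag positions where the last `window` values are identical."""
    flags = ["OK"] * len(values)
    for i in range(window - 1, len(values)):
        if len(set(values[i - window + 1 : i + 1])) == 1:
            flags[i] = "SUSPICIOUS"
    return flags

# Several measures applied to the same datastream yield several feature streams:
data = [1.0, 2.0, 2.0, 2.0, 99.0]
print(flag_range(data, lower=0.0, upper=10.0))  # ['OK', 'OK', 'OK', 'OK', 'BAD']
print(flag_constant(data, window=3))            # ['OK', 'OK', 'OK', 'SUSPICIOUS', 'OK']
```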

The representation of quality features, especially if this representation should be traceable or even reproducible, needs to fulfill several requirements:

  • It should be flexible enough to allow quality features of any data type.
  • It needs a way to provide meta information about the quality measure (i.e., what kind of algorithm was applied, a reference to the method description and/or implementation).
  • It needs to store the parameters used to adjust a quality measure to a given data stream.
  • It should have the possibility to 'bundle' several quality measures into a larger setup.

The following diagram shows a possible realization of those requirements as an extension of the STA 1.1 data model:

STA-QC(5) drawio

This scheme allows describing different QualityMeasures and bundling them, together with concrete arguments, into a QualityScheme, which is related to the actual QualityFeatures. A QualityFeature is associated with either an Observation or a Datastream (to also provide aggregated quality information). The QualityFeature itself is specified by the QualityScheme, which at least defines its data type.
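For illustration, hypothetical payloads for these entities could look as follows; every name here belongs to the proposal under discussion, not to STA 1.1:

```python
# Hypothetical JSON payloads for the proposed entities.
quality_measure = {
    "name": "rangeCheck",
    "description": "Flags values outside a configured interval",
    "definition": "https://example.org/qc/rangeCheck",  # reference to the method description
}

quality_scheme = {
    "name": "air-temperature-default",
    "resultType": "category",  # the data type of the resulting features
    "measures": [
        # quality measures bundled with the concrete arguments used for this datastream
        {"measure": quality_measure, "arguments": {"lower": -40.0, "upper": 60.0}},
    ],
}

quality_feature = {
    "result": "BAD",                  # the actual quality feature / flag
    "QualityScheme": quality_scheme,  # specifies the feature, incl. its data type
    "Observation": {"@iot.id": 42},   # or a Datastream, for aggregated information
}
```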

@securedimensions
Collaborator

First approach to a consolidated data model

STA-DataQuality drawio xml
STA version 1.1

STAPlus-DataQuality drawio xml
STAplus version 1.1

@schaefed

schaefed commented Mar 25, 2024

The latest version of the consolidated model:

STA-DataQuality drawio(5)(2) drawio(3)

STA-DataQuality drawio(5)(2)(3) drawio

@hylkevds
Contributor Author

For V2 we're renaming the parameters field to properties, so that should probably also be used here.
There seems to be a typo in the diagram: HistoricalQualitylFeature.

@hylkevds
Contributor Author

hylkevds commented Apr 13, 2024

Question about the cardinalities:
HistoricalQualityFeature can have multiple Sensors and multiple Observations. What does this mean?
I would expect a HistoricalQualityFeature to be the QualityFeature(s) a single Sensor or Observation had at a given point in time, so I would expect the HistoricalQualityFeature to have 1 Sensor or 1 Observation.

Also, how many QualityFeatures can an Observation have? The latest diagram has 0..1 but the original had 0..*.

@hylkevds
Contributor Author

We may get questions about the term QualityFeature, and its overlap with the existing Feature class. Are there alternative names that could be considered?

@nbrinckm

Regarding the term: How about „QualityAnnotation“?

@schaefed

I would expect a HistoricalQualityFeature to be the QualityFeature(s) a single Sensor or Observation had at a given point in time, so I would expect the HistoricalQualityFeature to have 1 Sensor or 1 Observation.

If we allow 'reusing' a QualityAnnotation for several Observations/Sensors (which at least makes sense for categorical annotations), don't we then need the possibility to 'reuse' HistoricalQualityAnnotations as well? My use case would be a Datastream whose Observations received two different QualityAnnotations, 'BAD' and 'OK'. At one point in time, we generate a new set of QualityAnnotations for the same datastream and move all existing annotations to their historical counterpart. In order to realize such an example we need the given cardinalities, don't we?

Also, how many QualityFeatures can an Observation have? The latest diagram has 0..1 but the original had 0..*.

I changed the cardinality from 0..* to 0..1 because of the introduction of HistoricalQualityAnnotation. The idea behind the change was that there should only be one valid piece of quality information for an Observation/Sensor at any given point in time. But actually I am not sure anymore, as there might be compound QualityAnnotations as well (think of uncertainty ranges, for example, and distribution-like quality representations). Currently I think it would make sense to keep the 0..1 cardinality and allow for compound data types in QualityAnnotation.result, but I am actually unsure if this makes sense. Any thoughts on this?

@nbrinckm

nbrinckm commented Jun 7, 2024

Stupid question from my side:

With regard to the new entity types that we want to add here: how do they relate to the Data Quality Vocabulary?
Can it be used, or are there major problems?

I'm not really into the topic; I got the hint from a colleague. If this is already part of your ideas, then I'm completely fine with this.

@hylkevds
Contributor Author

hylkevds commented Jun 7, 2024

I would expect a HistoricalQualityFeature to be the QualityFeature(s) a single Sensor or Observation had at a given point in time, so I would expect the HistoricalQualityFeature to have 1 Sensor or 1 Observation.

If we allow 'reusing' a QualityAnnotation for several Observations/Sensors (which at least makes sense for categorical annotations), don't we then need the possibility to 'reuse' HistoricalQualityAnnotations as well?
My use case would be a Datastream whose Observations received two different QualityAnnotations, 'BAD' and 'OK'. At one point in time, we generate a new set of QualityAnnotations for the same datastream and move all existing annotations to their historical counterpart. In order to realize such an example we need the given cardinalities, don't we?

You mean you have a number of Observations in a Datastream that are all linked to the QualityAnnotations 'BAD' and 'OK'? And then you change the QualityAnnotations for all those Observations to something else?
In one model we'd get a HistoricalQualityAnnotation per Observation / Sensor, in the other only one.

Both models would work. The question is which one is easier to implement for the client: which, and how many, POST/PATCH requests are needed, and what can be automated. If one changes each Observation to link to new QualityAnnotations, and that automatically causes the creation of a new HistoricalQualityAnnotation with the current time, then we would end up with one-per, since each PATCH would have a different time. If one were to create a new HistoricalQualityAnnotation with a fixed time and all Observations linked, the server could check for each Observation whether this is the newest HistoricalQualityAnnotation, and automatically re-link each Observation to the correct QualityAnnotation. In that case the many-per would work... (both variants are sketched below)
For HistoricalLocations both methods work.

If there is no automation in the background, either model would work, but that may be complex for the client...
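A rough sketch of the two workflows in Python with the requests library, against a hypothetical service root and with entity and navigation-link names that are still under discussion:

```python
# Illustrative only: hypothetical service root and proposed entity names.
import requests

BASE = "https://example.org/sta/v1.1"

# Variant 1: re-link each Observation to the new QualityAnnotation individually.
# The server is assumed to create one HistoricalQualityAnnotation per PATCH,
# each with the current time (the one-per model).
for obs_id in (101, 102, 103):
    requests.patch(
        f"{BASE}/Observations({obs_id})",
        json={"QualityAnnotation": {"@iot.id": 7}},
    )

# Variant 2: create one HistoricalQualityAnnotation with a fixed time and all
# Observations linked; the server is assumed to re-link each Observation to the
# new QualityAnnotation itself (the many-per model).
requests.post(
    f"{BASE}/HistoricalQualityAnnotations",
    json={
        "time": "2024-06-07T12:00:00Z",
        "QualityAnnotation": {"@iot.id": 7},
        "Observations": [{"@iot.id": 101}, {"@iot.id": 102}, {"@iot.id": 103}],
    },
)
```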

Also, how many QualityFeatures can an Observation have? The latest diagram has 0..1 but the original had 0..*.

I changed the cardinality from 0..* to 0..1 because of the introduction of HistoricalQualityAnnotation. The idea behind the change was that there should only be one valid piece of quality information for an Observation/Sensor at any given point in time. But actually I am not sure anymore, as there might be compound QualityAnnotations as well (think of uncertainty ranges, for example, and distribution-like quality representations). Currently I think it would make sense to keep the 0..1 cardinality and allow for compound data types in QualityAnnotation.result, but I am actually unsure if this makes sense. Any thoughts on this?

Compound data types are harder for the client to deal with. And depending on the combinations that are possible, there may be very many different QualityAnnotations... That would require some real-world testing to see which is better.

@schaefed

With regard to the new entity types that we want to add here: how do they relate to the Data Quality Vocabulary?

Well, actually not. I am not an expert on the Data Quality Vocabulary (DQV) either, but I read into it a bit and found that we have the following, more or less congruent, definitions:

  • QualityMeasure -> 'Quality Metric'
  • QualityAnnotation -> 'Quality Measurement'
  • QualitySetup -> 'Quality Annotation'

Furthermore, DQV defines another entity, the 'Quality Dimension', which acts as an abstract category for a number of 'Quality Metrics' that each cover a particular aspect of what it means for data to be considered "good" or "fit for purpose" (e.g., the 'Quality Dimension' Accuracy could cover 'Quality Metrics' that provide 'Quality Measurements' like error margins, absolute errors, z-scores, ...).

I am not sure that we need something like the 'Quality Dimension', and I don't see much benefit in introducing another higher-level abstraction. But if we want to be compatible with / translatable into the DQV, we likely should add the 'Quality Dimension' as well.

I like the terminology, however, and suggest the following renaming:

  • QualityMeasure -> QualityMetric
  • QualityAnnotation -> QualityMeasurement, or, as the term 'measurement' is not used in STA at all, maybe QualityObservation?
  • QualitySetup -> QualityAnnotation. I am not sure if I really like the term QualityAnnotation here, but I dislike QualitySetup even more...

@schaefed

You mean you have a number of Observations in a Datastream, that are all linked to the QualityAnnotations 'BAD' and 'OK'? And then you change the QualityAnnotations for all those Observations to something else?
In one model we'd get a HistoricalQualityAnnotation per Observation / Sensor, in the other only one.

Both models would work. The question is which one is easier to implement for the client.

Then I would suggest using the model that is easier to implement, and I leave the decision to you, @hylkevds.

Compound data types are harder for the client to deal with.

I guess for the moment we could restrict the available types to scalars and fixed-type arrays. If we stick with the QualityAnnotation as it currently is, then we likely need another QualityAnnotation field that allows us to distinguish the different annotations of a given QualityMeasure (e.g. if we model an uncertainty range with two different QualityAnnotations, one for the lower bound of the range and the other for the upper bound, how could we tell them apart?). Regarding usability alone, compound data types would be more convenient, I think.
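For illustration (the field names are assumptions, not part of the current proposal text), the two options for an uncertainty range could look like this:

```python
# (a) Two scalar QualityAnnotations plus a distinguishing field:
lower = {"result": 19.2, "role": "uncertaintyLowerBound", "QualityMeasure": {"@iot.id": 3}}
upper = {"result": 21.8, "role": "uncertaintyUpperBound", "QualityMeasure": {"@iot.id": 3}}

# (b) One compound QualityAnnotation with a fixed-type array as its result:
compound = {"result": [19.2, 21.8], "QualityMeasure": {"@iot.id": 3}}
```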

@schaefed

Also, how many QualityFeatures can an Observation have? The latest diagram has 0..1 but the original had 0..*.

No matter how we decide on compound data types, we should allow multiple QualityAnnotations for an Observation in order to support hierarchical quality information.
