Sphnc 197

Nicola Stoira requested to merge sphnc-197 into main

This merge request proposes an additional optional flag, anonymize, that can be configured in the scrambleField de-identification rule:

"scrambleField": {
    "defaultScrambling": {
        "applies_to_fields": ["sphn:SubjectPseudoIdentifier/id", "sphn:SubjectPseudoIdentifier/sphn:hasIdentifier", "sphn:AdministrativeCase/id", "sphn:AdministrativeCase/sphn:hasIdentifier", "sphn:Sample/id", "sphn:Sample/sphn:hasIdentifier", "sphn:hasSample/id", "sphn:hasSample/sphn:hasIdentifier", "sphn:hasAllergen/sphn:hasCode/id"],
        "anonymize": true
    }
}
  • anonymize: false: same logic as before.
  • anonymize: true: when a patient is processed, we set scrambling_map={} instead of loading the patient's previously scrambled values from de_identification_scrambling. All fields are therefore scrambled from scratch the first time they are encountered. The new scrambled values are still stored in the table de_identification_scrambling (the insert statement now has an ON CONFLICT DO UPDATE SET clause), so that the downloaded data carries the scrambled patient ID if the ID is de-identified. Processing the same patient repeatedly gives different results.
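
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the connector's code: the in-memory dictionary `stored` stands in for the table de_identification_scrambling, and the function names and the random-token scrambling are hypothetical.

```python
import secrets


def load_scrambling_map(stored, patient_id, anonymize):
    """Return the starting scrambling map for one patient (hypothetical helper)."""
    if anonymize:
        # anonymize: true -> ignore anything scrambled in earlier runs
        return {}
    # anonymize: false -> reuse previously scrambled values, as before
    return dict(stored.get(patient_id, {}))


def scramble(value, scrambling_map):
    """Scramble a value once per run; later encounters reuse the same result."""
    if value not in scrambling_map:
        # Assumption: a random token stands in for the real scrambling function.
        scrambling_map[value] = secrets.token_hex(8)
    return scrambling_map[value]


def process_patient(stored, patient_id, values, anonymize):
    """Scramble all values of one patient and persist the resulting map."""
    scrambling_map = load_scrambling_map(stored, patient_id, anonymize)
    scrambled = [scramble(v, scrambling_map) for v in values]
    # Stand-in for INSERT ... ON CONFLICT DO UPDATE SET: new values overwrite old ones.
    stored.setdefault(patient_id, {}).update(scrambling_map)
    return scrambled
```

With anonymize set to false, two runs over the same patient return the same scrambled values. With anonymize set to true, each run starts from an empty map, so the values differ between runs while the latest ones remain available for the download step.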

UPDATE: in order to "forget" about the ingested patient we apply the following additional steps:

  • We do not write any record to the de_identification history table if anonymize: true.
  • Table de_identification_scrambling has an additional column, anonymized, which is used later for cleanup and for extracting the scrambled patient ID.
  • If one of the scrambling de-id rules has anonymize: true, we extract the scrambled patient ID (if it has been scrambled) from the de_identification_scrambling table and update the internal patient ID in the connector. The patient file is written to the refined zone with the scrambled ID, and that ID is used in the tables refined_zone, graph_zone and release_zone.
  • When the pre-checks are run, we clean up the table de_identification_scrambling by deleting all records with anonymized=True.
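
The table-level steps above can be sketched as follows. This is a minimal illustration, not the connector's implementation: it uses Python's built-in sqlite3 module (with an SQLite version that supports upserts, 3.24 or later) in place of the project's database, and every column name except anonymized, as well as all function names, is an assumption made for the example.

```python
import sqlite3


def create_table(conn):
    # Hypothetical schema; only the table name and the anonymized column come from the MR.
    conn.execute(
        """CREATE TABLE de_identification_scrambling (
               patient_id TEXT NOT NULL,
               field TEXT NOT NULL,
               original_value TEXT NOT NULL,
               scrambled_value TEXT NOT NULL,
               anonymized INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (patient_id, field, original_value)
           )"""
    )


def upsert_scrambled(conn, patient_id, field, original, scrambled, anonymized):
    # Re-processing a patient overwrites the stored value instead of failing.
    conn.execute(
        """INSERT INTO de_identification_scrambling
               (patient_id, field, original_value, scrambled_value, anonymized)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT (patient_id, field, original_value)
           DO UPDATE SET scrambled_value = excluded.scrambled_value,
                         anonymized = excluded.anonymized""",
        (patient_id, field, original, scrambled, int(anonymized)),
    )


def scrambled_patient_id(conn, patient_id, id_field):
    # Return the scrambled ID if the patient ID was scrambled, else the original ID.
    row = conn.execute(
        """SELECT scrambled_value FROM de_identification_scrambling
           WHERE patient_id = ? AND field = ? AND original_value = ?
             AND anonymized = 1""",
        (patient_id, id_field, patient_id),
    ).fetchone()
    return row[0] if row else patient_id


def cleanup_anonymized(conn):
    # Pre-check step: forget every record written with anonymize: true.
    cur = conn.execute(
        "DELETE FROM de_identification_scrambling WHERE anonymized = 1"
    )
    return cur.rowcount
```

After cleanup_anonymized has run, nothing in the table links the original patient ID to the scrambled one, which is what allows the connector to "forget" the ingested patient.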

Warnings:

  • When a patient is re-ingested, the anonymized patients in the release, graph and refined zones are not removed, because we cannot tell that they are related to the re-ingested patient. A full project reset is needed to clean up the data.
  • Logs extracted for an anonymized patient are split: logs up to refined_zone are related to the original patient ID, while logs from the integration step onward are related to the anonymized ID.
Edited by Nicola Stoira