Team 4 - PEACH Data Anonymisation and Analytics Visualisation

Requirements Gathering

Requirements for this project were gathered mainly through interviews and meetings with our client, Dr Ramachandran. Unfortunately, other information-gathering techniques such as questionnaires and shadowing users were not possible due to difficulty scheduling time with users, the limited responses from questionnaires and privacy concerns.

Before the initial meeting with our client, we prepared a set of questions. The questions and the answers we received are recorded below.

What's the project about?

This project is about data analytics: setting up a system that lets medical institutions use advanced data science techniques to analyse large amounts of their data. There are currently no equivalent systems in place, but such a system could significantly change research. Another part of the project is data anonymisation: a piece of software which anonymises patient information but can also be used to generate novel data from the provided information.

What are the objectives of the project?

The main objectives of this project are to implement additions to the current analytics infrastructure and to develop the anonymisation software. The current analytics system uses technologies including Apache Kafka and Apache NiFi. Currently, the system receives and preprocesses data in NiFi before passing it on to Kafka, which is a messaging hub. For the analytics infrastructure, this project should aim to integrate technologies such as Kibana and Elasticsearch into the system. The other main objective is to develop the anonymisation software, which can be used to anonymise datasets so that they can be used in analytics systems.

Who are the users?

There are two main groups of users for the core analytics system: medical professionals and researchers who wish to access the analytics data through an interactive, graphical visualisation. The users of the anonymisation software will be doctors who have access to sensitive patient information. They will use it to remove sensitive information from the data they have access to before uploading the data to systems such as the core analytics.

Can we talk to the users?

Unfortunately, the team will not be able to talk with users due to privacy concerns as well as difficulty scheduling time with them. However, should potential users be interested in the project and wish to reach out to the team, Dr Ramachandran would pass on contact details.

How do you want to interact with the project (e.g. mobile app, web app)?

The visualisation tool should be available to users through the web, so it should be a web application. The anonymisation tool must support the many different operating systems and versions in use by the NHS today, but could be either a desktop application or a web application. However, sensitive information should never leave the user’s computer.

Are there comparative products on the web already?

There are some existing analytics suites and anonymisation software, but they are mostly commercial or very basic, so the project is about creating something which is free and suitable for medical institutions.

How can we store the data? Is the data sensitive? Access rules?

Data can be stored within the analytics system, but all data must be non-sensitive. This means that the anonymisation software must provide enough privacy that individuals cannot be identified from the output. The visualisation aspect of the analytics infrastructure should have access rules that only allow authenticated users onto the system.

Specific hosting required?

Since Dr Ramachandran has an Azure account with a large number of credits, the team should use Microsoft Azure to host the analytics infrastructure.

How to test?

The analytics system should be tested using fake data, possibly from open-source datasets, since it would not be possible to use real data due to privacy issues. The same applies to the anonymisation tool, but should a potential user be interested in running the software, the team would be able to have it tested in a real environment with real data.

Background knowledge (medicine) that we need to understand?

Since the work on this project is purely technical, no medical knowledge is needed to understand it.

Who will own the intellectual rights to the finished project?

This project will be owned by PEACH, and it will be up to the group to decide how to distribute the results of the project.

After the initial meeting, we analysed the information we had gathered and created an initial set of requirements in the MoSCoW style. A core part of our project involved extending the existing core analytics system. Alongside reading the available documentation to understand the system's infrastructure, we created personae, storyboards and use cases to understand the needs of users from their perspective. Examples of these are provided further down.

The requirements were later refined once we had researched competing solutions, gained a better understanding of the project and its scope, and held further discussions with the client. Once the changes had been made, they were confirmed with Dr Ramachandran as the final version, and we began the next phase of the project.

Personae

Persona 1

James Brown is a very experienced doctor with a long history in cancer treatment who has recently read into some technical topics, including big data. Using this knowledge, he wishes to anonymise real cancer patient datasets and visualise the data in a clear and effective way. He wants to interact with graphs and have the system analyse and process the data for him so that it shows trends and patterns which could help with cancer treatment.

Persona 2

Mary Davis is a data scientist who is carrying out research into cancer patients. She knows that the NHS systems hold huge amounts of data, but she cannot access it easily. She wishes to access this data by asking medical professionals to anonymise datasets. She then aims to write some data mining tools and processes and to visualise the output of the processes carried out on the large amounts of data. By analysing and visualising the data, Mary hopes to benefit doctors by finding patterns in the information.

Storyboards

Storyboard 1

A doctor wishes to analyse how effective a new type of medication is, and wants to correctly advise their patients. The doctor has all this sensitive information to hand, but is unable to manually anonymise and filter through it as there is too much.

As a result, the anonymisation tool is used to automatically remove any identifying information from the data, and the result is then gathered within the PEACH core analytics system. The system combines this data with other sources to accurately and quickly analyse large amounts of data and display it in a clear fashion. Finally, the doctor is able to use this information to improve patients' conditions and get them up and running again!

Storyboard 2

Whilst visualising data, a medical professional is overwhelmed by too much information and too many graphs. The doctor wishes to see only the information they require, in simple graphs and displays. By using the interface provided by the core analytics system, the doctor is able to view and understand specific graphs, which allows accurate decisions to be made much more quickly.

Use Cases

Use Case Diagram

List of Use Cases

UC1 - Anonymise data

Description

A user wishes to anonymise sensitive medical data.

Primary Actor

Doctor/Consultant

System

Anonymiser System

Pre-conditions

User has opened the data anonymiser on their local machine and has switched to the anonymisation option.

Main Flow

User selects an input file of type csv, json or xml
System reads the input file
System displays the list of columns, the data type and the anonymisation option
User edits the settings for a single column or edits multiple column settings in bulk
System opens a new window where the user can change the settings for the selected column(s)
User switches between 3 data anonymisation types: leave as is, anonymise/randomise or remove
User changes the data type for the selected column: name, gender, age, date, address, postcode, town, country, default
For the selected data types, the user can select ranges
System updates these settings
User selects an output file in a chosen file format
User selects the ‘Anonymise’ button
System anonymises the data
System saves the output in the output file
System displays the information loss percentage
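
The main flow above can be illustrated with a minimal sketch. It assumes CSV input only and interprets the "ranges" setting as banding an exact age into a range. The names SETTINGS, generalise_age, anonymise_rows and anonymise_csv are hypothetical and are not taken from the actual anonymiser.

```python
import csv
import io

# Hypothetical per-column settings mirroring the three options in the use
# case: "keep" (leave as is), "anonymise" and "remove".
SETTINGS = {
    "name": {"action": "remove"},
    "age": {"action": "anonymise", "range": 10},
    "diagnosis": {"action": "keep"},
}


def generalise_age(value, width):
    """Replace an exact age with a band such as '30-39' (assumed behaviour)."""
    low = (int(value) // width) * width
    return f"{low}-{low + width - 1}"


def anonymise_rows(rows, settings):
    """Apply the chosen action to every column of every row."""
    output = []
    for row in rows:
        clean = {}
        for column, value in row.items():
            # Columns without a setting are left as is in this sketch.
            rule = settings.get(column, {"action": "keep"})
            if rule["action"] == "remove":
                continue  # drop the column entirely
            if rule["action"] == "anonymise":
                value = generalise_age(value, rule["range"])
            clean[column] = value
        output.append(clean)
    return output


def anonymise_csv(text, settings):
    """Read CSV text, anonymise it and return the result as CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cleaned = anonymise_rows(rows, settings)
    buffer = io.StringIO()
    fieldnames = list(cleaned[0].keys()) if cleaned else []
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(cleaned)
    return buffer.getvalue()


SAMPLE = "name,age,diagnosis\nAlice Smith,34,melanoma\nBob Jones,58,lymphoma\n"
print(anonymise_csv(SAMPLE, SETTINGS))
```

In this sketch only the age column is handled by the "anonymise" action. A real tool would need one strategy per data type listed in the use case (name, gender, date, postcode and so on).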

Post-Conditions

The system resets the input fields and settings.

Alternative Flow

None

UC2 - Generate data

Description

A user wishes to generate test data using an existing dataset.

Primary Actor

Doctor/Consultant

System

Anonymiser System

Pre-conditions

User has opened the data anonymiser on their local machine and has switched to the generation option.

Main Flow

User selects an input file of type csv, json or xml
System reads the input file
System displays the list of columns, the data type and the anonymisation option
User edits the settings for a single column or edits multiple column settings in bulk
System opens a new window where the user can change the settings for the selected column(s)
User switches between 3 data anonymisation types: leave as is, randomise or remove
User changes the data type for the selected column: name, gender, age, date, address, postcode, town, country, default
For the selected data types, the user can select ranges
System updates these settings
User selects an output file in a chosen file format
User selects the Generate button
System generates novel data
System saves the output in the output file
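
One simple way to picture "generating novel data" from an existing dataset is to resample each column independently from the values observed in the input. This is only an illustrative assumption, not the method used by the anonymiser, and it deliberately ignores correlations between columns. The names generate_rows and SOURCE are hypothetical.

```python
import random


def generate_rows(rows, count, seed=None):
    """Build `count` new rows by sampling each column independently
    from the values observed in the input rows."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    columns = list(rows[0].keys())
    # One pool of observed values per column.
    pools = {column: [row[column] for row in rows] for column in columns}
    return [
        {column: rng.choice(pools[column]) for column in columns}
        for _ in range(count)
    ]


SOURCE = [
    {"age": "30-39", "town": "London", "diagnosis": "melanoma"},
    {"age": "50-59", "town": "Leeds", "diagnosis": "lymphoma"},
    {"age": "40-49", "town": "Bristol", "diagnosis": "melanoma"},
]

for row in generate_rows(SOURCE, 5, seed=1):
    print(row)
```

Because columns are sampled separately, a generated row may combine values that never occurred together in the input. This helps produce novel records but also discards any relationships between columns.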

Post-Conditions

The system resets the input fields and settings.

Alternative Flow

None

UC3 - View raw data

Description

A user wishes to view raw data within the PEACH core analytics system through the visualiser.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Medical Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and frontend has successfully connected with the analytics infrastructure.

Main Flow

User logs into the web interface
User selects the dataset to view
System displays a list of entries
System displays keys and values for each entry
User can filter and sort the raw data

Post-Conditions

None

Alternative Flow

UC8 ERROR

UC4 - Visualise data

Description

A user wishes to visualise the data within the PEACH core analytics system in graphical displays. The graphs and displays have already been created.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and frontend has successfully connected with the analytics infrastructure.

Main Flow

User logs into the web interface
User selects to visualise the data
User selects the dataset to view
User selects from the available displays
User visualises the data in graphical displays

Post-Conditions

None

Alternative Flow

UC8 ERROR

UC5 - Interact with graphs

Description

A user wishes to visualise and interact with the data within the PEACH core analytics system.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface, logged in and opened up a visualisation of data. The frontend has successfully connected with the analytics infrastructure.

Main Flow

User can change the axis settings and intervals
User can alter the visible data displayed on the graphs
User can select points on the graph to display more information

Post-Conditions

None

Alternative Flow

UC8 ERROR

UC6 - Creating graphs

Description

A user wishes to create new visual displays within the PEACH core analytics system.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and frontend has successfully connected with the analytics infrastructure.

Main Flow

User logs into the web interface
User selects to create new visualisation
User selects the dataset to visualise
User selects graph type from available options
User sets intervals and axis settings
User creates new graphical display

Post-Conditions

None

Alternative Flow

UC8 ERROR

UC7 - Write queries

Description

A user wishes to write queries to gain further insight into the data.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and frontend has successfully connected with the analytics infrastructure.

Main Flow

User logs into the web interface
User enters query
System searches for data based on query
System displays output based on query
User can view and interact with the data

Post-Conditions

None

Alternative Flow

UC8 ERROR

UC8 - ERROR

Description

There is a system error with the core analytics or the visualisation web application.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface.

Main Flow

System detects an error
System displays the error message on screen

Post-Conditions

None

Alternative Flow

None

MoSCoW Requirements

Functional Requirements

Must Have

The data anonymisation tool must be able to anonymise input data provided by the user.
The anonymiser must anonymise the data locally for privacy reasons: any identifying information must never leave the user’s computer.
The anonymiser must follow data protection and information governance rules.
The data visualisation tool must have the ability to visualise data provided by the PEACH core analytics system.

Should Have

The anonymiser should have the ability to accept data in different formats, including csv, json and xml.
The data anonymisation tool should anonymise data based on user selection of columns which contain sensitive data.
The anonymiser should tell the user whether the data has reached a sufficient level of anonymity to be used without risk.
The anonymiser should output the anonymised data in the format requested by the user.
The anonymiser should be able to generate novel data based on the inputs.
The visualiser should have the ability to visualise data in different formats, as desired by the user.
The visualiser should have filters so that the user could easily access any data required.
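
The requirement that the anonymiser should report whether a certain level of anonymity has been reached could be checked in several ways. The sketch below uses k-anonymity as one possible measure. This is our assumption for illustration: the requirements do not name a specific metric, and the names smallest_group, is_k_anonymous and ROWS are hypothetical.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values in all quasi-identifier columns (0 for an empty dataset)."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """A dataset is k-anonymous if every combination of quasi-identifier
    values is shared by at least k rows."""
    return smallest_group(rows, quasi_identifiers) >= k


ROWS = [
    {"age": "30-39", "postcode": "NW1", "diagnosis": "melanoma"},
    {"age": "30-39", "postcode": "NW1", "diagnosis": "lymphoma"},
    {"age": "50-59", "postcode": "WC1", "diagnosis": "melanoma"},
]

# The third row is unique on (age, postcode), so the data is not 2-anonymous.
print(is_k_anonymous(ROWS, ["age", "postcode"], k=2))
```

Which columns count as quasi-identifiers, and what value of k is acceptable, would have to be decided by the user or by the applicable information governance rules.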

Could Have

The anonymiser could have the ability to output anonymised data to Apache Spark where the data could be analysed using the PEACH Core Analytics system.
Both the anonymiser and visualiser could have personalisation options to make the software easy to use for individual users.
The visualiser could have security measures in place to make sure that only authorised users can access the data.

Would Have

The anonymiser tool would use the IOTA blockchain to ensure data integrity.
The visualiser tool would provide an API for 3rd parties to access the data within the PEACH Core Analytics.

Non-functional Requirements

Must Have

The visualiser and anonymiser must be easy to use and intuitive so the user can use all the features of the system with only minimal training.
The visualiser and anonymiser must be supported by most operating systems (Windows, Linux, macOS).
The visualiser must be quick to load.

Should Have

The analytics infrastructure and visualisation system should be easily scalable.
The visualisation system should be reliable and highly available when deployed.
The visualisation system should operate automatically.

Could Have

The visualiser could have a high visibility mode for visually impaired users to use the software comfortably.
The visualiser could have a night mode to reduce eye strain when software is used in a dark environment.
