Project Background and Client

Thousands of people die from cancer every year in the UK. Although researchers work hard to find better and more effective treatments, millions of lives could be saved if research were conducted more quickly and on a larger scale. Currently, most NHS institutions do not have access to a modern data analytics suite that supports automated anonymisation and analytics on shared medical data. There is no suitable infrastructure for activities such as capturing, transforming or transferring data across different systems. Because no single platform is secure, open-source and easy to use, existing applications for healthcare services are not widely used.

Working with our client, Dr Navin Ramachandran, a consultant radiologist at University College London Hospitals (UCLH), we were tasked with developing a platform for medical professionals and researchers: one that can anonymise sensitive patient information, apply data science techniques and analytics, and then present the results in a customisable, interactive visualisation.

This project has been developed for PEACH, a large-scale, open-source community that aims to provide data science tools for medical professionals and researchers, aiding their diagnostic and analytics processes through big data, machine learning and data visualisation.

Project Goals

This project has two primary goals. Whilst they may seem unrelated, both give the core analytics system a way to capture and visualise data, and each can also be deployed as a standalone product.

The first is to develop a data anonymiser tool: a piece of software designed to anonymise sensitive patient information for use in research and statistical records. The application must be easily accessible to all medical professionals and researchers and must run on most platforms, since the user's operating system is not known in advance.

The second primary goal is to further develop PEACH’s core analytics system by providing a visualisation solution. This visualisation solution must allow users to view, customise and interact with data provided by the PEACH analytics system.

Requirements Gathering

Requirements for this project were gathered mainly through interviews and meetings with our client, Dr Ramachandran. Unfortunately, other information gathering techniques such as questionnaires and shadowing users were not possible due to difficulty scheduling time with users, the limited responses from questionnaires and privacy concerns.

Before the initial meeting with our client, we prepared a set of questions. The questions and their answers are recorded below.

  1. What's the project about?
  2. This project is all about data analytics: setting up a system that lets medical institutions use advanced data science techniques to analyse large amounts of their data. There are currently no equivalent systems in place, and this could really change research. Another part of the project is data anonymisation: a piece of software which anonymises patient information but can also be used to generate novel data from the provided information.

  3. What are the objectives of the project?
  4. The main objectives of this project are to implement additions to the current analytics infrastructure and to develop the anonymisation software. The current analytics system uses technologies including Apache Kafka and Apache NiFi: data is received and preprocessed in NiFi before being passed on to Kafka, which acts as a messaging hub. For the analytics infrastructure, this project should aim to integrate technologies such as Kibana and Elasticsearch into the system. The other main objective is to develop the anonymisation software, which can be used to anonymise datasets so that they can be used in analytics systems.

  5. Who are the users?
  6. There are two main groups of users for the core analytics system: medical professionals and researchers who wish to access the analytics data through an interactive, graphical visualisation. The users of the anonymisation software will be doctors who have access to sensitive patient information. They will use it to remove sensitive information from their data before uploading the data to systems such as the core analytics.

  7. Can we talk to the users?
  8. Unfortunately, the team will not be able to talk with users due to privacy concerns as well as difficulty scheduling time with them. However, should potential users be interested in the project and wish to reach out to the team, Dr Ramachandran would pass on contact details.

  9. How do you want to interact with the project (e.g. mobile app, web app)?
  10. The visualisation tool should be available to users through the web, so it should be a web application. The anonymisation tool must support the many different operating systems, and versions of them, in use by the NHS today, but could be either a desktop application or a web application. However, sensitive information should never leave the user’s computer.

  11. Are there comparative products on the web already?
  12. There are some existing analytics suites and anonymisation software, but they are mostly commercial or very basic, so the project is all about creating something which is free and suitable for medical institutions.

  13. How can we store the data? Is the data sensitive? Access rules?
  14. Data can be stored within the analytics system, but all data must be non-sensitive. This means that the anonymisation software must provide enough privacy that individuals cannot be identified from the output. The visualisation aspect of the analytics infrastructure should have access rules so that only authenticated users are allowed onto the system.

  15. Specific hosting required?
  16. Since Dr Ramachandran has a large number of credits on his Azure account, the team should use Microsoft Azure to host the analytics infrastructure.

  17. How to test?
  18. The analytics system should be tested using fake data, possibly from open-source datasets, since it would not be possible to use real data due to privacy issues. The same applies to the anonymisation tool, but should a potential user be interested in running the software, the team would be able to have it tested in a real environment with real data.

  19. Background knowledge (medicine) that we need to understand?
  20. Since the project is purely technical, no medical knowledge is needed to understand it.

  21. Who will own the intellectual rights to the finished project?
  22. This project will be owned by PEACH, and it will be up to the group to decide how to distribute the results of the project.

After the initial meeting, we analysed the information we had gathered and created an initial set of requirements in the MoSCoW style. A core part of our project involved extending the existing core analytics system, so alongside reading the available documentation to understand the system's infrastructure, we created personae, storyboards and use cases to recognise the needs of the user from their perspective. Examples of these are provided further down.

The requirements were later refined once we had researched competing solutions, gained a better understanding of the project and its scope, and discussed them further with the client. Once the changes had been made, Dr Ramachandran confirmed them as the final version, and we began the next phase of the project.

Personae

Persona 1

James Brown is a very experienced doctor with a long history in cancer treatment who has recently read into some technical topics, including big data. Using this knowledge, he wishes to anonymise real cancer patient datasets and visualise the data in a clear and effective way. He wants to interact with graphs and have the system analyse and process the data for him, so that it shows trends and patterns which could help with cancer treatment.

Persona 2

Mary Davis is a data scientist who is carrying out research into cancer patients. She knows that the NHS systems hold huge amounts of data, but she cannot access it easily. She wishes to access this data by asking medical professionals to anonymise datasets. She then aims to write some data mining tools and processes and to visualise the output of the processes carried out on the large amounts of data. By analysing and visualising the data, Mary hopes to benefit doctors by finding patterns in the information.

Storyboards
Storyboard 1

A doctor wishes to analyse how effective a new type of medication is, and wants to correctly advise their patients. The doctor has all this sensitive information to hand, but is unable to manually anonymise and filter through it as there is too much.

As a result, the anonymisation tool is used to automatically remove any identifying information from the data, and the result is then gathered within the PEACH core analytics system. The system combines this data with other sources to accurately and quickly analyse large amounts of data and display it in a clear fashion. Finally, the doctor is able to use this information to improve patients' conditions and get them up and running again!

Storyboard 2

Whilst visualising data, a medical professional is overwhelmed by too much information and too many graphs. The doctor wishes to see only the information they require, in simple graphs and displays. By using the interface provided by the core analytics system, the doctor is able to view and understand specific graphs, which allows accurate decisions to be made much more quickly.

Use Cases
Use Case Diagram
List of Use Cases
UC1 - Anonymise data
Description

A user wishes to anonymise sensitive medical data.

Primary Actor

Doctor/Consultant

System

Anonymiser System

Pre-conditions

User has opened the data anonymiser on their local machine and has switched to the anonymisation option.

Main Flow
  1. User selects an input file of type csv, json or xml
  2. System reads the input file
  3. System displays the list of columns, the data type and the anonymisation option
  4. User edits the settings for a single column or edits multiple column settings in bulk
  5. System opens a new window where the user can change the settings for the selected column(s)
  6. User switches between 3 data anonymisation types: leave as is, anonymise/randomise or remove
  7. User changes the data type for the selected column: name, gender, age, date, address, postcode, town, country, default
  8. For the selected data types, the user can select ranges
  9. System updates these settings
  10. User selects an output file in a chosen file format
  11. User selects the ‘Anonymise’ button
  12. System anonymises the data
  13. System saves the output in the output file
  14. System displays information on the loss percentage
Post-Conditions

The system resets the input fields and settings.

Alternative Flow

None
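
The main flow of UC1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function names, the option strings and the use of a column shuffle for the randomise option are all hypothetical, and the source does not define how the loss percentage is calculated, so the share of columns that were altered or removed is used as a stand-in. Only CSV input is covered.

```python
import csv
import io
import random

# Hypothetical names for the three per-column options listed in UC1.
KEEP, RANDOMISE, REMOVE = "keep", "randomise", "remove"


def anonymise(rows, settings, seed=None):
    """Apply a per-column option to a list of dict rows.

    Columns missing from `settings` are left as is.
    Returns the new rows and an illustrative loss percentage.
    """
    if not rows:
        return [], 0.0
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    output = [dict() for _ in rows]
    changed = 0
    for column in columns:
        option = settings.get(column, KEEP)
        if option == REMOVE:
            changed += 1
            continue  # the column is dropped from the output
        values = [row[column] for row in rows]
        if option == RANDOMISE:
            # Assumption: shuffle values across records to break the
            # link between a value and the record it came from.
            rng.shuffle(values)
            changed += 1
        for out_row, value in zip(output, values):
            out_row[column] = value
    # Assumption: loss = share of columns that were altered or removed.
    loss = 100.0 * changed / len(columns)
    return output, loss


def anonymise_csv(text, settings, seed=None):
    """Read CSV text, anonymise it and return (csv_text, loss)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    result, loss = anonymise(rows, settings, seed)
    buffer = io.StringIO()
    if result:
        writer = csv.DictWriter(
            buffer, fieldnames=list(result[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(result)
    return buffer.getvalue(), loss
```

Shuffling a column on its own is not a strong privacy guarantee. A real tool would need type-aware handling for names, dates, postcodes and the value ranges mentioned in step 8.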

UC2 - Generate data
Description

A user wishes to generate test data using an existing dataset.

Primary Actor

Doctor/Consultant

System

Anonymiser System

Pre-conditions

User has opened the data anonymiser on their local machine and has switched to the generation option.

Main Flow
  1. User selects an input file of type csv, json or xml
  2. System reads the input file
  3. System displays the list of columns, the data type and the anonymisation option
  4. User edits the settings for a single column or edits multiple column settings in bulk
  5. System opens a new window where the user can change the settings for the selected column(s)
  6. User switches between 3 data anonymisation types: leave as is, randomise or remove
  7. User changes the data type for the selected column: name, gender, age, date, address, postcode, town, country, default
  8. For the selected data types, the user can select ranges
  9. System updates these settings
  10. User selects an output file in a chosen file format
  11. User selects the Generate button
  12. System generates novel data
  13. System saves the output in the output file
Post-Conditions

The system resets the input fields and settings.

Alternative Flow

None
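
One simple way to picture step 12 of UC2 is to sample each column independently from the values seen in the input dataset. The sketch below assumes this approach purely for illustration; the source does not say how the tool generates novel data, and `generate_rows` is a hypothetical name.

```python
import random


def generate_rows(rows, count, seed=None):
    """Build `count` synthetic rows from an existing list of dict rows.

    Assumption: every column is sampled independently from the values
    observed in that column, so relationships between columns are lost.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # One pool of observed values per column.
    pools = {column: [row[column] for row in rows] for column in columns}
    return [
        {column: rng.choice(pools[column]) for column in columns}
        for _ in range(count)
    ]
```

Because the generated rows reuse observed values verbatim, rare values in the input would still appear in the output. This is why sensitive columns would need the randomise or remove options first.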

UC3 - View raw data
Description

A user wishes to view raw data within the PEACH core analytics system through the visualiser.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Medical Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and the frontend has successfully connected with the analytics infrastructure.

Main Flow
  1. User logs into the web interface
  2. User selects the dataset to view
  3. System displays a list of entries
  4. System displays keys and values for each entry
  5. User can filter and sort the raw data
Post-Conditions

None

Alternative Flow

UC8 ERROR

UC4 - Visualise data
Description

A user wishes to visualise the data within the PEACH core analytics system in graphical displays. The graphs and displays have already been created.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and the frontend has successfully connected with the analytics infrastructure.

Main Flow
  1. User logs into the web interface
  2. User selects to visualise the data
  3. User selects the dataset to view
  4. User selects from the available displays
  5. User visualises the data in graphical displays
Post-Conditions

None

Alternative Flow

UC8 ERROR

UC5 - Interact with graphs
Description

A user wishes to visualise and interact with the data within the PEACH core analytics system.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface, logged in and opened up a visualisation of data. The frontend has successfully connected with the analytics infrastructure.

Main Flow
  1. User can change the axis settings and intervals
  2. User can alter the visible data displayed on the graphs
  3. User can select points on the graph to display more information
Post-Conditions

None

Alternative Flow

UC8 ERROR

UC6 - Creating graphs
Description

A user wishes to create new visual displays within the PEACH core analytics system.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and the frontend has successfully connected with the analytics infrastructure.

Main Flow
  1. User logs into the web interface
  2. User selects to create a new visualisation
  3. User selects the dataset to visualise
  4. User selects a graph type from the available options
  5. User sets intervals and axis settings
  6. User creates the new graphical display
Post-Conditions

None

Alternative Flow

UC8 ERROR

UC7 - Write queries
Description

A user wishes to write queries to gain further insight into the data.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface and the frontend has successfully connected with the analytics infrastructure.

Main Flow
  1. User logs into the web interface
  2. User enters query
  3. System searches for data based on query
  4. System displays output based on query
  5. User can view and interact with the data
Post-Conditions

None

Alternative Flow

UC8 ERROR
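
If the queries in UC7 are served by Elasticsearch, which the project aims to integrate but which the use case itself does not name, a user query could be translated into a Query DSL request body along the following lines. The helper name and the field names in the usage note are hypothetical. The sketch only builds the JSON body and does not contact a cluster.

```python
import json


def build_search_body(text, field, date_field=None, start=None, end=None, size=20):
    """Build an Elasticsearch Query DSL body as a plain dict.

    A full-text `match` on `field` is combined with an optional
    `range` filter on `date_field` inside a `bool` query.
    """
    bool_query = {"must": [{"match": {field: text}}]}
    if date_field and (start or end):
        bounds = {}
        if start:
            bounds["gte"] = start  # inclusive lower bound
        if end:
            bounds["lte"] = end  # inclusive upper bound
        bool_query["filter"] = [{"range": {date_field: bounds}}]
    return {"query": {"bool": bool_query}, "size": size}


def to_request_body(body):
    """Serialise the body for the index's `_search` endpoint."""
    return json.dumps(body)
```

For example, `build_search_body("lung", "diagnosis", date_field="admitted", start="2017-01-01")` would match entries whose hypothetical `diagnosis` field mentions "lung" and whose `admitted` date falls on or after the given day.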

UC8 - ERROR
Description

There is a system error with the core analytics or the visualisation web application.

Primary Actor

Medical Professional/Researcher

Secondary Actor

Core Analytics Infrastructure

System

Visualisation System

Pre-conditions

User has opened the visualisation web interface.

Main Flow
  1. System detects an error
  2. System displays the error message on screen
Post-Conditions

None

Alternative Flow

None