After completing requirements gathering, we began by researching existing technologies and solutions. Starting with data anonymisation, we found a variety of software and libraries, some proprietary and others open-source. Each has advantages and disadvantages, and we compared these existing pieces of software against the requirements for our tool. We also differentiated the researched technologies by assessing how open and well structured the code is, how well it would fit within our solution, and how effectively it applies advanced techniques to anonymise data.
Whilst researching existing anonymisation tools, we also investigated technologies for data visualisation. There are various open-source and commercial software programs that could be integrated into the core analytics infrastructure. From these, we chose the option that allowed enough flexibility, had the features required by the user, and provided a user interface suited to the target audience of researchers and medical professionals.
ARX
The most commonly used tool for data anonymisation is ARX, a Java application and library which provides comprehensive data anonymisation capabilities to any Java program [1]. The Java tool provided by ARX is very complex and allows for extensive and advanced anonymisation, so it would not have met the ease-of-use requirements; we therefore decided to consider only the ARX library for data anonymisation. The ARX library supports a range of privacy models, chosen according to the privacy threats and the data provided. Examples of the supported privacy models are k-Anonymity, k-Map, t-Closeness, ℓ-Diversity, δ-Disclosure privacy, β-Likeness and δ-Presence [1]. The library provides a clean API that exposes these privacy models to programmers.
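To make the simplest of these privacy models concrete, the following is a minimal sketch of a k-anonymity check in plain Java. It does not use the ARX API; the class name `KAnonymityCheck`, the method `isKAnonymous` and the sample rows are assumptions introduced purely for illustration. A dataset is treated as k-anonymous when every combination of quasi-identifier values appears in at least k rows.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative k-anonymity check; hypothetical helper, not part of the ARX API. */
final class KAnonymityCheck {

    /**
     * Returns true if every combination of values in the given
     * quasi-identifier columns occurs in at least k rows.
     */
    static boolean isKAnonymous(List<String[]> rows, int[] quasiIdentifierColumns, int k) {
        // Count how many rows share each combination of quasi-identifier values.
        Map<String, Integer> groupSizes = new HashMap<>();
        for (String[] row : rows) {
            StringBuilder key = new StringBuilder();
            for (int column : quasiIdentifierColumns) {
                key.append(row[column]).append('|');
            }
            groupSizes.merge(key.toString(), 1, Integer::sum);
        }
        // The dataset fails as soon as one group is smaller than k.
        for (int size : groupSizes.values()) {
            if (size < k) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample data: age band, postcode area, diagnosis.
        List<String[]> rows = List.of(
            new String[] {"30-39", "NW1", "asthma"},
            new String[] {"30-39", "NW1", "diabetes"},
            new String[] {"40-49", "SE1", "asthma"},
            new String[] {"40-49", "SE1", "influenza"});
        int[] quasiIdentifiers = {0, 1}; // age band and postcode area
        System.out.println(isKAnonymous(rows, quasiIdentifiers, 2));
        System.out.println(isKAnonymous(rows, quasiIdentifiers, 3));
    }
}
```

In this sample, each combination of age band and postcode area appears exactly twice, so the data is 2-anonymous but not 3-anonymous. A library such as ARX goes further by transforming the data until the chosen model holds, which this sketch does not attempt.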
The ARX library is advantageous because it can handle very large datasets on commodity hardware and provides methods for analysing data utility and identifying potential risks [3]. This would be extremely beneficial when anonymising very large datasets of sensitive medical information, which must be handled carefully. Another advantage of ARX is that it can carry out data cleansing [1]. Filling in missing data using averages and fixing data that is in the wrong format is especially useful for hospital data, which may sometimes be unstructured. Finally, not only does the ARX library support a wide range of advanced statistical models, it is also open-source [2]. This means that the software can be adapted to the needs of this project.
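As an illustration of what filling in missing data using averages involves, the sketch below replaces missing numeric values with the mean of the observed values. It is written in plain Java and is not how ARX itself implements data cleansing; the class `MeanImputation`, the method `fillMissingWithMean` and the sample values are hypothetical, and missing entries are assumed to be represented as `null`.

```java
import java.util.Arrays;

/** Illustrative mean imputation; hypothetical helper, not part of the ARX API. */
final class MeanImputation {

    /** Returns a copy of the column in which each null entry is replaced by the mean. */
    static double[] fillMissingWithMean(Double[] values) {
        double sum = 0.0;
        int observed = 0;
        for (Double value : values) {
            if (value != null) {
                sum += value;
                observed++;
            }
        }
        if (observed == 0) {
            throw new IllegalArgumentException("Column has no observed values");
        }
        double mean = sum / observed;

        double[] filled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            filled[i] = (values[i] != null) ? values[i] : mean;
        }
        return filled;
    }

    public static void main(String[] args) {
        // Made-up sample column (for example, patient weights) with one missing entry.
        Double[] weights = {70.0, null, 80.0};
        System.out.println(Arrays.toString(fillMissingWithMean(weights)));
    }
}
```

With the sample column, the missing entry is replaced by 75.0, the mean of the two observed values. A column with no observed values is rejected rather than guessed at.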
However, the ARX library also has some disadvantages. Because the library is so extensive, it is difficult to let users select an anonymisation model without making the tool difficult to use. This problem could be solved by carefully analysing the different use cases of the resulting software and designing the tool to select privacy models algorithmically for each of those situations. This would allow the tool to have a simple user interface which provides a good user experience, whilst also meeting all the functional requirements. Additionally, the technical documentation is somewhat lacking, which will mean more work for the programmer.
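The idea of selecting privacy models algorithmically could be sketched as a small set of rules, as below. This is only an illustration under assumed use cases: the class `PrivacyModelSelector`, its two boolean inputs and the rules themselves are hypothetical and are not taken from ARX or from the project's requirements. The rules reflect the general purpose of each model: k-anonymity as a baseline against re-identification, ℓ-diversity against disclosure of a sensitive attribute, and δ-presence against disclosure of membership in the dataset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rule-based selector; the rules are illustrative only. */
final class PrivacyModelSelector {

    /** A subset of the privacy model families named in the text. */
    enum Model { K_ANONYMITY, L_DIVERSITY, DELTA_PRESENCE }

    /**
     * Chooses privacy models from two yes/no questions that a simple
     * user interface could ask instead of exposing every model.
     */
    static List<Model> select(boolean hasSensitiveAttribute, boolean membershipIsSensitive) {
        List<Model> models = new ArrayList<>();
        models.add(Model.K_ANONYMITY); // baseline protection against re-identification
        if (hasSensitiveAttribute) {
            models.add(Model.L_DIVERSITY); // protects against attribute disclosure
        }
        if (membershipIsSensitive) {
            models.add(Model.DELTA_PRESENCE); // protects against membership disclosure
        }
        return models;
    }

    public static void main(String[] args) {
        // Example: a dataset with a sensitive diagnosis column.
        System.out.println(select(true, false));
    }
}
```

The design choice here is that the user answers questions about the data rather than choosing a model by name, which keeps the interface simple while the tool decides which models to apply.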
Imperva Camouflage Data Masking
Camouflage Data Masking is a proprietary application which requires Java to run; the user interacts with the software through a web browser. It reduces the exposure of sensitive data by locating and categorising data in databases before applying heuristics and statistical analysis to those classifications [3]. In addition to databases, various file formats are supported. With files, the software detects the format of the input and uses algorithms to shuffle or substitute characters and to generate random data, in order to keep the integrity of results while removing any identifying information.
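The general idea of substituting characters while keeping the shape of a value can be sketched as follows. This is not Imperva's algorithm, which is proprietary; the class `CharacterSubstitutionMask`, the method `mask` and the sample identifier are assumptions made for illustration. Digits are replaced by random digits and letters by random letters of the same case, while separators are left in place.

```java
import java.util.Random;

/** Illustrative character-substitution masking; not the Camouflage algorithm. */
final class CharacterSubstitutionMask {

    /** Replaces each ASCII digit or letter with a random one of the same kind. */
    static String mask(String input, Random random) {
        StringBuilder masked = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (c >= '0' && c <= '9') {
                masked.append((char) ('0' + random.nextInt(10)));
            } else if (c >= 'a' && c <= 'z') {
                masked.append((char) ('a' + random.nextInt(26)));
            } else if (c >= 'A' && c <= 'Z') {
                masked.append((char) ('A' + random.nextInt(26)));
            } else {
                masked.append(c); // keep separators so the format is preserved
            }
        }
        return masked.toString();
    }

    public static void main(String[] args) {
        // Made-up identifier; a real tool would use a secure random source.
        System.out.println(mask("AB-1234", new Random()));
    }
}
```

The masked value keeps the same length and layout as the input, so software that expects that format continues to work. Random substitution on its own gives no formal privacy guarantee, which is one reason the statistical privacy models discussed for ARX matter.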
Camouflage Data Masking is advantageous because it is built specifically for the purpose of anonymising datasets. It also supports many types of database and a large number of file formats [3].
However, there are several disadvantages to using Camouflage Data Masking. It gives the user little control over the anonymisation process and does not allow specific anonymisation models to be applied. Additionally, since the application is commercial, it must be purchased from Imperva, which would become very expensive when run on many machines. The code is also not open-source, so the package cannot be adapted to meet the specific needs of the user.
UTD Anonymization ToolBox
The UTD Anonymization ToolBox is a toolbox intended for academic researchers in the area of privacy-preserving data analysis [4]. It supports six anonymisation methods over three privacy definitions: Datafly, Mondrian Multidimensional k-Anonymity, Incognito, Incognito with l-diversity, Incognito with t-closeness, and Anatomy [4]. The toolbox is written in Java and, alongside the supported file formats, can use an SQLite database as a data store.
The UTD Anonymization ToolBox has the advantage of being distributed as open-source code, which means that extensions can easily be added and that our anonymisation solution could readily incorporate algorithms from the toolbox. Additionally, the toolbox has been well tested within the academic research community, so it would likely carry out anonymisation which meets standards and removes any identifying information.
However, the toolbox is rather limited compared to the ARX library: it does not support as many privacy models and is released with very little documentation. This would make it difficult to use and to understand its interfaces. As a result, further research would be required should this toolbox be chosen to support our data anonymisation tool.
Cornell Anonymization Toolkit
The Cornell Anonymization Toolkit is designed for interactively anonymising published datasets to limit the identification disclosure of records under various attacker models [5]. It is designed for Windows and has a simple user interface, but it lacks statistical privacy models, so we believe it cannot guarantee the privacy of users.
Other Anonymisation Tools
There are some other anonymisation tools available, but they focus on anonymising specific formats rather than datasets.
Anonymizer is an anonymisation tool with the sole purpose of blurring identifying information within an image [6]. An example of this is a street-view picture where any faces or number plates are blurred. The software takes images of various formats and allows the user to conceal identifying information.
Another function-specific anonymisation tool is NLM Scrubber. This tool takes documents as input and removes any identifying information such as names, dates and IDs, and is particularly useful when desensitising medical documents [7].
Programming Languages
R
The R programming language is used primarily for statistical computing and data analysis [8]. It offers many features that allow data scientists to create statistical models using linear and non-linear techniques such as regression, classification and clustering. This would allow an application written in R to manipulate data easily for anonymisation. However, there are not many libraries for data anonymisation in R. Additionally, since R offers limited support for building user interfaces, other languages would be required to wrap the R code, which would add to the implementation work.
Python
Python is a powerful language with many libraries and a relatively simple syntax, so it would be easy to develop an application in it. It is an interpreted language [9] but, again, there are not many libraries specific to data anonymisation. As with R, using an anonymisation package written in a different language would require more implementation work. We do not recommend implementing our own anonymisation algorithms, since we cannot test them sufficiently to ensure patients’ privacy.
JavaScript and React
JavaScript is an interpreted language which runs in most web browsers [10], and React is a JavaScript framework for building user interfaces [11]. An advantage of a data anonymisation tool written in JavaScript is that the tool would be easily accessible from any computer. However, JavaScript does not have many anonymisation libraries, and it would be difficult to guarantee that sensitive data never leaves the user’s computer.
Java
Java is an object-oriented programming language with many data anonymisation libraries and, using the Java Virtual Machine, Java applications can easily be run on different platforms [12]. The same jar file can be run on most platforms. This not only meets the requirements but, given the availability of many libraries, including libraries for building user interfaces, also makes Java suitable for creating the data anonymisation tool.
Kibana
Kibana is an open-source visualisation plugin for the Elastic Stack. It allows users to create many different graphs and charts from large volumes of data, and it can use machine learning to find anomalies in data [13].
An advantage of Kibana is that it allows users to share their dashboards and embed them on the web. It also allows generated insights to be exported to PDF and CSV files. Another advantage is that it is open-source and free [14], which means that it could be adapted to meet the specific needs of the users and would have low running costs. A disadvantage of Kibana is that it requires a web server to run, and factors such as the number of users and the volume of traffic will determine the cost of running those servers.
Tableau
Tableau is a visualisation tool very commonly used by data scientists. It allows users to input a dataset and analyse the desired attributes using various graphs and charts. The aim of Tableau is to allow users to gain insights and identify patterns quickly using visualisations created from their dataset. Tableau also allows users to output the visualisations they have created.
The greatest advantage of Tableau is that it can be accessed on any platform using a web browser, so users will not have to go through any trouble to get the system up and running. Another advantage of Tableau is that it is widely used by data analytics professionals and provides many options for generating useful insights [15]. A disadvantage of Tableau is that the software is costly, and it would be expensive to run it across the NHS. Another disadvantage is that it is complicated to use, and users would have to be trained extensively to use it effectively. The program is also not open-source, so it cannot be adapted to meet the specific requirements of the users.
Power-BI
Power-BI is a business analytics suite provided by Microsoft which allows users to easily visualise data and create reports and dashboards [16]. The software accepts data in many formats and provides the user with simple drag-and-drop options to create visualisations based on the data.
An advantage of Power-BI is that it is easy to use, so training costs would be minimal. Power-BI also makes it easy to share and view visualisations and insights on mobile devices and the web using the Power-BI service [17]. The downside of Power-BI is that the visualisations and data analysis can only be carried out in Power-BI Desktop, which is only available on Windows. This means that only people with a Windows operating system can use the software. Another disadvantage of Power-BI is that it is expensive to run across many users.
Custom Visualisation
It would also be possible to create a custom visualisation tool designed specifically for the users, written in JavaScript with frameworks such as React. This would allow the solution to be designed with the user in mind, meeting all their functional and non-functional requirements and providing a good user experience. However, designing a visualisation tool from scratch would require a longer design and implementation process, and that time could be better spent enhancing existing visualisation applications. Finally, a custom application, although tailored to the user’s needs, may not have as many features as existing visualisation technologies.
After researching existing solutions for data anonymisation, we decided to use the Java ARX library in our project. The ARX library is open-source, includes many privacy models which meet anonymisation standards, and is widely used.
The biggest advantage is that the existing tool at PEACH already applies basic aspects of the library, so it would be suitable to use ARX to develop this software further. Furthermore, the anonymisation tool has to run locally on users’ computers, and a solution written in Java is able to meet this requirement. Although it does have all this functionality, it has a complex UI which would require training and advanced knowledge of anonymisation to use effectively. We have therefore decided to rewrite the user interface to provide one designed specifically for the target audience.
We decided against the other related technologies because the ARX library is more advanced and provides more anonymisation algorithms and, in the case of Camouflage Data Masking, because it would be expensive to run due to licensing costs. Anonymizer and NLM Scrubber will also not be part of our anonymisation solution because they have specific functions which do not meet the needs of the user.
For the core analytics, considering the existing solution with Apache Kafka, NiFi and Spark, we decided that Kibana and the Elastic Stack would provide the most effective data visualisation tool. Not only did our client suggest this technology stack, but the software is open-source and has a less steep learning curve than other existing technologies. Kibana can also be run from any platform and allows users to share their dashboards easily. The greatest advantage of Kibana is that it is designed to allow users to visualise large volumes of data, so it seemed to be the most appropriate choice for our system.
We chose not to use Power-BI mainly because the visualisation suite can only be run on the Windows operating system, which could be an issue for any user without it. We decided against Tableau because it is difficult to use, and training users would make it expensive. A disadvantage of both Tableau and Power-BI is that they are costly, and it would be expensive to run them on multiple machines across the NHS. Kibana, by contrast, is free, open-source software that is easy to use and accessible on all platforms, as it runs on the web.