Feb 2015 Hack Event - Part 2: The Importance of Health Informatics Data and Software, Prof Paul Burton, University of Bristol
Paul Burton discussed how academic health science should respond to the challenges of big data, and the potential role for BRISSKit in this space.
He began by emphasising that the term ‘big data’ is a misnomer: the size of the data is only one aspect of the challenge, and focusing on it can obscure other characteristics of the data that need to be considered. BRISSKit can play a role in addressing some of these obscured areas.
To illustrate this point, Paul used the analogy of describing a city as a big village. He pointed out that the infrastructure and integration systems (both internal and external) required by a city are very different from those required by a village. Cities are complex, integrative systems that are qualitatively different from villages. They also happen to be larger than villages, but size is not the sole defining difference. Paul acknowledged that the accepted definition of ‘big data’ has evolved to reflect this, and now encompasses this issue of complexity, but this is not always fully communicated.
In a recent strategic review for the University of Bristol, Paul and his team concluded that universities responding to the real challenges of big data in the health sciences need to have:
• Good involvement in at least some methodological development, particularly in niche areas;
• An effective and sustainable support infrastructure, including facilities, equipment and staff for the leading applied research groups in the university;
• An adequate investment in an effective, up-to-date, multipurpose IT infrastructure across the university.
He discussed some of the ways in which the University of Bristol and the ALSPAC Longitudinal Birth Cohort Study he co-directs have been responding to these challenges, including the introduction of a university-wide institute for data science to help integrate existing efforts.
Paul argued that universities need to allow access to integrated data, including horizontally and vertically partitioned data. He suggested a data pipeline, including:
• Effective ways to generate data;
• Physical infrastructure to pull data together;
• Effective, well-governed data access systems;
• Adequate security, appropriate for the level of problem;
• Quality assessment and data harmonisation;
• Valid, efficient and scalable approaches to analysis;
• Effective visualisation approaches;
• Archiving and curation.
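The two kinds of data partitioning Paul mentioned can be illustrated with a minimal sketch. This is purely illustrative (the variable names and toy records are hypothetical, not drawn from BRISSKit or ALSPAC): horizontally partitioned data means different sources hold the same variables for different individuals, while vertically partitioned data means different sources hold different variables for the same individuals.

```python
# Horizontal partitioning: two studies hold the same variables
# (bmi, smoker) for *different* individuals.
study_a = [{"id": 1, "bmi": 24.1, "smoker": False},
           {"id": 2, "bmi": 27.9, "smoker": True}]
study_b = [{"id": 3, "bmi": 22.4, "smoker": False}]

# Vertical partitioning: two sources hold *different* variables
# (clinical vs. genomic) for the same individuals, keyed by id.
clinical = {1: {"bmi": 24.1}, 2: {"bmi": 27.9}}
genomic = {1: {"snp_rs123": "AG"}, 2: {"snp_rs123": "GG"}}

# Integrating horizontal partitions is a concatenation of rows;
# integrating vertical partitions is a join on the shared identifier.
combined_rows = study_a + study_b
combined_cols = {pid: {**clinical[pid], **genomic[pid]} for pid in clinical}
```

In practice the join step is where governance and harmonisation bite: the identifiers and variable definitions must line up across sources before any integration is valid.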
Each of these steps requires the integration of a range of specialist activities. Paul advocated a bottom-up approach that allows users to work with a range of different software and hardware. Funders would perform a central enabling role, offering guidance and investing in the communications between systems, giving researchers the flexibility to draw on appropriate functionality from different systems to suit their needs. This ‘plugin’ approach could extend right across the data pipeline and gradually enhance it over time. It reflects the philosophy underpinning BRISSKit, with Jisc potentially playing the central enabling role.
To conclude, Paul gave a brief overview and demonstration of his team’s current project, DataSHIELD, which is incorporated in Opal. There are plans to integrate Opal into BRISSKit, so DataSHIELD will soon form part of the BRISSKit stack. DataSHIELD allows users to combine data from multiple sources and co-analyse them by taking the analysis to the data, rather than the data to the analysis. This provides a non-disclosive way of pulling together data from multiple studies, which will add an extra dimension to BRISSKit.
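The idea of taking the analysis to the data can be sketched in a few lines. This is not the real DataSHIELD API (in practice each site runs server-side software such as Opal that enforces disclosure controls); it is a minimal illustration, with hypothetical function names and toy values, of how a pooled estimate can be computed when only aggregate statistics, never row-level records, leave each site.

```python
def site_summary(values):
    """Runs at a data provider's site: returns only non-disclosive
    aggregates (sum and count), never the individual-level values."""
    return {"sum": sum(values), "n": len(values)}

# Hypothetical measurements held at two separate sites.
# The row-level data never leaves its site; only summaries travel.
site_1 = [5.1, 6.3, 5.8]
site_2 = [6.0, 5.5, 6.2, 5.9]

summaries = [site_summary(site_1), site_summary(site_2)]

# The central analyst combines the aggregates into a pooled estimate,
# exactly as if the data had been physically merged.
pooled_mean = (sum(s["sum"] for s in summaries)
               / sum(s["n"] for s in summaries))
```

The same pattern generalises from means to regression models: each site computes and returns model components (e.g. score vectors), and the coordinating analysis iterates until the federated fit converges.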