What’s Keeping Your Enterprise from Fully Leveraging Its Data?


Jim Olson, CEO, Flywheel

Research & development (R&D) processes are increasingly being redefined within the framework of digital transformation. With these initiatives, pharma organizations aim to improve speed to market and reduce the considerable R&D costs of drug discovery and development.

Artificial intelligence (AI) and machine learning (ML) are key parts of this transformation. However, organizations are discovering that harnessing these approaches isn’t the work of just a few weeks or months. Some companies have gone through several iterations of digital transformation in R&D, with costly failures along the way. A recurring issue among these iterations has been the problem of how to efficiently de-silo data and enable enterprise-wide access.

Machine learning is data-hungry (especially for diverse data), and unless research enterprises have outstanding data governance models in place, they face challenges in realizing a true digital transformation. Often, the challenge is in breaking data out of institutional silos, uniting them in a central repository (or a centrally managed and normalized set of repositories), and standardizing them for machine learning.

Decades of ingrained culture, pieced-together tech stacks, and homegrown systems make strong headwinds for life science leaders as they seek to move their research teams forward. But the risks of the status quo are impossible to ignore: without modern data management, organizations are wasting money, missing innovative opportunities, under-leveraging their assets, and potentially even facing compliance risks.

Why Is Standardizing Data Such a Challenge?

It doesn’t take long to understand at a high level why data management is so difficult for pharma companies. The data they hold are old and new, simple and complex, and sourced from multiple places, ranging from internal or external research results to clinical trials that are still underway. In addition to what is located in a company’s own archives, data can be held externally by development partners such as contract research organizations (CROs) and clinical sites.

To explore just one example of the difficulty in standardizing even newly captured data, consider a global clinical trial with hundreds of patients that has prescribed an imaging protocol with five types of examinations for every patient at multiple time points. With imaging performed across multiple sites, devices, providers, and languages, this trial could generate thousands of unique metadata tags and descriptions.
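
To see how quickly that variation adds up, one can scan the image headers and count distinct description values. The sketch below is a minimal illustration using the open-source pydicom library; the folder name and the focus on the SeriesDescription tag are assumptions for the example, not part of any particular trial's protocol.

```python
# A minimal sketch, assuming a local folder of .dcm files named "trial_data":
# tally the distinct SeriesDescription values across a trial's image headers
# to gauge how much metadata variability has accumulated.
from collections import Counter
from pathlib import Path

import pydicom  # third-party: pip install pydicom


def tally_series_descriptions(root: str) -> Counter:
    """Count how many times each distinct SeriesDescription appears under root."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*.dcm"):
        ds = pydicom.dcmread(path, stop_before_pixels=True)  # read headers only
        counts[str(ds.get("SeriesDescription", "<missing>"))] += 1
    return counts


if __name__ == "__main__":
    for description, n in tally_series_descriptions("trial_data").most_common(20):
        print(f"{n:6d}  {description}")
```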

As this example illustrates, the sum of an organization’s data, having been captured over time by different researchers, different devices, and using different organizational conventions, results in a collection of heterogeneous data with untold variation, even if data is provided in the same format (e.g., DICOM). The older the data, the more challenges may arise when attempting to curate those data. Over a dataset’s lifecycle, as the data have moved through transfer and analysis pipelines and changed format, the likelihood that metadata have been manipulated or even removed increases. Significant portions of the data will likely predate AI/ML and will not have been acquired, archived or organized with such applications in mind. However, as already mentioned, ML is data-hungry, and even legacy data or data that are part of long-term longitudinal studies with issues like these are worth curating to create bigger datasets for training.

Traditionally, in order to harness disparate data for ML, research teams have had to spend countless hours locating datasets and manually curating them to a common standard. This curation entails tasks including selection, classification, transformation, validation, and preservation of research data and supporting material. Manual curation dramatically increases the cost and time of discovering and developing therapeutics, and failure to implement a standardized process can set the enterprise up to waste money again and again, finding and curating the same data repeatedly.
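
As a rough illustration of what those curation steps can look like in practice, the sketch below classifies per-scan metadata records against a small controlled vocabulary and validates them before they are admitted to an analysis set. The record fields, modality codes, and labeling rules are hypothetical placeholders rather than an established standard.

```python
# A minimal sketch of the curation steps named above (selection, classification,
# transformation, validation) applied to per-scan metadata records. The field
# names, modality codes, and labeling rules are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class ScanRecord:
    subject_id: str
    modality: str            # DICOM modality code, e.g. "MR", "CT", "PT"
    series_description: str  # free-text description as captured at the site
    label: str = ""          # controlled-vocabulary label assigned during curation


VALID_MODALITIES = {"MR", "CT", "PT"}


def classify(record: ScanRecord) -> ScanRecord:
    """Map a free-text series description onto a controlled vocabulary."""
    text = record.series_description.lower()
    if "t1" in text:
        record.label = "T1w"
    elif "t2" in text:
        record.label = "T2w"
    else:
        record.label = "unclassified"
    return record


def validate(record: ScanRecord) -> bool:
    """Keep only records that meet the minimum standard for analysis."""
    return record.modality in VALID_MODALITIES and record.label != "unclassified"


def curate(records: list[ScanRecord]) -> list[ScanRecord]:
    """Classify every record, then select the ones that pass validation."""
    return [r for r in map(classify, records) if validate(r)]
```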

To avoid these problems, life science organizations must adopt strategies to efficiently curate data and associated metadata (in all forms) upon ingestion to a central repository. Without an established and diligent approach, researchers cannot efficiently leverage the enterprise’s assets, and data that should be the fuel for development instead becomes a stumbling block.

Applying FAIR Principles to Complex Data

The data management framework that the ML research community has embraced in recent years is FAIR Data—meaning digital assets should be Findable, Accessible, Interoperable and Reusable. While the FAIR principles are more widely applied in academia and in clinical research settings (where data sharing is expected and required), they are also applicable within the walls of life science companies to make data more valuable across the enterprise.

Forward-thinking life science companies are using these guidelines not only to combat the data management issues they face, but also to maximize the potential of their data. If applied correctly, the FAIR principles can advance research in pharma companies by reducing R&D work and costs, bringing operational efficiencies, and ultimately accelerating time to market.

Of course, applying the FAIR principles is enough of a challenge when dealing with tabular data; FAIRifying the medical imaging data required for many ML efforts is an even bigger hurdle. Medical imaging is a rich and valuable source of information for researchers, but because of their large size and complex nature, imaging data pose one of the biggest challenges to organizations seeking to de-silo and FAIRify their data.

As an example, consider DICOM files. The DICOM format itself provides some level of standardization, but significant variations still exist between modalities (MR, CT, PET, etc.), vendor instrumentation (Siemens, GE, Philips, etc.), acquisition type, and specific site. Such differences must be reconciled before the data can be used in assessment/analysis approaches, including ML.
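
One common first step in that reconciliation is mapping free-text header values onto a shared vocabulary. The sketch below reads a DICOM header with pydicom and normalizes the manufacturer and modality fields; the mapping table is only a small illustrative subset of what a production pipeline would maintain.

```python
# A minimal sketch, using pydicom, of normalizing vendor- and site-specific
# header values onto a shared vocabulary before analysis. The mapping table is
# a small illustrative subset; a production pipeline would maintain much larger,
# curated dictionaries.
import pydicom

MANUFACTURER_MAP = {
    "siemens": "Siemens", "siemens healthineers": "Siemens",
    "ge medical systems": "GE", "ge healthcare": "GE",
    "philips": "Philips", "philips medical systems": "Philips",
}


def normalized_header(path: str) -> dict:
    """Extract and normalize a few key fields from one DICOM file's header."""
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    vendor_raw = str(ds.get("Manufacturer", "")).strip().lower()
    return {
        "modality": str(ds.get("Modality", "unknown")),            # MR, CT, PT, ...
        "vendor": MANUFACTURER_MAP.get(vendor_raw, "other"),
        "series": str(ds.get("SeriesDescription", "")).strip().lower(),
    }
```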

Additional challenges standing in the way of leveraging imaging data include sheer size: datasets can run to gigabytes per study, terabytes over a cohort, and petabytes in legacy systems. Furthermore, labeling the data often requires qualitative assessments by scientists and radiologists, and setting up workflows and data capture mechanisms for this work adds more labor to an already intensive effort. Moreover, imaging data often need to be integrated with other data types (e.g., clinical measures). Collectively, these factors help illustrate why application of the FAIR principles presents a high hurdle, even to very motivated research teams.
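
As a small illustration of that last integration step, the sketch below joins an imaging metadata table with clinical measures on a shared subject identifier using pandas; the column names and file paths are assumptions made for the example.

```python
# A minimal sketch: join per-scan imaging metadata with clinical measures so
# that downstream analysis sees both. Column names ("subject_id", "label",
# "biomarker") and CSV paths are hypothetical.
import pandas as pd

imaging = pd.read_csv("imaging_catalog.csv")     # subject_id, label, path, ...
clinical = pd.read_csv("clinical_measures.csv")  # subject_id, biomarker, visit, ...

# An inner join keeps only subjects that have both an image and clinical data.
combined = imaging.merge(clinical, on="subject_id", how="inner")
print(combined.head())
```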

A Realistic Approach for De-Siloing and FAIRifying

Life science enterprises are coming to terms with the fact that manually locating datasets and curating them at the scale required for ML is cost-prohibitive, time-consuming, and prone to human error. As discussed, data and metadata from internal and external sources need to be curated, including standardization, classification, and quality control, to ensure that they meet the standard required for complex analysis and ML workflows. As more data are introduced over time, these processes must be easily repeated within and across datasets and must scale with the volume of incoming data.

In attempts to streamline some of this work, many organizations have created homegrown infrastructure for select purposes. However, these systems are typically built for very specific tasks and require expertise from IT departments and/or data scientists to create and maintain. Such programs can be difficult to train staff on, and the institutional knowledge on how to run them may reside with just a handful of people, which puts such operations at risk when teams are reorganized or key members depart.

Many organizations are discovering that a better alternative is a modern data management platform, which can automate much of the work of ingesting data, even complex imaging data, and curating it to FAIR standards. The automation of these processes is key, as it reduces human error and variability. Automation also promotes adoption of FAIRification as a realistic, achievable goal that doesn't require an indefinite all-hands-on-deck effort from data scientists.

A modern data management platform can leverage cloud scalability and parallelism to achieve many goals at once: it reduces upload time while handling data from both old and new sources; it automatically de-identifies data, extracts metadata, and classifies data according to the organization's needs; and it builds an easily searchable collection of data. In short, it prepares the data for downstream processes in a standardized and efficient way.
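
A highly simplified sketch of such an ingestion flow appears below: each file is de-identified, key metadata are extracted, the series is classified, and the result is indexed in a searchable catalog. The helper logic and the SQLite catalog are stand-ins for illustration, not a description of any particular vendor's implementation.

```python
# A highly simplified sketch of an automated ingestion flow: de-identify,
# extract metadata, classify, and index each file in a searchable catalog.
import sqlite3

import pydicom


def ingest(paths: list[str], db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE IF NOT EXISTS scans (path TEXT, modality TEXT, label TEXT)")
    for path in paths:
        ds = pydicom.dcmread(path)
        ds.remove_private_tags()         # part of de-identification
        ds.PatientName = "ANONYMIZED"    # (a real pipeline would persist the
                                         #  de-identified copy and log the action)
        modality = str(ds.get("Modality", "unknown"))        # metadata extraction
        series = str(ds.get("SeriesDescription", "")).lower()
        label = "T1w" if "t1" in series else "other"         # classification
        db.execute("INSERT INTO scans VALUES (?, ?, ?)", (path, modality, label))
    db.commit()
```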

With this type of extensible data platform, the FAIR principles can become more concrete within an organization:

  • Findability is dramatically improved, as researchers can build their complete dataset with simple queries within one interface (see the query sketch after this list). Without this approach, researchers commonly must request data from CROs, external collaborators, or an internal archive, which can take days or weeks and often comes with additional costs (paid repeatedly if multiple divisions request the same data from the vendor).
  • Accessibility is handled with role-based permissions, which can be configured to give individuals the appropriate level of access to data and to determine which actions they can take with it. Accessibility also improves between the enterprise and an external collaborator or CRO: with the right configuration, new data from a collaborator or CRO can appear in the organization’s catalog as the data are captured, rather than researchers waiting for the data to be delivered in a package at designated time points.
  • Interoperability is simplified with APIs and tools for working with the data, including export, analysis and integration with other data types.
  • Reusability is enabled by letting teams repurpose the same data for analysis and large-scale processing across the enterprise, which is straightforward when the data and associated metadata are well described and properly stored.
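
As referenced in the findability item above, the sketch below shows what such a query could look like against the catalog built in the earlier ingestion example: a cohort of T1-weighted MR scans assembled in a single call rather than through repeated data requests. The table and column names match that earlier sketch and are illustrative only.

```python
# A hypothetical findability query against the "scans" catalog built in the
# ingestion sketch above. Table and column names are illustrative only.
import sqlite3


def find_scans(db: sqlite3.Connection, modality: str, label: str) -> list[str]:
    """Return the file paths of all indexed scans matching modality and label."""
    rows = db.execute(
        "SELECT path FROM scans WHERE modality = ? AND label = ?",
        (modality, label),
    )
    return [path for (path,) in rows]


# Example: every T1-weighted MR scan currently indexed, ready to hand off to an
# analysis pipeline.
# t1_paths = find_scans(sqlite3.connect("catalog.db"), "MR", "T1w")
```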

Aside from the efficiency benefits that can be achieved from de-siloing data, organizations should also consider how modern data management can simplify compliance and regulatory approvals. First, de-identification tools can be configured to remove protected health information (PHI) and personally identifiable information (PII) “on the edge,” before data enter the platform and become accessible to researchers, ensuring consistency across departments. Additionally, audit trails can be easily captured that show access logs, versioning, and processing actions.
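
The sketch below illustrates one way such “on the edge” de-identification could work, using pydicom to blank a handful of identifying header fields and strip private tags before files leave the acquiring site; the tag list is a small illustrative subset rather than a complete regulatory profile, and the folder names are assumptions.

```python
# A minimal sketch of "on the edge" de-identification: blank a handful of
# identifying header fields and strip private tags before files leave the site.
from pathlib import Path

import pydicom

IDENTIFYING_TAGS = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "ReferringPhysicianName", "InstitutionName",
]


def deidentify(src: Path, dst: Path) -> None:
    ds = pydicom.dcmread(src)
    ds.remove_private_tags()      # vendor-specific private tags may carry PHI
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            setattr(ds, tag, "")  # blank the value but keep the element
    ds.save_as(dst)


if __name__ == "__main__":
    out_dir = Path("deidentified")
    out_dir.mkdir(exist_ok=True)
    for path in Path("site_upload").rglob("*.dcm"):
        deidentify(path, out_dir / path.name)
```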

With a centralized, standardized, and scalable infrastructure that performs these functions, pharma research teams can enable FAIR-driven processes, meet their broader R&D goals, and plan for more ambitious data-driven objectives in the future. The ability to repeatedly, efficiently, and securely access massive amounts of curated data in the cloud enables research that wasn’t possible in the past. These tools allow R&D groups to process large, complex objects in varying formats in a standardized, reproducible, and repeatable manner, and to meet project deadlines without being burdened by manual data management.

By combining a sophisticated approach to data management with well-thought-out goals for complex analysis and/or AI/ML, organizations can avoid the pitfalls that have hindered digital transformations thus far. Scientists and researchers can focus on analyzing and processing datasets, rather than managing them—resulting in accelerated R&D and innovation to bring therapeutics to market faster.

Author Biography

Jim Olson is CEO of Flywheel, a biomedical research informatics platform. The company uses cloud-scale computing infrastructure to address the increasing complexity of modern computational science and machine learning. Jim is a “builder” at his core. His passion is developing teams and growing companies. Jim has over 35 years of leadership experience in technology, digital product development, business strategy, high-growth companies, and healthcare. He has worked for both large and startup companies, including West Publishing (now Thomson Reuters), Iconoculture, Livio Health Group, and Stella/Blue Cross Blue Shield of Minnesota.

 
