Informatics in the Analytical Laboratory: Vision for a New Decade


Wide-scale adoption of autosamplers attached to analytical instruments enabled a step-change increase in productivity in the laboratory. For the first time, analytical chemists could generate high-quality data without being in the lab continuously. Today, we take this capability for granted and it’s difficult to imagine manually introducing every sample to an instrument.

If we fast forward several decades we are on the verge of a similar transformation in laboratory efficiency and capability. Outside the laboratory, the last decade has seen incredible advances in widely accessible technologies. Our lives have been changed by smart phones, laptop and tablet computers, wireless connections, and touch-screen devices, along with social networking, blogs, wikis and on-demand access to books, music, and movies. Providing a new capability is not enough. The hallmark of a successful new product is its simple, intuitive, and responsive user-interface. No user manual is required. Although complex behind the scenes, the user-experience is easy and sometimes even fun. There are now cell phone applications that allow the user to take a picture of a wine label and the application displays a map of the local area with pins marking stores that sell that exact bottle of wine, with prices marked and store hours displayed. Extraordinarily complex algorithms and fast computation are employed, yet using the application is easy.

Now, take the case of the modern analytical laboratory. Laboratory software is powerful, but many applications are complex and require extensive training or dedicated expert users. Individual software packages often contain full functionality, trying to cater to every conceivable future need. Software applications and instrument hardware from different vendors don’t work together. In many cases, proprietary data formats are still prevalent and cannot be used by other applications. This “big box” software and lack of cooperation between applications and instruments means lost productivity in the lab. Our time is frittered away searching for data that’s difficult to find or “cleaning” and transforming incorrect data. It often takes 20 or 30 minutes to do things that, in principle, could take a few seconds. Each manual step repeated by many individuals over time increases costs. All of this leads to financial costs that are difficult to quantify, but nevertheless very real.

This complexity needs to change. This first paper of a two-part series provides qualitative concepts and ideas about potential improvements in efficiency and capability in the modern analytical laboratory. The second paper will focus on the technical feasibility, the business risks and benefits of adopting this software model and strategies for realizing these concepts.

Making it Easy Saves Time

The overarching principle is to bring easy-to-use, modular software applications, so common in our personal computing world, into the pharmaceutical laboratory. All of us have adapted to easy and powerful personal software; we now expect it from laboratory software [1]. Figure 1 contains some personal computing principles applied to the laboratory environment.

Let’s illustrate the principles shown in Figure 1 with just one simple example. I need to run a content uniformity study on a batch of immediate release tablets using HPLC. My samples are prepared and ready for analysis, but the HPLC I usually use is unavailable. I’ve not yet used this specific analytical method, but there is an approved analytical method document in our database. From a computer at my desk or in the lab, perhaps even a handheld tablet computer, I type a few “filter” words in a search text box to find the version-controlled analytical method document. Like searching for a song in the iTunes music store, my library of hundreds or thousands of analytical methods is winnowed down to the exact method I need. I simply drag an icon representing the analytical method onto an icon representing the exact HPLC instrument I am using. The instrument icon accepts the analytical method, and asks if I am ready to run samples. Having placed prepared samples on the instrument I click or tap “yes” and drag a sample tray icon representing my samples onto the instrument. The application then prompts me to visually confirm that the HPLC is configured with the correct mobile phase and then confirms that there is sufficient volume of mobile phase to complete the analysis. Samples are then automatically queued for analysis and I walk away: I’m done.

This scenario is markedly faster and simpler than the current state of the art (e.g., no transcription). If, during the analysis, any measurement from the instrument (e.g., pump pressure) fell outside predefined limits, an email could be sent warning that the instrument was not operating under optimal conditions. The email could contain a suggested cause of the problem and current instructions on how to fix it. Upon successful completion of the run, a report could be automatically generated, summarizing the results in a format that can be provided to other colleagues.
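The limit check described above can be sketched in a few lines. This is only an illustration: the names `Limit` and `check_reading` and the pressure limits are hypothetical, and a real system would presumably take its limits from the analytical method and email the message rather than print it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limit:
    """Predefined acceptable range for one instrument measurement (hypothetical)."""
    name: str
    low: float
    high: float
    unit: str
    advice: str  # suggested cause and fix to include in the warning


def check_reading(limit: Limit, value: float) -> Optional[str]:
    """Return a warning message if the reading is outside its limits, else None."""
    if limit.low <= value <= limit.high:
        return None
    return (
        f"WARNING: {limit.name} = {value:g} {limit.unit} is outside "
        f"{limit.low:g}-{limit.high:g} {limit.unit}. {limit.advice}"
    )


# Illustrative values only; not taken from any real method.
pump_pressure = Limit(
    name="pump pressure",
    low=50.0,
    high=400.0,
    unit="bar",
    advice="Possible blockage: check the in-line filter and column frit.",
)

for reading in (180.0, 415.0):
    message = check_reading(pump_pressure, reading)
    if message is not None:
        print(message)  # a production system might email this instead
```

Keeping the advice text next to the limit means the warning can carry a suggested cause and fix without any extra lookup.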

Well-Defined, Accessible, and Extensible Information Flow

How is this easy workflow possible? Behind the scenes, the analytical method document contained instructions for implementing the method on any instrument, regardless of vendor or model, including instrument parameters and instructions on how to process the chromatograms and report the results. If the instrument configuration was different than that specified in the method, the user would be prompted to confirm various changes in the configuration, perhaps with default instrument parameters set according to business rules. Likewise, information about samples (e.g., identity, composition, chemical structures, concentration) also “flowed” through the system and onto the instrument, enabling the analysis. Data integrity is ensured because manual transcription is removed. Although this example is specific to HPLC, the scenario could be enabled for any analytical technique.
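One way to picture the configuration check is as a comparison between what the method requires and what the instrument reports. The sketch below assumes both are available as simple key-value settings; the function name `config_differences` and all parameter names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def config_differences(
    method: Dict[str, str], instrument: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """List (parameter, required, actual) for every setting the user must confirm."""
    differences = []
    for parameter, required in sorted(method.items()):
        actual = instrument.get(parameter, "<not available>")
        if actual != required:
            differences.append((parameter, required, actual))
    return differences


# Illustrative settings only.
method_requirements = {
    "column": "C18, 50 x 2.1 mm",
    "detector": "UV 254 nm",
    "column_temperature": "40 C",
}
instrument_configuration = {
    "column": "C18, 50 x 2.1 mm",
    "detector": "UV 254 nm",
    "column_temperature": "35 C",
}

for parameter, required, actual in config_differences(
    method_requirements, instrument_configuration
):
    print(f"Confirm change: {parameter} is {actual}, method specifies {required}")
```

An empty list would mean the method can be queued without prompting the user.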

This smooth “flow” of information – both data and instructions – is the critical piece required for simplicity and enhanced productivity [2]. It’s the key piece missing in today’s analytical laboratory. We detail below how a coordinated information flow might work. Figure 2 shows the chronological, sequential flow of data and instructions required for every analysis. Although each experiment differs in the amount of time required to complete each unit process (box), absolutely every analysis requires the same flow of data and instructions: from the simplest single-shot analysis to the most complex multivariate optimization or robustness study.

The information flow shown in Figure 2 provides a conceptual framework to think about how we can use laboratory software to realize the principles shown in Figure 1. Data and instructions flow through each unit process (box), mediated through commonly agreed, standardized interfaces (red boxes with arrows connecting the unit processes). While globally accepted vendor-neutral standards for each interface are certainly desirable, they are not strictly required. However, information at each interface must be well defined, accessible, and easily extensible at all points. As discussed in the second paper, implementing these interfaces using XML technologies satisfies all requirements. Software to execute unit processes represented by purple boxes can be supplied by different vendors, while instrument control (green box) should be restricted to the hardware vendor. This conformance to a set of interfaces means that data can enter at various unit processes and pass safely through to the reporting of results: it is never transcribed manually, the same data is entered manually only once, and it is always validated upon entry. The dashed arrows represent optional unit processes discussed below. Wrapping the entire framework could be an application that documents and monitors nearly every part of the process, e.g., an electronic notebook in an R&D environment.

Supporting this entire flow are four functions that interrogate data and instructions contained at each interface, extracting only the meaningful pieces of information for a specific need. One function monitors live events (something happened, respond accordingly). A second function responds to new data generated in the flow and stores it appropriately in a pre-defined schema. A third function provides powerful search capability for all data that has ever passed through the flow, either graphically or by text entry. Finally, the fourth function applies configurable business rules at various points in the flow. In the previous example of a content uniformity experiment, the business rules might have specified the number of significant figures in the automatically generated report, based on an SOP or other guidance document.
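To make the business-rule idea concrete, the sketch below reads results from a small XML document and applies a significant-figures rule when building report rows. The XML element and attribute names, the sample values, and the functions `round_significant` and `report_rows` are all hypothetical; they are not a proposed standard and not the interface described in the second paper.

```python
import math
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical result document; element and attribute names are illustrative only.
RESULT_XML = """
<results method="content-uniformity" units="percent-label-claim">
  <sample id="tablet-01"><assay>99.8472</assay></sample>
  <sample id="tablet-02"><assay>101.2319</assay></sample>
  <sample id="tablet-03"><assay>98.0561</assay></sample>
</results>
"""


def round_significant(value: float, figures: int) -> float:
    """Round a value to the given number of significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, figures - 1 - exponent)


def report_rows(xml_text: str, figures: int) -> List[Tuple[str, float]]:
    """Parse the results and apply the significant-figures business rule."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for sample in root.findall("sample"):
        # float() raises ValueError on a malformed entry, a simple form of
        # validation at the interface.
        assay = float(sample.findtext("assay"))
        rows.append((sample.get("id"), round_significant(assay, figures)))
    return rows


# The number of figures would come from an SOP; 3 is an arbitrary example.
for sample_id, assay in report_rows(RESULT_XML, figures=3):
    print(sample_id, assay)
```

Because the rule is a parameter rather than hard-coded formatting, a change in the SOP would change one configuration value, not the reporting software.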

Plug-and-Play: A Modular Approach Enables Different Rates of Change

Scientists are accustomed to modular instrument hardware. A new detector is easily swapped with the old module, while keeping the remaining instrument stack unchanged. This modular mindset allows for differential rates of change. We don’t need to wait for all instrument modules to improve before making an incremental improvement in just one module [3]. Furthermore, laboratories are never homogeneous with respect to instrument hardware; it should be possible to link software from one vendor with software from a different vendor. The modular approach advocated here addresses these needs.

Viewed from the context of Figure 2, and reflecting on the smart phone application concept, laboratory software breaks down logically into just a few decoupled, modular application groups, each of which does one or two things well and all of which communicate seamlessly with each other. The Microsoft Office suite of applications is an appropriate example. Microsoft Excel is a powerful spreadsheet application, but it’s not very good at creating documents. Yet a table can be pasted from an Excel worksheet into a Word document or PowerPoint presentation. This paper was written using the Microsoft Word word-processing program, the figures were created using Microsoft Visio diagramming software, and the entire document was converted to PDF using a third-party plug-in before being sent to the publisher. This is the level of communication between applications we seek.

Figure 3 is a translation of the unit processes and information flow depicted in Figure 2. It shows a logical grouping of software applications (purple, green) based on the data and instructions that each application produces (red).

Each of the five application groups contains at least one application that produces a well-defined and scope-limited collection of data and instructions. All applications should be ‘plug-and-play’, meaning each application could be from the same or different vendors. However, as discussed above, instrument control, that is, acquiring and recording the raw data (green), is best implemented by the instrument vendor, based on instructions received from Application Group 2. This strict separation of instrument instructions from execution supports the software-hardware compatibility that modern laboratories require. Decoupling the entire architecture into modular applications means that new capabilities can be added at different rates, a key requirement for continuously improving laboratory processes.

Efficiency: Point-and-Click Reports

The software application groups outlined in Figure 3 support daily, routine laboratory work by providing rich, graphical user-interfaces to enable single-purpose activities. However, additional data access needs often span large timescales or cross many data dimensions. Consider the travel website that summarizes trip options across multiple airlines and hotels, with just a few mouse clicks. Pharmaceutical laboratories require similar kinds of data aggregation from disparate sources. For example, a single-purpose web page could access the data shown in Figures 2 and 3 to provide a suite of reports. Examples include reports for certificates of analysis, batch impurities, drug substance and drug product stability, instrument inventory, computer inventory, chromatography column management, audit trails, instrument utilization metrics, security, sample metadata, analytical method metadata, user metadata, instrument maintenance history, data reprocessing history, out-of-specification and out-of-tolerance histories, etc. Creating reports like these that contain disparate data is time consuming and sometimes impossible. Point-and-click access to complex reports is timesaving and efficient.

More than Efficiency: New Capabilities

The modular software approach also affords several new capabilities. “Theory guides, experiment decides” the adage goes. There are many unmet needs for using theory and models in concert with experimentation. How often do we see data processing algorithms or predictive models in a peer-reviewed journal and think, “Hmm… that’s interesting, but I’ll never apply that to my own data. I don’t have the time, skill or resources to implement it.” Fortunately, computational advances made in the last decade have made it possible to change this mindset. Now, “average” users can access powerful and sophisticated computations at the push of a button. Well-established and novel computational algorithms from academic institutions can be brought into the industrial laboratory. Using the plug-and-play approach, there are at least three broad classes of computations that can be made available: advanced signal processing, predictive modeling, and optimization algorithms. While there are off-the-shelf solutions for some of these problems, they require expert users and they do not work with all vendor instruments. Examples are provided below.

Signal Processing for Routine Data Analysis

The vast majority of data generated in an analytical laboratory is first and second order [4]: a vector or matrix containing thousands of digitized responses measured at a constant increment (i.e., a sampled data series). Chromatography, NMR, IR, Raman, XRD, NIR, and UV-Vis are just a few examples. This common data type lends itself to well-understood signal processing algorithms [5]: Fourier transform, autocorrelation, smoothing, filtering, de-noising, ensemble averaging, spectral summation, background subtraction, peak finding, intelligent peak labeling, stochastic resonance, wavelet decomposition, peak picking, baseline removal algorithms, etc. The framework shown in Figure 2 supports a “toolbox” approach to signal processing on any first- or second-order data type, regardless of the analytical technique. Grab an icon representing an ensemble average from the toolbox, drop it on a collection of chromatograms obtained from an ultra-fast UHPLC-UV run and see if it improves the signal-to-noise ratio of a low-level peak (or an automated equivalent of this). Use the same toolbox approach to smooth an LC-MS TIC, apply a Kalman filter to unresolved chromatographic peaks, or use wavelets to de-noise an NMR spectrum [6]. The shared data type supports this level of plug-and-play signal processing across many of these analytical techniques from the same, single-purpose application. With the exception of processing NMR raw FID data, which contains both real and imaginary values, many of the raw data from different analytical techniques can benefit from this ‘cross-fertilization’ of signal processing algorithms.
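As an illustration of one tool from such a toolbox, the sketch below ensemble-averages a set of simulated traces. For uncorrelated noise, averaging N traces reduces the noise by roughly the square root of N. The synthetic Gaussian peak, the noise level, and the function names are assumptions made for the example; nothing here is tied to a particular instrument or data format.

```python
import math
import random
import statistics
from typing import List


def ensemble_average(traces: List[List[float]]) -> List[float]:
    """Point-by-point mean of equally sampled traces."""
    if not traces:
        raise ValueError("need at least one trace")
    length = len(traces[0])
    if any(len(trace) != length for trace in traces):
        raise ValueError("all traces must have the same number of points")
    return [statistics.fmean(points) for points in zip(*traces)]


def gaussian_peak(n_points: int, center: float, width: float, height: float) -> List[float]:
    """Synthetic chromatographic peak used only to generate test data."""
    return [
        height * math.exp(-0.5 * ((i - center) / width) ** 2)
        for i in range(n_points)
    ]


def baseline_noise(trace: List[float]) -> float:
    """Standard deviation over the first 50 points, a region without the peak."""
    return statistics.stdev(trace[:50])


rng = random.Random(0)  # fixed seed so the example is repeatable
true_signal = gaussian_peak(n_points=200, center=100, width=5, height=1.0)
# Sixteen repeat "injections" of the same sample with independent noise.
traces = [[y + rng.gauss(0.0, 0.5) for y in true_signal] for _ in range(16)]
averaged = ensemble_average(traces)

print("noise, single trace:  ", baseline_noise(traces[0]))
print("noise, ensemble mean: ", baseline_noise(averaged))
```

The function only requires equally sampled traces of the same length, so the same code could be applied to repeated spectra as readily as to chromatograms.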

Modeling in the Laboratory

Structure-property relationships are often used in research labs in our industry. It’s routine for the biological activity of millions of compounds to be predicted in silico. Given a molecule’s structure, the computer can suggest the molecule’s activity in a biological assay before the molecule is ever tested. The same opportunity exists in the analytical laboratory: predict a chromatographic response based on a molecule’s structure before running the sample. Alternatively, run a few experiments and other computer models can predict chromatographic responses under a variety of conditions based on well-established formulae. Like the signal processing toolbox approach, a modeling toolbox would enable predictions before extensive experimentation. Using in silico predictions to guide experiments can yield significant cost savings, provided we use the models properly. Push-button access to predictions both from semi-empirical and machine learning modeling approaches will save significant time, effort, solvents, and materials.
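As one example of the "run a few experiments, then predict" idea, the sketch below uses the linear solvent strength relationship for reversed-phase chromatography, log10 k = log10 k_w - S * phi, where k is the retention factor and phi is the volume fraction of organic modifier. The paper does not name a specific formula, so this choice is an assumption, as are the function names and the example retention factors.

```python
import math
from typing import Tuple


def fit_lss(phi_1: float, k_1: float, phi_2: float, k_2: float) -> Tuple[float, float]:
    """Fit log10(k) = log_kw - S * phi from two isocratic runs.

    Returns (log_kw, S).
    """
    if phi_1 == phi_2:
        raise ValueError("the two runs must use different organic fractions")
    s = (math.log10(k_1) - math.log10(k_2)) / (phi_2 - phi_1)
    log_kw = math.log10(k_1) + s * phi_1
    return log_kw, s


def predict_retention_time(log_kw: float, s: float, phi: float, t0: float) -> float:
    """Predict isocratic retention time t_R = t0 * (1 + k) at organic fraction phi."""
    k = 10 ** (log_kw - s * phi)
    return t0 * (1.0 + k)


# Two scouting runs (illustrative numbers): k = 8 at 40% organic, k = 2 at 60%.
log_kw, s = fit_lss(phi_1=0.40, k_1=8.0, phi_2=0.60, k_2=2.0)

# Predict the retention time at 50% organic for a column dead time of 0.5 min.
print(predict_retention_time(log_kw, s, phi=0.50, t0=0.5))
```

Two runs fix the two parameters exactly; with more scouting runs, a least-squares fit would be the natural extension.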

Optimization in the Laboratory

Optimization problems are arguably the most expensive and time-consuming activities in the analytical laboratory. Often called “large search space” problems, they are characterized by a known or desired output, but an inordinately large number of inputs. Chromatographic method development is a quintessential example: we know that an optimal chromatographic method is one that separates all compounds in a mixture with acceptable resolution in the minimum amount of time. Unfortunately, there are too many inputs: columns, mobile phases, temperature, pH, gradient profiles, etc. Computer predictions coupled with automated feedback loops offer unprecedented opportunities (see optional loop in Figure 2) [7]. For example, place racemate samples on a chiral screening UHPLC with several columns and mobile phases, click a few buttons to define the objective and constraints of the separation, employ a genetic algorithm to predict the best conditions to try next, and let the system iterate automatically until it converges on a set of optimal methods. Like the signal processing and modeling toolboxes, there are numerous other optimization algorithms to explore on real problems. Additionally, simple rules-based criteria can support automated workflows, like making blank injections on an LC until a suitably clean baseline is obtained prior to injecting samples. Automated push-button optimization of challenging analytical problems is an unmet need in the modern analytical laboratory and it is also well supported by the framework discussed here.
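The closed-loop idea can be sketched with a very small genetic algorithm. In the sketch below, `run_experiment` is a purely synthetic stand-in for injecting a sample and scoring the separation; the response surface, parameter ranges, and all names are invented for illustration and do not describe the method of reference [7] or any real instrument.

```python
import random
from typing import List, Tuple

# (column index, organic fraction, temperature in degrees C)
Conditions = Tuple[int, float, float]
N_COLUMNS = 4


def run_experiment(conditions: Conditions) -> float:
    """Synthetic separation score (higher is better).

    In a real loop this would be an injection followed by automated scoring.
    """
    column, organic, temperature = conditions
    column_bonus = (0.2, 1.0, 0.5, 0.7)[column]
    return column_bonus - 10.0 * (organic - 0.35) ** 2 - ((temperature - 40.0) / 50.0) ** 2


def random_conditions(rng: random.Random) -> Conditions:
    return (rng.randrange(N_COLUMNS), rng.uniform(0.05, 0.95), rng.uniform(20.0, 60.0))


def crossover(a: Conditions, b: Conditions, rng: random.Random) -> Conditions:
    """Take each setting from one parent or the other at random."""
    return (rng.choice((a[0], b[0])), rng.choice((a[1], b[1])), rng.choice((a[2], b[2])))


def mutate(conditions: Conditions, rng: random.Random) -> Conditions:
    """Randomly perturb the settings, keeping them inside the allowed ranges."""
    column, organic, temperature = conditions
    if rng.random() < 0.2:
        column = rng.randrange(N_COLUMNS)
    organic = min(0.95, max(0.05, organic + rng.gauss(0.0, 0.05)))
    temperature = min(60.0, max(20.0, temperature + rng.gauss(0.0, 2.0)))
    return (column, organic, temperature)


def optimize(generations: int = 20, population_size: int = 12, seed: int = 1):
    """Return the best conditions found and the best score in each generation."""
    rng = random.Random(seed)
    population = [random_conditions(rng) for _ in range(population_size)]
    history: List[float] = []
    for _ in range(generations):
        ranked = sorted(population, key=run_experiment, reverse=True)
        history.append(run_experiment(ranked[0]))
        parents = ranked[: population_size // 2]  # the best half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=run_experiment), history


best, history = optimize()
print("best conditions:", best)
print("best score per generation:", [round(score, 3) for score in history])
```

Because the best half of each generation is carried forward unchanged, the best score never decreases from one generation to the next. In a real laboratory loop each call to `run_experiment` would cost an injection, so results would be cached rather than re-evaluated.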

A Concrete Example

We have implemented some, but not all, of the capabilities described here in the context of a collection of 120 non-GMP, open-access analysis chromatography instruments [8]. For example, WKY and JS cited in [8] have developed two useful, modular applications. One is used to submit samples for analysis and indirectly control an HPLC instrument in an open-access mode outside the traditional CDS. Another application is used to remotely access, visualize and trend decentralized data from integrated chromatograms stored in a vendor-neutral format (XML). In-process results show up automatically for review upon completion. Historical results can be found quickly from any instrument in our worldwide collection and can be overlaid in seconds with a few button clicks, allowing rich interaction with the data. The applications are used daily by hundreds of scientists primarily because they are intuitive, responsive, and require no training in most cases. These are concrete examples of the modular, single-purpose applications working in concert with other applications as depicted in Figure 3, leaving instrument control and data acquisition to the instrument vendor software. We developed these custom solutions because there were no vendor solutions that were intuitive, powerful, and scalable across our global instrument fleet. However, we cannot and should not develop these solutions in isolation: we need off-the-shelf solutions from vendors.

In a second paper in the January / February 2011 issue, we will provide a detailed, flexible architecture that can be used to implement the ideas described in this paper. Topics will include discussion of extensible architectures, vendor-neutral standards, 21 CFR Part 11 compliance and validation, financial incentives and risks, and a concrete plan for rapid vendor implementation. We look forward to an honest discussion, collaboration, and criticism of this paper. Please contact us.


The authors thank Dr. John Hollerton for insightful discussions about this topic.


1. Schwartz, B., The Paradox of Choice: Why More is Less. Chapter 8: Why Decisions Disappoint: The Problem of Adaptation. Harper Perennial, 2004.

2. Womack, J.P., Jones, D.T., Lean Thinking. Chapter 3: Flow. Simon & Schuster UK Ltd, 2003.

3. Bean, M.F., Jin, Q.K., Carr, S.A., Hemling, M.E., An Architecture for Instrument-Independent LCMS Data Processing with Web-Based Review and Revision of Results, Proceedings of the 49th Annual Conference of the American Society for Mass Spectrometry, 2001.

4. Booksh, K.S., Kowalski, B.R., Theory of Analytical Chemistry. Anal. Chem., 1994, 66, 15: p. 782A – 791A.

5. Brereton, R.G., Chemometrics: Data Analysis for the Laboratory and Chemical Plant. Chapter 3: Signal Processing, John Wiley & Sons, 2003.

6. Hastie, T., Tibshirani, R., Friedman, J., Elements of Statistical Learning, 2nd Edition, Springer-Verlag, 2008. p. 174 – 181.

7. O’Hagan, S., Dunn, W.B., Brown, M., Knowles, J.D., Kell, D.B., Closed-Loop, Multiobjective Optimization of Analytical Instrumentation: Gas Chromatography/Time-of-Flight Mass Spectrometry of the Metabolomes of Human Serum and of Yeast Fermentations, Anal. Chem., 2005, 77: p. 290 – 303.

8. Roberts, J. M., Cole, S. R., Spadie, J., Weston, H. E., Young, W. K., American Pharmaceutical Review, Apr March 2010. p. 38 – 44.

Author Biographies

Dr. James M. Roberts is an Investigator in Analytical Sciences at GlaxoSmithKline. He works on Platforms and applies informatics, data integration, modeling, and automation to increase efficiency in analytical chemistry. He received a Ph.D. under the advisement of Professor Janet Osteryoung and has 12 years of experience in the pharmaceutical industry.

Dr. Mark F. Bean, Investigator at GlaxoSmithKline, has worked with automated MS and software solutions for research chemistry LCMS for 20 years. He was the architect-project lead for CANDI, a vendor-neutral LCMS login, processing, and results viewing suite used by GSK in the Philadelphia area. He is a founding member of the ASTM committee charged with the AnIML analytical data standard.

Dr. Steve R. Cole is Manager, Analytical Sciences, Chemical Development, at GlaxoSmithKline. He leads scientists who provide analytical deliverables for late-phase drug candidates and walk-up chromatography to more than 250 chemists on 8 sites. He received his Ph.D. from Professor John Dorsey in 1992 and has 20 years of experience in the pharmaceutical industry.

Dr. William K. Young is an Investigator in Analytical Sciences at GlaxoSmithKline in Stevenage, UK. He develops and maintains the walk-up chromatography systems within Chemical Development. He received his Ph.D. from Imperial College, London with Professor W. John Albery and has 11 years of experience in the pharmaceutical industry.

Helen E. Weston is a Principal Analytical Scientist at GlaxoSmithKline. She is responsible for ensuring consistency and quality of results for 56 instruments globally. She gained practical troubleshooting skills as a purification specialist in medicinal chemistry, developing new automated SPE equipment. She joined GSK after graduating in Chemistry from Oxford University.
