Structural Information to Aid in silico Therapeutic Antibody Design from Next-Generation Sequencing Repertoires

Introduction

Approvals of antibody and nanobody therapeutics are increasing rapidly.1 The Antibody Society has estimated that 2020 will see fifteen to twenty new therapies approved for use in the US or EU, up from an average of ten per year over 2014-2018. This success has been built on the continual optimization of in vivo and in vitro methods of therapeutic discovery. For example, transgenic mice generate more human-like antibodies against injected antigens,2 and high-throughput binding and developability assays have allowed researchers to scan more antibodies than ever before for their suitability.3 However, these methods are often time- and resource-intensive, sampling many ineffective antibodies en route to identifying a feasible candidate. Such inefficiency is likely to be unsustainable in the coming era of personalized medicine.

In silico (computational) methods aim to minimize this early-stage expense and reduce lead times by rationally proposing antibodies for experimental assessment. This can range from epitope-aware de novo design4,5 to improved ‘developability’ assessment.6-8 Most techniques currently work at the level of the antibody variable domain sequence. This is because sequence data vastly outweighs structural data, and, historically, methods of antibody structure prediction have been prohibitively slow or inaccurate. However, it is ultimately the structure of the variable domain that governs an antibody’s binding characteristics and biophysical properties, and much of this cannot be inferred from sequences alone.

Over the past five years, our group has developed numerous algorithms for the fast and accurate structural characterization of antibodies. We have recently shown that these structural representations can be harnessed directly to avoid poor development decisions.8 Here, we focus on how our tools, which are all freely accessible online, can be used in conjunction with large next-generation sequencing datasets for antibody drug discovery.

Structural Antibody Databases

The Protein Data Bank (PDB) is a free resource of over 150,000 solved protein structures.9 Around 2% contain antibodies or nanobodies, yet there is no way to directly search the PDB for these entries. The Structural Antibody Database (SAbDab, http://opig.stats.ox.ac.uk/webapps/sabdab) mines the PDB weekly for all proteins that align to B-cell germlines,10 reliably identifying entries containing antibodies and nanobodies. All of these immune proteins are numbered using ANARCI11 (http://opig.stats.ox.ac.uk/webapps/anarci) and are searchable by their attributes, such as variable domain sequence, interface properties, or characteristics of their Complementarity-Determining Regions (CDRs), such as length or composition. Recently we have also developed Thera-SAbDab12 (http://opig.stats.ox.ac.uk/webapps/therasabdab), which scans SAbDab for near and exact sequence identity matches to over 450 WHO-recognized antibody and nanobody therapeutics. As of July 2019, around a quarter of these monoclonal therapeutic antibodies were structurally represented, while nearly half of bispecifics had at least one variable domain structure of interest.

Antibody Next-Generation Sequencing Data

Next-Generation Sequencing of Immunoglobulin genes (Ig-seq) is providing ever-expanding snapshots of the adaptive immune system.13 Currently, most Ig-seq studies sample between 10^5 and 10^7 antibodies.14 While only a relatively small sample of the overall variable domain diversity (around 10^11 antibodies according to a recent estimate15), these samples already appear sufficient to perform meaningful informatics.

Ig-seq datasets have historically been scattered across many websites, supplied as raw nucleotide reads. Our group has created the Observed Antibody Space (OAS, http://antibodymap.org/oas) database that pools these datasets together into a single repository of cleaned, annotated, and numbered amino acid sequences.14 OAS currently contains around 1.2 billion entries. These are entire heavy chain (VH; 1.15Bn) and light chain (VL; 60M) variable domain sequences from across 65 separate Ig-seq studies. Original metadata is preserved to allow interrogation of specific repertoire categories (species, vaccination status, isotype, etc.).

We have demonstrated that the OAS resource is already useful for drug discovery. We scanned OAS for sequence matches to 242 clinical-stage therapeutic antibodies, and found 54 perfect matches between therapeutic CDRH3s and natural CDRH3s – the CDR often considered to be the dominant driver of specificity.16 We also identified two clinical-stage therapeutics where all heavy and light chain CDRs matched perfectly to those of an Ig-seq VH and VL sequence. This high degree of overlap strongly supports the case for using Ig-seq data in early-stage discovery. The metadata of matches were also reflective of developmental origin (e.g. -umab therapeutics matched more closely to human Ig-seq data), suggesting that OAS could be harnessed for humanness assessment.
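The exact-match part of such a scan can be sketched in a few lines. This is a minimal illustration only, assuming the CDRH3 sequences have already been extracted from numbered entries; the function name, therapeutic names, and sequences are all hypothetical and are not taken from OAS or from the study.

```python
from typing import Dict, Iterable, List


def find_exact_cdrh3_matches(
    therapeutic_cdrh3s: Dict[str, str],
    repertoire_cdrh3s: Iterable[str],
) -> List[str]:
    """Return names of therapeutics whose CDRH3 occurs verbatim in the repertoire."""
    observed = set(repertoire_cdrh3s)  # set membership keeps each lookup constant-time
    return sorted(
        name for name, cdrh3 in therapeutic_cdrh3s.items() if cdrh3 in observed
    )


# Made-up toy sequences, for illustration only.
therapeutics = {"mab_A": "ARDYYGSSYFDY", "mab_B": "ARGGWLRPFDV"}
repertoire = ["ARDYYGSSYFDY", "AKDLGYSSGWYFDL", "ARDRGYYFDY"]
print(find_exact_cdrh3_matches(therapeutics, repertoire))
```

Near matches would need a sequence identity measure rather than set membership.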

Cleaning Ig-seq Datasets

Raw Ig-seq data is known to contain sequencing misreads.15 It is therefore important to filter out sequences with a high likelihood of containing an error. There are multiple computational tools for this purpose; our AntiBOdy Sequence Selector (ABOSS, available from http://opig.stats.ox.ac.uk/resources) is the first to use structural information. ABOSS analyzes the proportion of sequences missing a key structurally conserved disulfide bridge to calculate an approximate error rate for each Ig-seq study.17 It then analyzes every sequence in turn, comparing the residue seen at each position to the residues represented at the same position across the rest of the dataset. If that residue occurs less often than the calculated error rate, it is annotated as a potential sequencing error. In this way, ABOSS can filter Ig-seq datasets to leave only sequences of the required fidelity.
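The filtering logic described above can be illustrated with a simplified sketch. It is not the ABOSS implementation: it assumes the sequences are already numbered and aligned to equal length, it takes the fraction of sequences lacking either conserved cysteine directly as the error rate, and the column indices and function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def estimate_error_rate(seqs: Sequence[str], cys_columns: Tuple[int, int]) -> float:
    """Fraction of sequences lacking a cysteine at either conserved column."""
    missing = sum(1 for s in seqs if any(s[c] != "C" for c in cys_columns))
    return missing / len(seqs)


def flag_suspect_positions(seqs: Sequence[str], error_rate: float) -> List[List[int]]:
    """For each sequence, list the columns whose residue is rarer than error_rate."""
    n = len(seqs)
    # One residue counter per aligned column across the whole dataset.
    column_counts = [Counter(column) for column in zip(*seqs)]
    return [
        [i for i, res in enumerate(s) if column_counts[i][res] / n < error_rate]
        for s in seqs
    ]


# Toy aligned dataset: conserved cysteines assumed at columns 1 and 4.
seqs = ["ACDECF"] * 17 + ["ACDESF", "ASDECF", "ACWECF"]
rate = estimate_error_rate(seqs, cys_columns=(1, 4))
flags = flag_suspect_positions(seqs, rate)
kept = [s for s, f in zip(seqs, flags) if not f]  # sequences with no flagged column
print(rate, len(kept))
```

In this toy dataset two of twenty sequences lack a conserved cysteine, so any residue seen in fewer than 10% of sequences at its column is flagged, and the three singleton variants are removed.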

Structurally Characterizing Ig-seq Datasets

Current analysis of Ig-seq data is predominantly performed at the sequence level. For example, clonotyping, a popular method to group sequences by inferred genetic origin,15 usually involves two sequence-based descriptors: V/J gene identity and CDRH3 sequence identity or similarity. Antibody maturity can be estimated by the number of mutations away from the aligned germline sequence. While these features can be informative, they can also be misleading, as it is possible for CDRH3s of dissimilar sequence (e.g. < 30% sequence identity, and different V and J genes) to have remarkably similar structures and vice versa.18
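A minimal sketch of sequence-based clonotyping follows, to make the two descriptors concrete. It assumes that reads must share V gene, J gene, and CDRH3 length, and then clusters greedily on CDRH3 identity to the founding member of each cluster. The 80% threshold, the equal-length requirement, and all names here are illustrative assumptions; published pipelines differ in these details.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    v_gene: str
    j_gene: str
    cdrh3: str  # assumed non-empty


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def clonotype(reads: List[Read], min_identity: float = 0.8) -> List[List[Read]]:
    # Bucket by V gene, J gene and CDRH3 length before comparing sequences.
    buckets: Dict[Tuple[str, str, int], List[Read]] = defaultdict(list)
    for read in reads:
        buckets[(read.v_gene, read.j_gene, len(read.cdrh3))].append(read)

    clusters: List[List[Read]] = []
    for bucket in buckets.values():
        bucket_clusters: List[List[Read]] = []
        for read in bucket:
            # Greedy: join the first cluster whose founder is similar enough.
            for cluster in bucket_clusters:
                if identity(read.cdrh3, cluster[0].cdrh3) >= min_identity:
                    cluster.append(read)
                    break
            else:
                bucket_clusters.append([read])
        clusters.extend(bucket_clusters)
    return clusters


# Made-up reads for illustration.
reads = [
    Read("IGHV3-23", "IGHJ4", "ARDYYGSSYFDY"),
    Read("IGHV3-23", "IGHJ4", "ARDYYGSAYFDY"),  # one mismatch: same clonotype
    Read("IGHV3-23", "IGHJ4", "ARGGWLRPFDVW"),  # same genes, dissimilar CDRH3
    Read("IGHV1-69", "IGHJ6", "ARDYYGSSYFDY"),  # same CDRH3, different genes
]
print([len(c) for c in clonotype(reads)])
```

The last read shows the limitation discussed above: an identical CDRH3 paired with different genes is placed in a separate clonotype, regardless of any structural similarity.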

Structural analysis methods stand a better chance of identifying these cases. The Structural Annotator of AntiBodies (SAAB, http://antibodymap.org/structure) can accurately and rapidly map structural features onto Ig-seq datasets.19 It utilizes the fact that five of the six CDR loops cluster into broadly conserved backbone structures, known as canonical forms, and that these can be quickly predicted from sequence using our SCALOP (http://opig.stats.ox.ac.uk/webapps/scalop) software.20 As CDRH3 structures are more diverse, we model this loop using FREAD (http://opig.stats.ox.ac.uk/webapps/fread) to find a sequence-similar template with suitable grafting characteristics.21,22 SAAB can annotate Ig-seq VH sequences with structural information at a rate of 5 x 10^4 per hour parallelized over ten cores, allowing the structures within most Ig-seq repertoires to be analyzed in just a couple of days. It successfully classifies over 95% of canonical loops and benefits from high structural coverage of the CDRH3 loop lengths that tend to be used in therapeutics.8
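To put the quoted throughput in context, the arithmetic can be sketched as follows. This assumes run time scales linearly with the number of sequences, which is our simplification; the function name is hypothetical.

```python
SAAB_RATE_PER_HOUR = 5e4  # VH sequences per hour on ten cores, as quoted in the text


def annotation_hours(n_sequences: float, rate_per_hour: float = SAAB_RATE_PER_HOUR) -> float:
    """Estimated wall-clock hours, assuming linear scaling with dataset size."""
    return n_sequences / rate_per_hour


for n in (1e5, 1e6, 1e7):
    hours = annotation_hours(n)
    print(f"{n:.0e} sequences: {hours:.0f} h ({hours / 24:.1f} days)")
```

At this rate a repertoire of 10^6 sequences takes roughly 20 hours on ten cores, while one of 10^7 sequences takes roughly 200 hours, a little over eight days.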

While VH is often assumed to drive binding specificity, VL (which is packed tightly against VH) can also provide crucial complementarity-determining interactions and can expand destabilizing surface interaction patches. Inclusion of VL in bioinformatic analysis pipelines could therefore yield more reliable binding predictions and developability assessment.

VH and VL pairings can be formally structurally modeled using ABodyBuilder23 (http://opig.stats.ox.ac.uk/webapps/abodybuilder), part of our SAbPred suite (Figure 1). ABodyBuilder derives a heavy and light chain orientation (ABangle24), homology models the framework and each CDR loop (SAbDab/FREAD), falls back on a hybrid homology/ab initio method for more complex loops (Sphinx25, http://opig.stats.ox.ac.uk/webapps/sphinx), and models antibody side chains (PEARS26, http://opig.stats.ox.ac.uk/webapps/pears). Its accuracy compares favorably to other antibody modelers,23 and it is sufficiently quick for high-throughput Ig-seq analysis: each prediction takes only 20 s on a single core if all loops can be homology-modeled. A unique feature of ABodyBuilder is its ability to provide a statistical estimate of model quality, based on benchmarking against SAbDab structures. This allows poor quality models to be stripped out of any analysis pipeline, improving prediction reliability.

Figure 1. The SAbPred Suite for analyzing antibodies.

Most Ig-seq datasets in OAS contain only the VH sequence; a few record both VH and VL, but in an unpaired manner. Therefore, to use ABodyBuilder with existing public Ig-seq data, one must pick a set of light chains (in-house, germline-based, or a representative set of unpaired light chains from an Ig-seq dataset) and pair them combinatorially with the VH sequences. ABangle similarity thresholds can then be applied to trim out those VH/VL combinations likely to be unstable, or whose orientations cannot be modeled reliably.8
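The combinatorial pairing step might be sketched as below. The `keep` predicate is a hypothetical stand-in for whatever orientation-based filter is applied downstream; it is supplied by the caller and is not an ABangle or ABodyBuilder function.

```python
from itertools import product
from typing import Callable, Iterable, Iterator, Tuple


def pair_chains(
    vh_seqs: Iterable[str],
    vl_seqs: Iterable[str],
    keep: Callable[[str, str], bool] = lambda vh, vl: True,
) -> Iterator[Tuple[str, str]]:
    """Yield every VH/VL combination accepted by the caller-supplied filter."""
    for vh, vl in product(vh_seqs, vl_seqs):
        if keep(vh, vl):
            yield vh, vl


# Placeholder identifiers stand in for real sequences.
heavy = ["VH_1", "VH_2", "VH_3"]
light = ["VL_a", "VL_b"]
all_pairs = list(pair_chains(heavy, light))
# Toy filter for illustration: reject any pairing that involves VL_b.
filtered = list(pair_chains(heavy, light, keep=lambda vh, vl: vl != "VL_b"))
print(len(all_pairs), len(filtered))
```

Because the number of pairs is the product of the two set sizes, the light chain set is best kept small and representative.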

Application: Antibody Model Libraries

Unlike biologic drug discovery, small molecule drug discovery benefits from in silico high-throughput screening, where large libraries of drug-like molecules or fragments are ‘docked’ into a target of interest to predict whether they will bind. Combinatorially pairing Ig-seq VH and VL sequences, followed by structural modeling, can yield diverse human antibody model libraries (AMLs), which are exploitable in a similar manner. These prospective antibody leads ought to have a low risk of immunogenicity, given their natural origin. The huge number of potential VH/VL combinations necessitates the use of a clustering protocol (e.g. based initially on sequence similarity, then on predicted structural similarity, and finally on model similarity8), to achieve as succinct a summary of binding site diversity as possible.
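The staged nature of such a protocol, with cheap criteria applied first and expensive ones last, can be sketched as below. This is a deliberate simplification: each stage is represented by a precomputed cluster label rather than a similarity calculation, one representative is kept per label, and all names and labels are hypothetical.

```python
from typing import Callable, Hashable, Iterable, List, NamedTuple, Sequence, TypeVar

T = TypeVar("T")


def staged_representatives(
    items: Iterable[T],
    stage_keys: Sequence[Callable[[T], Hashable]],
) -> List[T]:
    """Apply each stage in turn, keeping the first item seen for every distinct key."""
    survivors = list(items)
    for key in stage_keys:
        first_seen = {}
        for item in survivors:
            first_seen.setdefault(key(item), item)
        survivors = list(first_seen.values())  # dicts preserve insertion order
    return survivors


class Candidate(NamedTuple):
    name: str
    seq_cluster: str     # label from sequence-level clustering (assumed precomputed)
    struct_cluster: str  # label from predicted structural annotation (assumed)
    model_cluster: str   # label from clustering of full models (assumed)


candidates = [
    Candidate("ab1", "s1", "f1", "m1"),
    Candidate("ab2", "s1", "f1", "m1"),  # dropped at stage 1
    Candidate("ab3", "s2", "f1", "m1"),  # dropped at stage 2
    Candidate("ab4", "s3", "f2", "m1"),  # dropped at stage 3
    Candidate("ab5", "s4", "f3", "m2"),
]
stages = [
    lambda c: c.seq_cluster,
    lambda c: c.struct_cluster,
    lambda c: c.model_cluster,
]
print([c.name for c in staged_representatives(candidates, stages)])
```

Ordering the stages by cost means that full structural models only need to be built for candidates that survive the cheaper sequence and annotation stages.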

Application: Model-Based Developability Assessment

Beyond high-throughput screening, human AMLs can also help predict developability. We compared two of our largest human AMLs to models of a set of therapeutic antibodies in advanced clinical trials.8 Calculating several developability-linked variable domain descriptors, we saw that therapeutic models were like human antibody models in their surface charge properties, but differed in their CDRH3 length distribution and in their proximity/number of surface hydrophobic residues.

This realization led to the Therapeutic Antibody Profiler (TAP8, http://opig.stats.ox.ac.uk/webapps/tap), which provides a set of five Lipinski-esque27 guidelines for antibody therapeutics. By comparing a model of a submitted sequence to models of (currently) 377 post-Phase I therapeutics, TAP assesses whether a candidate has unusually long loops, or particularly localized levels of charge or hydrophobicity for a therapeutic. We selectively identified two problematic candidates that TAP would have advised against making, and several biopharmaceutical companies have decided to use TAP in-house for its reliability in highlighting their candidates with developability issues.
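The idea of judging a candidate against a reference distribution of therapeutics can be illustrated with a short sketch. It is not the TAP implementation: the two-sided check, the 5% tail, the flag names, and the reference values are all assumptions made for illustration.

```python
from typing import Sequence


def flag_metric(value: float, reference: Sequence[float], tail: float = 0.05) -> str:
    """Flag a descriptor value relative to values seen in a reference set."""
    ref = sorted(reference)
    if value < ref[0] or value > ref[-1]:
        return "red"  # outside anything observed in the reference set
    k = int(tail * len(ref))  # number of reference values in each tail
    if k and (value <= ref[k - 1] or value >= ref[-k]):
        return "amber"  # within the observed range, but in an extreme tail
    return "green"


# Made-up reference values 1..100 standing in for a descriptor distribution.
reference = list(range(1, 101))
for candidate_value in (50, 3, 98, 120):
    print(candidate_value, flag_metric(candidate_value, reference))
```

With this toy reference set, a value of 50 is unremarkable, 3 and 98 fall in the extreme tails, and 120 lies outside the observed range.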

Conclusion

As described above, Ig-seq datasets are likely to contain low-immunogenicity antibody leads for therapeutic development. Their structural annotation promises the isolation of candidates with improved biophysical properties, reducing the risk of unwanted aggregation and viscosity. Equally, structural informatics methods should more reliably identify the most complementary antigen binders.28 Ig-seq datasets themselves will only become more accurate representations of the immune repertoire, as paired-chain sequencing emerges as a high-throughput technique. We therefore believe that the structural annotation of Ig-seq data has immense potential and could revolutionize drug discovery.

References

  1. Kaplon, H. and Reichert, J.M. Antibodies to watch in 2019. mAbs. 2019;11(2):219-238.
  2. Zuberi, A. and Lutz, C. Mouse Models for Drug Discovery. Can New Tools and Technology Improve Translational Power? ILAR Journal. 2016;57(2):178-185.
  3. Parola, C., Neumeier, D., Reddy, S.T. Integrating high-throughput screening and sequencing for monoclonal antibody discovery and engineering. Immunology. 2018;153(1):31-41.
  4. Nimrod, G., Fischman, S., Austin, M., et al. Computational Design of Epitope-Specific Functional Antibodies. Cell Rep. 2018; 25(8):2121-2131.
  5. Sormanni, P., Aprile, F.A., Vendruscolo, M. Third generation antibody discovery methods: in silico rational design. Chem. Soc. Rev. 2018; 47:9137-9157.
  6. Sormanni, P., Amery, L., Ekizoglou, S., et al. Rapid and accurate in silico solubility screening of a monoclonal antibody library. Sci. Rep. 2017;7:8200.
  7. Jarasch, A., Koll, H., Regula, J.T., et al. Developability Assessment During the Selection of Novel Therapeutic Antibodies. J. Pharm. Sci. 2015;104:1885-1898.
  8. Raybould, M.I.J., Marks, C., Krawczyk, K., et al. Five computational developability guidelines for therapeutic antibody profiling. Proc. Natl. Acad. Sci. USA. 2019;116(10):4025-4030.
  9. Berman, H.M., Westbrook, J., Feng, Z., et al. The Protein Data Bank. Nuc. Acids Res. 2000;28(1):235-242.
  10. Dunbar, J., Krawczyk, K., Leem, J., et al. SAbDab: the structural antibody database. Nuc. Acids Res. 2014; 42(D1):D1140-D1146.
  11. Dunbar, J., Deane, C.M. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics. 2015;32(2):298-300.
  12. Raybould, M.I.J., Marks, C., Lewis, A.P., et al. Thera-SAbDab: the Therapeutic Structural Antibody Database. BioRxiv. 2019; doi: 10.1101/707521
  13. Georgiou, G., Ippolito, G.C., Beausang, J., et al. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nature Biotech. 2014;32:158-168.
  14. Kovaltsuk, A., Leem, J., Kelm, S., et al. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. J. Immunol. 2018; 201(7):2502-2509.
  15. López-Santibáñez-Jácome, L., Avendaño-Vázquez, S.E., Flores-Jasso, C.F. The Pipeline Repertoire for Ig-Seq Analysis. Front. Immunol. 2019;10:899.
  16. Krawczyk, K., Raybould, M.I.J., Kovaltsuk, A., Deane, C.M. Looking for Therapeutic Antibodies in Next-Generation Sequencing Repositories. mAbs. 2019; doi:10.1080/19420862.2019.1633884
  17. Kovaltsuk, A., Krawczyk, K., Kelm, S., et al. Filtering Next-Generation Sequencing of the Ig Gene Repertoire Data Using Antibody Structural Information. J. Immunol. 2018;201(12):3694-3704.
  18. Kovaltsuk, A., Krawczyk, K., Galson, J.D., et al. How B-Cell Receptor Repertoire Sequencing Can Be Enriched with Structural Antibody Data. Front. Immunol. 2017;8:1753.
  19. Krawczyk, K., Kelm, S., Kovaltsuk, A., et al. Structurally Mapping Antibody Repertoires. Front. Immunol. 2018;9:1698.
  20. Wong, W.K., Georges, G., Ros, F., et al. SCALOP: sequence-based antibody canonical loop structure annotation. Bioinformatics. 2018;35(10):1774-1776.
  21. Choi, Y., Deane, C.M. FREAD revisited: Accurate loop structure prediction using a database search algorithm. Proteins. 2010;78(6):1431-1440.
  22. Choi, Y., Deane, C.M. Predicting antibody complementarity determining region structures without classification. Mol. BioSyst. 2011;7(12):3327-3334.
  23. Leem, J., Dunbar, J., Georges, G., Shi, J., Deane, C.M. ABodyBuilder: Automated antibody structure prediction with data-driven accuracy estimation. mAbs. 2016;8(7):1259-1268.
  24. Dunbar, J., Fuchs, A., Shi, J., Deane, C.M. ABangle: characterising the VH-VL orientation in antibodies. Protein Eng. Des. Sel. 2013;26(10):611-620.
  25. Marks, C., Nowak, J., Klostermann, S., et al. Sphinx: merging knowledge-based and ab initio approaches to improve protein loop prediction. Bioinformatics. 2017;33(9):1346-1353.
  26. Leem, J., Georges, G., Shi, J., Deane, C.M. Antibody side chain conformations are position-dependent. Proteins: Struct., Funct., Bioinf. 2018;86(4):383-392.
  27. Lipinski, C.A., Lombardo, F., Dominy, B.W., Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997;23(1-3):3-25.
  28. Raybould, M.I.J., Wong, W.K., Deane, C.M. Antibody-antigen complex modelling in the era of immunoglobulin repertoire sequencing. Mol. Syst. Des. Eng. 2019; doi: 10.1039/C9ME00034H