Big Data in Clinical Research — Sharing, Dissemination, and Repurposing


Event Details



Sanchita Bhattacharya, Bioinformatics Project Lead at Bakar Computational Health Sciences Institute, UCSF School of Medicine

In the field of clinical research, we are just beginning to explore repurposing public datasets to build a knowledge base, gain insight into novel discoveries, and generate data-driven hypotheses that were not originally formulated in the published studies. This presentation will showcase the significant efforts in the meta-analysis of open-access immunological studies and secondary analysis of clinical trial data. The attendees will gain a clear understanding of recent trends in conducting data-driven science.

TRANSCRIPT:

John Fremer:

We would like to welcome you to the Sanguine Speaker Series Webinar: Big Data in Clinical Research – Sharing, Dissemination, and Repurposing. Today’s webinar is presented by Sanchita Bhattacharya, the informatics project lead at the Bakar Computational Health Sciences Institute at the UCSF School of Medicine. I will now hand it over to Sanchita.

Sanchita Bhattacharya:

Thank you for the nice introduction. Good morning from San Francisco. Good afternoon on the other side of the coast. Thanks for inviting me, and I'm really excited to present our work to this audience. Today, I'm going to talk about big data in clinical research: sharing, dissemination, and repurposing of open-access clinical and immunological data.

Sanchita Bhattacharya:

So clinical trials generate vast amounts of data, and a large portion is never published or made available to other researchers. On the other side, for the last few years there have been a lot of initiatives for data sharing, which could advance scientific discovery and improve clinical care by maximizing the knowledge gained from the data collected in trials, stimulating new ideas for research, and avoiding unnecessarily duplicative trials.

Sanchita Bhattacharya:

So in 2015, the IOM, the Institute of Medicine, part of the US National Academy of Sciences, released a report which provided guiding principles and a practical framework for the responsible sharing of clinical trial data. The chart outlined here shows the major stages of the clinical trial life cycle, what data should be shared, and when to share specific data packages in common scenarios, in order to help amplify the scientific knowledge while minimizing the risk.

Sanchita Bhattacharya:

As you can see here, different types of data are collected at various stages of the life cycle. There's metadata, that's the data about the data, like protocols, the statistical analysis plan, and analytical code. And there's the individual participant data, called the subject-level or raw-level data; that's the data collected from the participants. Primarily I'm referring to the raw data, which is then further cleaned, abstracted, coded, and transcribed to become analyzable data, mostly called the summarized data. And then once all the data is analyzed and ready for publication or submission to the regulatory agencies, there are post-publication data packages, the full data package. So you can see here the breadth of data that is collected during the clinical trial life cycle.

Sanchita Bhattacharya:

On the other side, as I mentioned earlier, industry and regulatory forces are driving these initiatives to publicly share patient-level data from clinical trials. And there are a lot of electronic portals being built to share the data, and that's the platform for generating better evidence, facilitating additional findings, confirming published results, and ultimately improving public health.

Sanchita Bhattacharya:

On the other side, there's a lot of data. When I say individual data, I'm also referring to the mechanistic data. As you know, the immune system is a very complex and dynamic biological system, which comprises diverse cell types, cytokines, and immune components with varying functional states. So there is a lot of data that you can collect at the population level and the single-cell level, across proteomics, transcriptomics, genomics, and epigenomics, and there are a lot of [inaudible 00:04:12] technologies collecting the data: on the proteomic side, flow cytometry, and lately single-cell-level data collected on the CyTOF platform, mass spectrometry, and then microarrays. Tons of data was collected in the last decade on microarrays, and now we're moving to RNA-seq and single-cell RNA-seq, DNA-seq, ChIP-seq, and you name it. So we are definitely in a data deluge here, with lots of high-throughput data being collected, but how do we transform this into knowledge?

Sanchita Bhattacharya:

So there are over 2,000 open-access data repositories. Typically the generalist repositories, you might have heard these names: Figshare, Dryad, Zenodo, Mendeley. Publication data is uploaded to these repositories to share the data, or to share the data from the tables generated in the publication; the underlying data is submitted to these generalist repositories. But there are also open and domain-specific repositories like GEO and SRA and a lot of other databases. And as you can see here, for data access there are a lot of levels. Some are open, some are restricted or embargoed, or there are data access restrictions: you need to register, or there's a fee required, or there's an institutional membership. So some of these have requirements. And also on the data upload side, not only data access but data upload: again, registration, or you need to be part of a consortium, and there's a database around that consortium where they're submitting the data. And then I already mentioned the generalist ones and the domain-specific ones like GTEx, TCGA, and many others.

Sanchita Bhattacharya:

Today I am going to highlight one such repository, ImmPort, which allows open access to clinical and immunological datasets. I'm the scientific program director for ImmPort, so you're going to be hearing more about the initiatives we have taken to repurpose the data that is submitted to the ImmPort database. But for the next few slides, I'm going to talk a little bit about the database itself. So the ImmPort data portal was developed over a decade ago to collect and share research and clinical trial data from research funded by the National Institute of Allergy and Infectious Diseases, Division of Allergy, Immunology, and Transplantation, and to promote the FAIR principles of findability, accessibility, interoperability, and reusability of the data.

Sanchita Bhattacharya:

And so, the ImmPort ecosystem comprises a private data space: that's a private workspace where the data providers, the data submitters, upload the data. Once they are ready to share the data, the data is embargoed until we hear from the investigators that they are ready to publish their paper and release the data to the research community. Then it goes to the shared data space. And then we also have resources and data analytical tools, under data analysis. We have workflows, and as we go, I'm going to explain a little about these analysis tools and the resources that we have built over the years.

Sanchita Bhattacharya:

Regarding how to access ImmPort, [inaudible 00:07:52] easy. If you have an email of your choice, you just register. Just provide your name with an email address and you will be given access to these datasets. There's no proposal process, which we have seen in other databases, where you need to write a proposal and a committee reviews the proposals. For ImmPort, we just need a valid email address and a name, and you're good to go.

Sanchita Bhattacharya:

So who is contributing the data? Quite often I get asked, where are you getting all this data from? ImmPort redistributes the data from major NIAID-funded programs. As you can see here, we have a long list of different NIAID programs submitting data: over 30-plus programs submitting clinical and basic immunology research data to this portal. And lately we also have some COVID-19 data coming to ImmPort. I have highlighted a few on the left-hand side with the logos, like the Human Immunology Project Consortium. They're a big data contributor. Through this program, well-characterized human cohorts are studied using a variety of modern analytical tools, like transcriptional, cytokine, and proteomic assays, multi-parameter phenotyping of leukocyte subsets, and assessment of the functional status of leukocytes.

Sanchita Bhattacharya:

We also work with the Bill & Melinda Gates Foundation, so we have TB data coming from BMGF, and March of Dimes for the preterm birth research data. And then we also have data from AMP, the Accelerating Medicines Partnership in Rheumatoid Arthritis and Lupus. It's a joint collaboration between NIH, NIAMS, that's the National Institute of Arthritis and Musculoskeletal and Skin Diseases, industry, and academia, coming together to collect high-throughput, multiomics data, such as CyTOF or bulk or single-cell RNA-seq, and many more, from those arthritis and lupus patients. We also have some cancer immunotherapy data coming from the Oncology Models Forum, which collects data from various cancer models: genetically engineered, transplantation-induced, and spontaneous mammalian models. So you can see the breadth of data, and there is more; we are getting a lot of transplant data as well, and it's all submitted and shared through ImmPort.

Sanchita Bhattacharya:

So, the focus areas and data types in ImmPort, just to give you a flavor of the types of studies. As you can see from the pie chart, we are heavy on vaccine response: we have about 143 studies on vaccine response, immune response, infection response, transplantation, autoimmune diseases, allergy, and preterm birth. And regarding the diseases, we have roughly about 112. Most of them are related to infectious disease. We have influenza, aging, transplantation. And then we also have a lot of pregnancy data, small [inaudible 00:11:25], and smallpox, lupus, dengue.

Sanchita Bhattacharya:

And in terms of the experiments, we support multimodal data, not just one data type. We support flow cytometry, where we are one of the largest repositories, and transcript profiling, ELISA, Luminex, all these immune measurement techniques. We support RNA-seq and CyTOF; we have lots of CyTOF data, and then HLA typing data, [inaudible 00:12:03] microarrays, and so on.

Sanchita Bhattacharya:

And then here is the distribution of the human subjects. As with most databases, it is a little bit biased toward the white population, and then we have all the other races and ethnicities as well. For immunological research, we have about 305 studies, and there are quarterly releases of new studies. So every quarter, it depends, but somewhere between 10 to 20 studies get released, and sometimes more. The data summary that I'm showing you today is from September 2020, and very soon we'll be coming up with the next data release. And we have about 140 clinical trials.

Sanchita Bhattacharya:

A little bit about the data model: the ImmPort database is a relational database, so the data is stored in tables, and the ImmPort database is very study-centric. So we start with a study. We describe a study, and then within a study... Let me just give you an example. A study has subjects, and subjects also belong to a pool or a cohort, which means treatment versus control [inaudible 00:13:23]. And from the subjects you have the biosamples, and a biosample could be like [inaudible 00:13:28], for example, and then from the [inaudible 00:13:31], you isolate the serum or plasma. That is considered to be an experimental sample. You run the experiments on these experimental samples. The protocols from the experiments are also curated in the database. And so from experimental samples, there are different types of experiments carried out, from ELISA to flow cytometry and many other immune measurements.
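The study-centric model just described can be sketched as a small relational schema. This is only an illustration: the table and column names below are stand-ins, not ImmPort's actual schema.

```python
import sqlite3

# A minimal sketch of a study-centric relational model: study -> subject
# -> biosample -> experimental sample. Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE study    (study_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE subject  (subject_id TEXT PRIMARY KEY, study_id TEXT REFERENCES study,
                       arm TEXT);      -- e.g. treatment vs control cohort
CREATE TABLE biosample(biosample_id TEXT PRIMARY KEY, subject_id TEXT REFERENCES subject,
                       type TEXT);     -- e.g. whole blood, serum
CREATE TABLE expsample(expsample_id TEXT PRIMARY KEY, biosample_id TEXT REFERENCES biosample,
                       assay TEXT);    -- e.g. ELISA, flow cytometry
""")
conn.execute("INSERT INTO study VALUES ('SDY1', 'Example vaccine study')")
conn.execute("INSERT INTO subject VALUES ('SUB1', 'SDY1', 'control')")
conn.execute("INSERT INTO biosample VALUES ('BS1', 'SUB1', 'whole blood')")
conn.execute("INSERT INTO expsample VALUES ('ES1', 'BS1', 'flow cytometry')")

# Walk from a study down to the experiments run on its samples.
rows = conn.execute("""
    SELECT st.study_id, su.arm, es.assay
    FROM study st
    JOIN subject su ON su.study_id = st.study_id
    JOIN biosample bs ON bs.subject_id = su.subject_id
    JOIN expsample es ON es.biosample_id = bs.biosample_id
""").fetchall()
print(rows)  # [('SDY1', 'control', 'flow cytometry')]
```

The point of the join is the study-centricity: every measurement is reachable by starting from a study and walking down the chain.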

Sanchita Bhattacharya:

One other thing I would like to mention: when we look at data harmonization, data standardization is very important, so data standardization and the use of ontologies. On the extreme right, you can see that we use a lot of these different ontologies to standardize the data, because this data is coming from different providers and it's diverse. I'll show you how we collect the data, but these ontologies help you to standardize the data.

Sanchita Bhattacharya:

So the data ingestion is in the form of templates. These are the data templates, matching exactly the tables that I showed you before. We give the data providers these data templates, which help them to populate the information, some of which is required and some optional: mandatory versus optional columns. As you can see, the ones in gray are optional, but there are required columns. And with these data templates, they submit the data to ImmPort.
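A template-based ingestion like the one described usually involves checking submissions for the mandatory columns before accepting them. Here is a rough sketch of that idea; the column names and the validation rule are hypothetical, not ImmPort's actual templates.

```python
# Sketch: check a submission for required columns before upload.
# Column names below are hypothetical, for illustration only.
REQUIRED = {"subject_id", "study_id", "biosample_type"}
OPTIONAL = {"comments", "collection_date"}

def validate_template(rows):
    """Return a list of error strings; an empty list means the template passes."""
    errors = []
    for i, row in enumerate(rows, start=1):
        # A required column is "missing" if absent or left blank.
        present = {k for k, v in row.items() if v not in (None, "")}
        missing = REQUIRED - present
        if missing:
            errors.append(f"row {i}: missing required column(s) {sorted(missing)}")
    return errors

good = [{"subject_id": "SUB1", "study_id": "SDY1", "biosample_type": "serum"}]
bad  = [{"subject_id": "SUB2", "study_id": "", "comments": "optional only"}]
print(validate_template(good))  # []
print(validate_template(bad))   # one error naming the missing required columns
```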

Sanchita Bhattacharya:

In the interest of time, I just wanted to give you an idea about how the data gets ingested into ImmPort. So once the data is in ImmPort from the different programs, as I said, I would like to show you... This is one of my favorite slides... the breadth of the data, and how, in a unified way, you can leverage all these datasets. This is an example of the influenza vaccination cohorts in the ImmPort database, coming from different sources. You can see here that the bubble chart illustrates the variety of these open-access flu-vaccination-associated studies in ImmPort. The big blue circle here consists of the healthy vaccinated individuals, and the number represents the number of study participants. So the datasets can range from a new dataset to fully integrated data generated on a variety of platforms, and behind them you have flow cytometry, microarrays, sequencing, ELISA, on thousands of patient samples.

Sanchita Bhattacharya:

But this clearly demonstrates the accessibility of a large volume of datasets within an area of interest in one central location. Besides the healthy cohort, you can see a lupus cohort; the common theme is that all of them are flu-vaccinated cohorts with either a co-morbidity or other features. There is an aging population, and twins: if you are doing genetic studies on vaccinated individuals, these are twins, so you could study the genetic as well as the environmental differences. Or a pregnancy cohort, or transplant data. So clearly, with the increasing awareness of the importance of sharing research data and research findings, there is an additional need to showcase how to best leverage these shared datasets across research domains.

Sanchita Bhattacharya:

So, two case studies. Our group has successfully demonstrated major efforts in the meta-analysis of these open immunological data to build a knowledge base, gain insight into discoveries, and generate data-driven hypotheses that were not originally formulated in the studies. And here, this is a [inaudible 00:17:44] where you can see that the primary investigators collect the data. It goes to the repositories, and then bioinformaticians like us download that data. We come up with new, novel hypotheses, and these result in secondary publications. We do validations, and then we publish papers, and it goes back to the database. And there is a paradigm shift. The experimentalists or basic immunologists start with an a priori hypothesis and run the experiment. We are looking more at big data and data-driven hypothesis generation, and then building a model and finding some novel discoveries or novel findings.

Sanchita Bhattacharya:

The next few slides are going to be all about the use cases, how we have leveraged the data. The first one is the RAVE analysis. This paper came out from our group back in 2015, where this clinical trial data was submitted to ImmPort. That's called the RAVE trial. It's an Immune Tolerance [inaudible 00:19:11] trial for the role of rituximab in ANCA-associated vasculitis. So I'll just go over a little bit about the trial. The first paper published on this trial was by Stone et al. in the New England Journal of Medicine in 2010. It is a randomized, double-blinded, active-controlled, phase II/III trial comparing rituximab to cyclophosphamide for the induction of remission.

Sanchita Bhattacharya:

As you can see on your left, there were two groups, rituximab plus glucocorticoids and cyclophosphamide plus glucocorticoids, given to these subjects. Two arms, with study visits; they were followed up to 18 months, and study data was collected from both treatment groups. There was a remission induction phase and a remission maintenance phase. The types of data collected on the clinical side were the assessments, lab tests, adverse events, and concomitant medications. And then on the mechanistic side, they ran the flow cytometry panels, ELISA, and the gene expression data. As you can see, lots of data, and all this data by phase went into ImmPort. This trial had about 197 participants; it was a multicenter trial.

Sanchita Bhattacharya:

So what was observed and published by Stone et al. was that 63 of the 99 patients in the rituximab group, which is roughly 64%, reached the primary end point, that is, complete remission, as compared to 52 out of 98, which is about 53%, in the cyclophosphamide group. So this is the data we got, and that's what we learned from reading the paper. Once the data went into ImmPort, we downloaded the data, and I would like to walk you through the process of how, as bioinformatics scientists doing data mining, we work with this data.
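As a quick sanity check on those proportions, the two reported remission rates can be compared with a generic two-proportion z-test. This is my own illustration of the arithmetic, not the statistical analysis used in the paper.

```python
import math

# Reported remission: 63/99 (rituximab) vs 52/98 (cyclophosphamide).
x1, n1 = 63, 99
x2, n2 = 52, 98
p1, p2 = x1 / n1, x2 / n2

# Pooled two-proportion z-test (a generic check, not the paper's analysis).
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

print(round(p1 * 100), round(p2 * 100))  # 64 53, matching the talk
print(round(z, 2))
```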

Sanchita Bhattacharya:

So once we downloaded the data and read the paper, the question that came to our mind was this: 35% of the patients treated with rituximab and 47% treated with cyclophosphamide failed to achieve remission. So, retrospectively, do any major factors predict the response to therapy? Can we find out from this raw, submitted data what might predict response to therapy and why these patients didn't achieve remission?

Sanchita Bhattacharya:

So the process was that, since the data was available, we looked at the clinical data from the first six months. For the mechanistic data, we downloaded the flow cytometry data, and we also downloaded the ANCA titers, which were available. For the flow cytometry data, we had 15 panels and 36,000 files, so we used an automated gating approach to do a high-throughput analysis. And the very first thing that we did, since the paper was published, was to validate that all the data we downloaded matched what's reported in the paper. In the paper, they did manual gating, and we reproduced the same results using automated gating of the flow cytometry data. And as you can see here, the graphs on the left and on the right match very well.
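To make the idea of automated gating concrete, here is a toy sketch of a threshold-based gate on one marker channel, standing in for the high-throughput gating of thousands of FCS files described above. Real pipelines use dedicated tools (for example flowCore/openCyto in R); the data and method here are purely illustrative.

```python
import random
import statistics

random.seed(0)
# Simulate fluorescence intensities: a dim population and a bright one.
events = [random.gauss(100, 15) for _ in range(500)] + \
         [random.gauss(300, 30) for _ in range(500)]

def auto_gate(values):
    """Place a gate between the two population modes (crude 1-D 2-means split)."""
    threshold = (min(values) + max(values)) / 2
    for _ in range(10):  # iterate like 1-D k-means to refine the cut point
        low = [v for v in values if v <= threshold]
        high = [v for v in values if v > threshold]
        threshold = (statistics.mean(low) + statistics.mean(high)) / 2
    return threshold

t = auto_gate(events)
positive_frac = sum(v > t for v in events) / len(events)
print(round(positive_frac, 2))  # ~0.50 by construction (equal-sized populations)
```

The appeal of automating this step is exactly what the talk describes: the same rule runs identically over tens of thousands of files, and its output can then be validated against manually gated results.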

Sanchita Bhattacharya:

So that gave us a lot of comfort, in terms of, "Yeah, we can proceed further with this data, because we were able to reproduce the results." And then we went further and asked a question that was not asked by the primary investigators: how about looking at the data at baseline? So, before treatment. Since we were fortunate enough to have that baseline data, we wanted to see from the flow cytometry data whether there was any difference.

Sanchita Bhattacharya:

So what you see here in the violin plots is that we have two groups, the rituximab treatment group and the cyclophosphamide treatment group. At baseline, we looked at the granularity index; basically we were focusing on the granulocytes, and we saw some differences. As you can see here, within the granulocytes there are subsets. Some have low granularity: we call those low granulocytes, meaning hypogranulated, and high means hypergranulated subsets. And there is a gradient from light pink to dark green: the ones in light pink are hypogranulated, and the ones in dark green are hypergranulated. So now, if you look at the rituximab group and the cyclophosphamide group: in the rituximab group, on the x-axis, those who achieved complete remission were a little bit higher, on the low-granulocyte, hypogranulated side, versus those in the cyclophosphamide group, who were more on the hypergranulated side.

Sanchita Bhattacharya:

And that's what we show in the violin plots, where you can see that those who achieved remission in the rituximab treatment group had a somewhat higher granularity index, compared to the cyclophosphamide group, where remitters had a lower granularity index. So, hyper and hypo; just for the sake of simplicity, we [inaudible 00:25:31] it as a granularity index.

Sanchita Bhattacharya:

And so here, with these 187 subjects, we were able to partition them. As you can see here, you have the GI index, and looking at the rituximab and cyclophosphamide groups, this is basically more of a personalized treatment option based on the cellular profiling and subsets in the groups. We can now predict, based on who is more granulated at baseline, who takes this route and who is going to achieve remission; and for those who are less granulated, you go on this side, the left arm, and see who is going to have a complete remission.
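The stratification idea described above can be sketched as a simple decision rule: route patients to a therapy based on their baseline granularity index. To be clear, the cutoff value, the direction of the rule, and the patient data here are all purely illustrative, not the published model.

```python
# Hypothetical threshold separating the two baseline granularity profiles.
GI_CUTOFF = 1.0

def suggest_arm(granularity_index):
    """Illustrative routing rule: one GI profile -> rituximab, the other ->
    cyclophosphamide. The direction and cutoff are NOT the published result."""
    return "rituximab" if granularity_index < GI_CUTOFF else "cyclophosphamide"

cohort = {"patient_A": 0.7, "patient_B": 1.4}  # made-up baseline GI values
assignments = {pid: suggest_arm(gi) for pid, gi in cohort.items()}
print(assignments)  # {'patient_A': 'rituximab', 'patient_B': 'cyclophosphamide'}
```

This is the "profiled therapy" idea in miniature: a single baseline cellular measurement partitions patients before treatment, instead of one size fits all.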

Sanchita Bhattacharya:

So basically we provide a more personalized treatment option. The current method from the first paper is a non-profiled therapy; it's one size fits all. But we propose a method, based on the granularity index, for how you could have a profiled therapy. So this was our first paper about reanalyzing clinical trial data and coming out with a novel way of treating the patients, so that clinical trialists who design follow-up trials can take this into consideration for future trials.

Sanchita Bhattacharya:

So based on this success, we embarked on another project. This paper came out in Cell Reports in 2018. From this ImmPort data we built the 10,000 Immunomes Project: the building of a resource for human immunology. The motivation for this project came from resources such as the 1000 Genomes Project, the Wellcome Trust Case Control Consortium, and others, which have uniquely enabled the understanding of global variation in the human genome, in health and disease.

Sanchita Bhattacharya:

To date, however, human immunology has had no such resource, and given the recent growth in open immunological data, with, as you can see, lots of data being collected into ImmPort, this was an attempt to synthetically construct a reference immunome by integrating the individual-level data from publicly available immunological studies within ImmPort. Moreover, as you know, immune assays have not been reproducibly characterized for a significantly large and diverse healthy human cohort. What I mean here is the baseline data. So what we built was a large, diverse, clean reference dataset, with interactive data visualization, custom control cohorts, and standardized data download. That's what we have provided in the 10,000 Immunomes Project.

Sanchita Bhattacharya:

Let me walk you through the process. We did a manual curation of the studies, looking at the control or treatment arm and the planned disease, to filter for normal human subjects. So we just subset the data within ImmPort, coming from different studies, and looked at the control arms, so-called normal healthy individuals. There were about 242 studies when we started this project, with 44,000 subjects and 290,000-plus samples. We looked at only control arms with no manipulation, and then filtered for only normal controls. We went over the ImmPort study designs, read the inclusion/exclusion criteria, and even contacted the authors who submitted the data to ImmPort, for the sake of clarity. And then we also read the protocols.

Sanchita Bhattacharya:

After going through this rigorous filtering process, we were left with 85 studies. As you can see here on the right, each study in ImmPort has the study design and the types of data captured, and we manually read all of the studies. From these 85 studies, the types of data collected were CyTOF and flow cytometry data, mainly the cytokine data, and then other data types like gene expression data. So we standardized all of the data. We had a standardized pipeline for data cleaning and harmonization: basically, automated [inaudible 00:30:15] flow cytometry. We also had to standardize the cell subset names and then validate them against gold-standard hand-gated populations. So a lot of data science, data standardization, and harmonization went into all this data curation.
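Standardizing cell subset names across studies typically means mapping each provider's free-text labels onto one controlled vocabulary, as the talk's earlier mention of ontologies suggests. Here is a minimal sketch of that mapping step; the synonym table below is invented for illustration, not the project's actual vocabulary.

```python
# Illustrative synonym table mapping free-text labels to canonical names.
SYNONYMS = {
    "cd4 t cells": "CD4+ T cell",
    "cd4+ t cell": "CD4+ T cell",
    "t helper cells": "CD4+ T cell",
    "nk cells": "natural killer cell",
    "natural killer cells": "natural killer cell",
}

def standardize(name):
    key = " ".join(name.lower().split())  # normalize case and whitespace first
    return SYNONYMS.get(key, name)        # fall back to the original label

labels = ["CD4  T cells", "NK cells", "B cells"]
print([standardize(l) for l in labels])
# ['CD4+ T cell', 'natural killer cell', 'B cells']
```

In a real pipeline the canonical names would come from an ontology such as the Cell Ontology rather than a hand-written dictionary, and unmapped labels would be flagged for manual curation instead of passed through.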

Sanchita Bhattacharya:

The table on the left shows the number of samples and the number of subjects; we had 10,000-plus subjects, plus the types of measurements and data. You can look at ELISA, virus titers, and clinical lab tests. We also collect lab tests, so we had CBC [inaudible 00:30:56], metabolic panels, lipid profiles, cytometry, and so on. Just to give you an idea: we have a pretty wide age span, from newborns to older cohorts. And in terms of ethnic groups, we have a good distribution, but as mentioned before, each of these data [inaudible 00:31:23] was a little bit biased toward the white population, followed by Asians and others.

Sanchita Bhattacharya:

So basically, after standardizing the data for dissemination, whatever we learned, we curated. We have a website called 10KImmunomes.org, where you can go and, based on your data of interest (this is all built on the [inaudible 00:31:52] platform), look at immune assays or transcriptomics or proteomics. Depending on your interests, you can look at the data, and you can do all this in real time. This is an example of the interactive property; this is a screenshot of the web-based 10K Immunomes resource. The graph updates in real time, the subject counter updates, and you can select or unselect what you are not interested in, or plot by ethnicity, by study, and by age and sex. That kind of gives you an idea. And you can download as well: you can download the image, the data that is plotted here, and all of the data. So this is all, as I said, standardized.

Sanchita Bhattacharya:

So now, what can you do with this? When we published the paper, we wanted to show, as a use case, that you could build cell-cytokine networks from this baseline data. And there was one interesting study in ImmPort where the primary investigators were looking at the variation in serum cytokines over the course of pregnancy; they were looking at the immune perturbations during pregnancy. This study had been published previously. However, what we observed was that the study design did not incorporate a pre-pregnancy control. As you can see here, they had data from the first trimester, second trimester, and then six months post-partum. So they had all this data, but they didn't have any data before pregnancy; the pre-pregnancy data was missing.

Sanchita Bhattacharya:

So using the baseline data from 10,000 Immunomes, we were able to show, as you can see here highlighted in the red box, that we could plot the data for someone interested in what the pre-pregnancy control data looked like. Though those are from different women, you can still overlay the control on the same plot to see the differences, and the p-values of the changes happening over the pregnancy period. So this is one use case: even if you don't have the control data, you can download baseline data from resources like 10,000 Immunomes and plot it with your own data.

Sanchita Bhattacharya:

So this is another project from ImmPort, where this time we leveraged the data on living donors. Mostly you hear a lot about recipient data from transplant recipients, but this is data on living donors. In ImmPort, we have about 27 clinical trials on organ transplantation. We curated 20 of them, and there were some selection criteria. We didn't include the deceased donors; we were only looking at the living donors. And then we took the more complete data, removing variables with over 95% missingness and eliminating records where we didn't have enough data.

Sanchita Bhattacharya:

So with those selection criteria, we were left with about 11,000 subjects, and as you can see, these data were coming from different studies. We compiled the data and standardized it. The data that we curated were demographics, pre-transplant data where available, intraoperative and post-transplant data, and also the relationship of these donors to the recipients.
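The completeness filter mentioned above, dropping variables that are missing in more than 95% of subjects, can be sketched in a few lines. The records, field names, and threshold here are illustrative.

```python
# Toy records: 'rare_lab' is missing for everyone and should be dropped.
records = [
    {"age": 34, "bmi": 22.1, "rare_lab": None},
    {"age": 51, "bmi": None,  "rare_lab": None},
    {"age": 42, "bmi": 27.5, "rare_lab": None},
]

def keep_fields(rows, max_missing_frac=0.95):
    """Keep only the fields whose missingness is at or below the threshold."""
    kept = []
    for field in rows[0].keys():
        missing = sum(r[field] is None for r in rows) / len(rows)
        if missing <= max_missing_frac:
            kept.append(field)
    return kept

print(keep_fields(records))  # ['age', 'bmi']  ('rare_lab' is 100% missing)
```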

Sanchita Bhattacharya:

So the question was, what can you do with this data? We thought, how about a trajectory analysis? So this is the trajectory analysis that I'm showing you here. These are all the living donors, and they were followed; this is kidney transplant data that I'm showing you here. After they donated the kidney, they were followed for somewhere from two to four years.

Sanchita Bhattacharya:

So what we have plotted is a trajectory analysis of these living donors, color-coded based on whether they had any surgical complications: those are in red, and the non-surgical ones are in light blue. In the first two years, you mostly see a lot of surgical complications. As the years go by, we saw that living donors come up with other secondary complications, which are mostly non-surgical, and some of them are going back onto the waiting list. That is unfortunate, but we are just reporting the data that was submitted and what was reported. The data here is all based on what was reported in the questionnaires, and some were on long-term maintenance dialysis, and so on. So this kind of gives you an idea about the trajectory, the path a living donor goes through in their medical conditions. This resource is also available in ImmPort if you're interested.

Sanchita Bhattacharya:

And then the next project that I'm going to talk about... Let me see, how are we doing with the time? Okay. So this is about a robust and interpretable, end-to-end deep learning model for cytometry data. As I mentioned before, we are one of the largest repositories for flow cytometry data, so we started using deep learning to understand some of the cytometry data in ImmPort. We built a convolutional neural network model for cytometry data; here, the data I'm talking about is [inaudible 00:38:31] CyTOF data. Basically, the cell surface markers measured per cell form a matrix. We have different convolution layers, we pool the layers, and we also relate the cytometry data with the demographics. The non-cytometry data could be any clinical condition or demographic data, so we would like to see the influence of the cell subset populations on the clinical phenotypes.

Sanchita Bhattacharya:

So the dataset in this case contains CyTOF data, and we also have cytomegalovirus (CMV) serological data, from 472 healthy individuals across nine studies. These, coming from different studies, were split into a training set, a validation set, and a testing set. We were looking at CMV, which is a latent infection found in normal, healthy individuals. So there are two populations here, CMV positive and CMV negative, divided into the two groups based on the serological data, and then split into training, validation, and testing. The goal was to diagnose latent cytomegalovirus in healthy individuals using the deep learning model.

Sanchita Bhattacharya:

So what we did is… If you look at this matrix, the cells are on the rows and the surface markers are in the columns; this is the original data. Then we did some up-sampling of this data, and that's called the modified data. So this is the workflow for interpreting the full deep CNN model. Once we have up-sampled the data, we calculate the changes in the model output: original data versus modified data. We look at the delta Y, the difference between the model output on the original data and after the up-sampling. And then there is a decision tree that we use to interpret the output from the CNN model. The CNN model itself identifies the association between the immune cell subsets and CMV infection.

Sanchita Bhattacharya:

So here is the decision tree, which identifies the cells that lead to the largest changes in the model output, the delta Y when a cell is up-sampled, as I just described. Each node represents a cell subset, the rules by which these populations split are indicated inside the nodes, and the values in each node represent the percent of the subset in the total population and the average change in the model output. The red box highlights the node with the highest mean delta Y. So this is a permutation-based method to interpret the deep CNN model, and using this approach, we were able to identify a cell phenotype associated with CMV infection. As you can see here, the box plot shows you the difference between the CMV-negative and CMV-positive subjects. We consistently observed this phenotype: CD3-positive, CD8-positive, CD27-negative, CD94-positive cells. That cell phenotype is associated with latent CMV infection in positive individuals.
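The up-sampling step just described can be sketched in a few lines of Python. This is an illustrative toy, not the published pipeline: `X` is a small random cells-by-markers matrix, and `model` is a stand-in scoring function playing the role of the trained CNN, with a planted dependence on marker 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cells-by-markers matrix standing in for one CyTOF sample.
n_cells, n_markers = 200, 4
X = rng.normal(size=(n_cells, n_markers))

def model(X):
    # Stand-in "CNN": the score depends only on marker 2 (the planted signal).
    return float(np.tanh(X[:, 2].mean()))

y_original = model(X)

# Up-sample each cell in turn (duplicate it 20 times) and record delta Y,
# the change in model output attributable to boosting that cell's frequency.
delta_y = np.empty(n_cells)
for i in range(n_cells):
    X_mod = np.vstack([X, np.tile(X[i], (20, 1))])  # the "modified data"
    delta_y[i] = model(X_mod) - y_original

# Cells with the largest |delta Y| are the ones a decision tree would then
# split on to recover the driving cell subset / marker combination.
top_cells = np.argsort(-np.abs(delta_y))[:10]
```

In this toy, the top cells are exactly those with extreme marker-2 values, mirroring how the real decision tree recovers the CD3+CD8+CD27-CD94+ subset from the CNN's delta-Y values.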

Sanchita Bhattacharya:

So all of this we derived using the CNN model. Basically, just to give you an idea that using deep learning, you can identify these cell phenotypes. We have also provided a deep learning model tutorial in our paper, which was published in 2020 in PNAS from my group, and which helps you go through this process, this pipeline. We used Keras and TensorFlow to build this [inaudible 00:43:14] deep learning model. And [inaudible 00:43:18] if you have any questions, we can always talk offline if you want to go into details.
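The published model is built with Keras and TensorFlow; to show the shape of the computation without requiring either, here is a NumPy sketch of the forward pass of a model of this general kind. The sizes and weights are illustrative, not those of the published network: a per-cell dense map (equivalent to a 1x1 convolution over the cells-by-markers matrix), a ReLU, pooling across cells, and a sigmoid output for the CMV label.

```python
import numpy as np

rng = np.random.default_rng(1)

# One illustrative sample: a cells-by-markers matrix (sizes are made up).
n_cells, n_markers, n_filters = 500, 30, 8
X = rng.normal(size=(n_cells, n_markers))

W1 = rng.normal(size=(n_markers, n_filters)) * 0.1  # 1x1 "convolution":
b1 = np.zeros(n_filters)                            # same dense map per cell

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)  # per-cell filter responses (ReLU)
    pooled = h.mean(axis=0)           # pool across cells -> sample summary
    w2 = np.ones(n_filters)           # illustrative output weights
    logit = pooled @ w2
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> P(CMV positive)

p = forward(X)
```

Because the convolution is applied per cell and then pooled, the prediction does not depend on the order of the cells in the matrix, which matters for cytometry data where cell order is arbitrary.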

Sanchita Bhattacharya:

Let me move on and show you another project. This is unpublished work, but we are now leveraging real-world flow cytometry data from electronic health records. In the last couple of years, the research community has been witnessing wide adoption of electronic health record systems, which generate big, real-world data, and this opens up new avenues to conduct clinical research. So what you see here is a tool we developed for comparison. I showed you before the 10,000 Immunomes, which is baseline data from healthy individuals. We were comparing the 10,000 Immunomes cohort to datasets from the EHR, in this case the UCSF electronic health records. We had over 4 million patients, and out of this, we were leveraging the clinical flow cytometry data. We were very excited to see how much clinical flow cytometry data was available, sitting in a database, where we had lymphocyte subsets like the CD8, CD4, CD56, and CD19 subsets from over 100,000 [inaudible 00:44:48] blood samples in 41,000 patients. This is all coming from the patients' medical history, and we have demographics data and diagnostic codes for all these lab tests. So we were comparing the baseline healthy individuals to the EHR population, which, since it's the EHR, is mostly patients with certain conditions.

Sanchita Bhattacharya:

And if you look on the right, we have data primarily from immunodeficiency, a lot from the HIV group, and from transplantation, infections, metabolic disorders, and autoimmune disorders. We are about to submit this manuscript, but I can share some of the preliminary results. Here is the site, or app. It's not yet public, but it's a simple tool that will enable researchers to easily study the real-world population in the EHR, and it could serve as a model for other types of data. What you're seeing here is CD4-positive cells in patients with stem cell transplant status, compared to the healthy individuals from the 10,000 Immunomes. In the plots shown here, the 10,000 Immunomes baseline data is on the left and the EHR data is on the right. We also ran a [inaudible 00:46:16] analysis elucidating the different variables associated with the perturbations in cell count, and you can see it [inaudible 00:46:23] where these are the regression coefficients.

Sanchita Bhattacharya:

And in the interest of time, I'll just show you some of the additional analysis we did on this dataset. We were looking at CMV disease and comparing the PBMC cell counts in patients with stem cell transplant. This is [inaudible 00:46:43] stem cell transplant, but then along with the stem cell transplant, they also have CMV disease: zero for no CMV disease and one for CMV disease. Then we looked at the effect size of the CMV comorbidity, and of other autoimmune diseases, looking at the differences in the CD8, CD4, CD19, and CD56 cell subsets. So this is just to give you a flavor of clinical flow cytometry data from the real world, and how this kind of data mining can be transformed into knowledge for future studies.
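An effect-size comparison of the kind just described can be illustrated with Cohen's d, the standardized mean difference between two groups. The cell-count arrays below are synthetic stand-ins, not the EHR values discussed in the talk.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized mean difference with pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic example: CD8 counts in stem cell transplant patients
# with (1) vs. without (0) CMV disease.
cmv_pos = [820, 910, 1005, 760, 980]
cmv_neg = [450, 520, 610, 480, 555]
d = cohens_d(cmv_pos, cmv_neg)
```

Repeating this per subset (CD8, CD4, CD19, CD56) and per comorbidity gives the grid of effect sizes the app displays.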

Sanchita Bhattacharya:

So I just want to give you some pointers: if you are interested in these resources, go to the ImmPort website, which is open for all. If you go to the resources tab, whatever we learn from the projects we accomplish and publish, we also disseminate to the research community, so you can check out the ImmPort resources there. We also have tutorials for those who are bench immunologists or who are willing to learn more about how to use the APIs. ImmPort has APIs: if you want to retrieve the data using a programming interface, you can use the ImmPort APIs, and if you would like to learn the basics of programming, we have a lot of tutorials. And if you have some interesting data, you can always contact us.
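For those curious what retrieving data through a programming interface can look like, here is a minimal Python sketch using only the standard library. The endpoint URLs, parameter names, and the study accession are illustrative placeholders, not the documented ImmPort API; consult the ImmPort tutorials mentioned above for the real ones.

```python
import json
import urllib.request

# Assumed, illustrative endpoints -- check the ImmPort API documentation.
AUTH_URL = "https://auth.immport.org/auth/token"        # assumed token endpoint
QUERY_URL = "https://api.immport.org/data/query/study"  # assumed query endpoint

def build_token_request(username, password):
    """Prepare (not send) an authentication request for a bearer token."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        AUTH_URL, data=body,
        headers={"Content-Type": "application/json"}, method="POST")

def build_study_request(token, study_accession):
    """Prepare (not send) an authenticated study-metadata query."""
    return urllib.request.Request(
        f"{QUERY_URL}?studyAccession={study_accession}",
        headers={"Authorization": f"bearer {token}"}, method="GET")

# Sending would be: urllib.request.urlopen(build_token_request(user, pw))
# "SDY144" below is just an example accession-style string.
req = build_study_request("TOKEN", "SDY144")
```

The two-step pattern (authenticate for a token, then query with it) is common to many data-repository APIs, whatever the exact paths turn out to be.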

Sanchita Bhattacharya:

And I just want to leave you with some thoughts on the opportunities and challenges in democratizing [inaudible 00:48:33] research datasets. There are a lot of datasets sitting in silos, and there's a need. This is a call for the research community: let's come together and democratize the clinical research datasets, which are sitting as little datasets in different repositories, and there are different tools, but how do we improve accessibility, discoverability, and interoperability?

Sanchita Bhattacharya:

I think I'll end here with the take-home message. I wanted to give you a flavor of a holistic approach to analyzing clinical research data. Open-access immunological studies are a valuable resource for evaluating new in silico hypotheses and gaining novel insights. The 10,000 Immunomes Project is one of them: it's a framework for growing a diverse human immunology reference, which allows us to learn from the features and candidates that we already know and enables us to explore new factors to be discovered. The deep convolutional neural network model I've shown you helps you diagnose latent CMV infection in healthy individuals. And there's a vast landscape of clinical flow cytometry data and disease that could be explored through this app to generate new hypotheses. So I would like to end with this: please embrace open-access datasets. There's a lot there that can give you novel insights, and hopefully I was able to convince you.

Sanchita Bhattacharya:

And last but not least, the acknowledgments. Dr. Atul Butte was the PI for ImmPort, and he's also director of the Bakar Computational Health Sciences Institute. I would like to thank the data providers; without them, we couldn't have done this. And then we work with Northrop Grumman Health Solutions, who are the data curators; they maintain the database. This is a contract through NIAID, the National Institute of Allergy and Infectious Diseases, so we are always in touch with NIAID program managers for all these projects that we launch. If you have any questions, please contact the help desk at ImmPort.

Sanchita Bhattacharya:

And if you are interested in learning more about big data in immunology, I offer a workshop every year at FOCIS, and it's coming up this year on June 28th. So if you want to attend a hands-on session on how to retrieve datasets from ImmPort or mine ImmPort data, feel free to join. And with that, I would like to say thank you, and I'm happy to take questions.

Lisa Scimemi:

So we have several questions that have come in. We'll start with: can you discuss the COVID-19 database, if possible? The strengths and weaknesses of the public portal data that is available?

Sanchita Bhattacharya:

Sure. So we have about five datasets from COVID-19, and we are still building. The strength, I would say, is the raw-level, subject-level data that we are curating. As I have shown you, there is so much wealth of information available when you are looking at raw, subject-level data, where you can do a lot of these baseline and personalized analyses. So I would say that is a strength: it's raw data. We also have some summarized data, but raw data helps if you have your own questions and would like to start from scratch.

Sanchita Bhattacharya:

In terms of weaknesses… Well, I wouldn't say a weakness, but it takes time to curate such data, because as you're getting it from the data submitters, data standardization and data harmonization take some time. So it is a little bit time consuming, and we are not able to share the data right away; it takes time when you are trying to integrate data coming from different sources. I would like to give credit to the data curators, who are standardizing the data, while I am more on the receiving end, as in all these projects I was showing you, where we were interpreting the data. But data standardization and data harmonization are the key to all these data mining projects.

Lisa Scimemi:

The next question is how can researchers be incentivized to collect data in a manner that supports repurposing down the road?

Sanchita Bhattacharya:

Right, so we are getting a lot of these questions, and yes, definitely. As you've seen in my acknowledgment slide, I always acknowledge the data providers, because it's so important: if they don't share the data, all of this, as I said, will sit in a silo somewhere. One way we are talking about is that these are NIH-funded programs, so NIH could incentivize sharing: if someone is sharing data, maybe their proposals would somehow be expedited, or data sharing would be one of the things taken into consideration when proposals are scored. We also incentivize by always acknowledging the data providers when we are writing papers, because they are the ones who were at the front line. They came up with the papers, they generated the data, and then we reuse the data. So we should always be thankful to them, and also sometimes offer authorship; we have included the primary investigators as co-authors in some of our papers, if that answers the question.

Lisa Scimemi:

The next question is: what portion of the data in these libraries originates from industry-sponsored clinical trials? Thoughts on incentivizing sharing of data by pharmaceutical companies from such trials, at least from the placebo arms and discontinued programs?

Sanchita Bhattacharya:

Right, so we are not directly taking data from pharma; these are primarily NIAID-funded, sponsored trials. But some of them, I've seen, are partnered with industry, so you will see some trials that are partnerships between academia and industry, and those trials go to ImmPort. If you are looking for solely industry-sponsored trials, there are other websites like Vivli, which shares clinical trials from pharma. So I would say the trials in ImmPort are mostly industry-academia collaborations.

Lisa Scimemi:

And so the last question: how does big data affect clinical trials sample management and data reconciliation?

Sanchita Bhattacharya:

I don't think I'm the right person to answer that question, because I'm more on the receiving end, looking at data that has already been curated. That is really more of a management question. We work with CROs: ImmPort works with CROs, and Rho is one of them, among many others, so we get the data directly from them. When the clinical trial sponsors agree, the data goes to Rho, and then we just get it from Rho. So I won't be able to answer that question, because it's more that management, the sponsors basically, decide on the samples.

Lisa Scimemi:

Great, thank you. That is the last question.

John Fremer:

Thank you, Sanchita. And thank you for joining Sanguine for our S3 webinar on Big Data in Clinical Research – Sharing, Dissemination, and Repurposing. For a list of upcoming webinars or to request patient samples, visit sanguinebio.com. Thank you again and enjoy the rest of your day.