1 Biosensor National Special Lab, Key Lab for Biomedical Engineering of Ministry of Education, Department of Biomedical Engineering, Zhejiang University, Hangzhou, China
2 Zhejiang Sir Run Run Shaw Hospital, Department of Medicine, Zhejiang University, Hangzhou, China
Corresponding author details:
Hao Wan, Ping Wang
Department of Biomedical Engineering Yuquan Campus
Zhejiang University
Hangzhou, China
Copyright:
© 2019 Wu Q, et al. This is
an open-access article distributed under
the terms of the Creative Commons
Attribution 4.0 international License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the
original author and source are credited.
Purpose: Lung cancer (LC) is a leading cause of cancer-related morbidity and mortality globally. Exhaled volatile organic compounds (VOCs) are considered promising biomarkers for LC diagnosis. However, the diagnostic accuracy of VOCs is limited by interference from benign pulmonary diseases. This study aims to establish an improved LC diagnosis model that accurately distinguishes LC patients from patients with benign pulmonary diseases and from healthy controls.
Methods: Exhaled breath samples were analyzed by thermal desorption gas chromatography-mass spectrometry (TD-GCMS), and the discriminating power of each VOC was evaluated by its ROC curve. The model was optimized by adding related variables and by using random forests. To explore the biological relationship between the selected VOCs and lung cancer, transcriptome data were analyzed with edgeR and DAVID.
Results: The VOC-based diagnosis model was optimized by adding variables. To facilitate sensor measurement, a five-variable model with high accuracy was established. Transcriptome analysis identified lung cancer related metabolic pathways, some of which were consistent with the metabolic processes that generate VOCs in vivo.
Conclusions: Our improved model can accurately discriminate LC patients from other patients and healthy people, which provides a promising approach for lung cancer diagnosis.
Lung cancer; VOCs biomarker; CT data; TD-GCMS; Model optimization; Transcriptome analysis
Lung cancer (LC) remains a leading cause of cancer-related morbidity and mortality worldwide [1]. When LC is diagnosed early, patients can be cured by surgery and subsequent chemotherapy. A convenient diagnostic method is therefore in great demand, and developing a reliable approach for early diagnosis is a key clinical requirement [2]. Volatile organic compounds (VOCs) exhaled by humans carry a great deal of information about the condition of the body. In recent decades, an increasing number of studies on detecting and analyzing exhaled VOCs have emerged [3,4], and they regard VOCs as promising biomarkers for non-invasive and convenient LC screening [5]. However, many study cohorts contain only LC patients and healthy people [6]. This differs from the clinical reality, in which many patients have pulmonary non-malignant diseases (PNMD). Furthermore, VOC detection is limited by poor accuracy and by VOC results that vary with the conditions and methods used in different studies [7,8]. To increase the reliability of VOCs in lung cancer diagnosis, enlarging the sample size and optimizing the diagnosis model are two effective approaches. Zou et al. [9] optimized the diagnosis model by establishing a training cohort and an independent validation cohort to exclude the interference of PNMD. However, substantial improvements are hard to achieve by optimizing at the level of VOCs alone, so other related variables should be introduced into the diagnosis model. Besides, with a large number of VOCs as biomarkers, it is difficult to develop a highly precise sensor that can be applied in preliminary clinical LC screening.
On the other hand, because biological evidence linking VOCs to LC is lacking, the VOC approach to LC screening is still hard to validate. Transcriptome analysis, which can explain the source of specific VOCs, may solve this problem. The genetic and genomic changes in cancer patients have been broadly recognized [10], and these changes can be conveniently captured by transcriptome analysis. Recently, an increasing number of studies have examined the LC transcriptome data from The Cancer Genome Atlas (TCGA, https://gdc-portal.nci.nih.gov/). Fidler et al. [11] evaluated transcriptome data from TCGA against serum specimens from lung cancer patients and found some protein biomarkers associated with inferior patient survival. Since the biological process of cancer is highly complicated, a genomic or proteomic perspective alone is inadequate to explain it. Studying pathway alterations in cancer is critical to understanding the disease [12,13]. Lin et al. [14] identified specific pathways, such as liver X receptor activation, which possibly indicate important differences in cancer cell metabolism. However, to our knowledge, no study combining VOCs and transcriptome analysis has been reported.
In this study, a large number of exhaled breath samples were
analyzed to select a set of VOCs for LC diagnosis. An LC
diagnosis model was established based on VOCs and optimized by
adding CT data and other variables using machine learning. Finally,
transcriptome analysis and subsequent pathway analysis were
employed to explain the biological metabolism of the obtained VOCs.
The overall flowchart of our research is shown in Figure 1, which
presents the VOC analysis procedure as well as the establishment,
optimization and verification of the LC diagnosis model. The details
are discussed in the following sections.
Figure 1: The integral flow chart of our research. A lung cancer
diagnosis model was established together with optimization and
verification.
From 1st January 2009 to 31st December 2016, we continually
collected and analyzed exhaled breath samples of 197 LC patients and
70 PNMD patients from Sir Run Run Shaw Hospital, Hangzhou, China,
and 178 healthy control samples were collected from the Department of
Biomedical Engineering, Zhejiang University and Sir Run Run Shaw
Hospital (part of the data was reported in our previous study [9]). LC
patients were diagnosed on the basis of MRI or CT characteristics [15]
and were confirmed by histology or pathology [16]. Informed consent
was obtained from every subject. Approval was obtained from the
institutional ethics review committee of Sir Run Run Shaw Hospital,
Hangzhou, China (No. 20070525 and ChiCTR-DCD-15007106). The
gender, age, smoking status and CT data of the volunteers are
summarized in Table 1. Classifications such as smoking
status and CT data are introduced into the diagnosis model in the
optimization process.
Table 1: Statistical information regarding all volunteers
Before breath sample collection, volunteers were asked to refrain from eating, drinking, and smoking for 12 hours and to stay in a ventilated room for 30 minutes. To standardize sample collection, every subject breathed normally through a disposable mouthpiece, and room air was collected at the same time as background. The VOCs in the collected gas samples were concentrated on Tenax TA sorption tubes (Sigma-Aldrich, St. Louis, MO, USA) and analyzed immediately without storage.
A thermal desorption device (TurboMatrix 300TD, PerkinElmer, USA) was used to release the VOCs from the Tenax tubes. With the aid of a carrier gas, the released VOCs were transferred to the gas chromatography-mass spectrometry (GCMS) system for separation and qualitative detection. GC procedure: the column was initially heated to 40℃ and held for 1 min, then heated to 250℃ at 5℃/min and held for 2 min. MS conditions: the temperatures of the interface and the ion source were 250℃ and 200℃, respectively. The scan range was 45 to 500 m/z in scan mode. The solvent cut-off time was 0.4 min.
Chromatographic peaks with a slope greater than 500/min and a
peak area higher than 3000 were selected from the GCMS chromatogram.
Using the mass spectrometry libraries (NIST 05 and NIST 05s) [4],
the corresponding VOCs with similarity higher than 90% were
selected for further analysis. The retention time, Chemical
Abstracts Service (CAS) number and peak area of each substance were
extracted from the raw data. To eliminate the interference of the air
background, the background air response was subtracted from the peak
area of each substance. Substances present in fewer than 50% of the
samples were discarded to guarantee valid VOCs for analysis.
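The screening rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Peak` record, the function names, and the choice to clip negative background-corrected areas at zero are hypothetical and are not taken from the original analysis software.

```python
from dataclasses import dataclass

@dataclass
class Peak:
    cas: str                      # CAS number of the library match
    slope: float                  # peak slope, per minute
    area: float                   # raw peak area in the breath sample
    similarity: float             # library match similarity, percent
    background_area: float = 0.0  # area of the same compound in room air

# Thresholds quoted in the text.
SLOPE_MIN = 500.0
AREA_MIN = 3000.0
SIMILARITY_MIN = 90.0
MIN_OCCURRENCE = 0.5

def screen_sample(peaks):
    """Return {cas: background-corrected area} for peaks passing all thresholds."""
    kept = {}
    for p in peaks:
        if p.slope > SLOPE_MIN and p.area > AREA_MIN and p.similarity > SIMILARITY_MIN:
            # Assumption: negative corrected areas are clipped to zero.
            kept[p.cas] = max(p.area - p.background_area, 0.0)
    return kept

def ubiquitous_vocs(samples):
    """Keep compounds found in at least MIN_OCCURRENCE of the screened samples."""
    counts = {}
    for sample in samples:
        for cas in sample:
            counts[cas] = counts.get(cas, 0) + 1
    n = len(samples)
    return {cas for cas, c in counts.items() if c / n >= MIN_OCCURRENCE}
```

Each breath sample is screened on its own with `screen_sample`, and `ubiquitous_vocs` is then applied to the list of screened samples to obtain the compounds carried forward.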
The receiver operating characteristic (ROC) curve is defined as a
plot of test sensitivity as the y coordinate versus 1-specificity as the
x coordinate [17]. The ROC curve is an effective method of evaluating the
performance of a qualitative diagnosis, namely a binary result which is
either positive or negative for the disease. We evaluated the
discriminating power of each VOC by the area under the ROC
curve (AUC), which can take any value between 0 and 1; an AUC
of 1 indicates that the VOC separates the two groups perfectly.
Furthermore, a t-test was employed to determine whether
a VOC differs significantly between LC patients and
other people. For dichotomous problems, logistic regression assesses
the likelihood of falling into one of the two outcome categories
based on a set of predictors [18]. Herein, binary logistic regression
analysis was applied to establish a model for LC diagnosis
based on the VOCs selected in the preceding analysis. Random forests
(RF), a powerful machine learning model with high accuracy in
statistical classification, were used to optimize the diagnosis model [19].
Moreover, RF can rank variable
importance by mean decrease accuracy (MDA), which can be used
to reduce the number of variables for convenient sensing in clinical
applications without sacrificing the accuracy of the model.
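As a minimal sketch of the ROC evaluation described above, the AUC of a single VOC can be computed directly from the peak areas of the two groups. The function names are hypothetical, and the pairwise formulation (the probability that a randomly chosen LC sample scores higher than a randomly chosen non-LC sample, with ties counted as one half) is a standard equivalent of the area under the empirical ROC curve rather than the specific software used in the study.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def roc_points(scores_pos, scores_neg):
    """(1 - specificity, sensitivity) pairs, from the strictest threshold to the loosest."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        sensitivity = sum(s >= t for s in scores_pos) / len(scores_pos)
        false_positive_rate = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((false_positive_rate, sensitivity))
    return points
```

A VOC whose AUC is well below 0.5 is still informative, since it is simply lower rather than higher in the LC group.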
In order to explore the biological source of VOCs in LC patients, we obtained RNA sequencing data for LC patients and healthy controls from TCGA. To obtain more accurate differentially expressed gene (DEG) results, more healthy control cases are required for comparison. Hence, we downloaded the RNA-Seq gene expression data of all living cases available in the TCGA database.
The edgeR package for the R programming language was used for transcriptome data analysis to identify DEGs between cancer samples and healthy samples [20]. This classical method is based on the negative binomial distribution and includes empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. We used the false discovery rate (FDR), computed by the Benjamini and Hochberg (BH) method, to define the DEG threshold. Genes with an FDR less than 0.001 and a mean gene expression fold-change larger than 4 were identified as DEGs. For gene enrichment analysis, DAVID 6.8 was used for pathway analysis of gene sets [21]. Pathway analysis provides a comprehensive functional annotation to understand the biological significance behind a large list of genes.
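The DEG thresholds can be illustrated with a short Python sketch of the Benjamini and Hochberg adjustment followed by the two cutoffs. It is not the edgeR workflow itself: the function names are hypothetical, the p-values and log2 fold-changes are assumed to come from an upstream test, and the fold-change cutoff of 4 is assumed to apply in both directions (|log2FC| > 2), which matches the fact that both up-regulated and down-regulated genes are reported.

```python
import math

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (FDR) in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_degs(genes, pvalues, log2_fold_changes, fdr_max=0.001, fold_min=4.0):
    """Genes with FDR < fdr_max and fold-change above fold_min in either direction."""
    fdr = benjamini_hochberg(pvalues)
    cutoff = math.log2(fold_min)  # fold-change 4 corresponds to |log2FC| > 2
    return [g for g, q, lfc in zip(genes, fdr, log2_fold_changes)
            if q < fdr_max and abs(lfc) > cutoff]
```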
VOCs selection
Using the aforementioned screening method, the VOC profile of every participant was obtained, and the VOC counts are shown in Table 2. On average, no significant difference in the number of VOCs was observed among the LC, PNMD and healthy control groups. In order to discriminate LC patients, we divided all samples into an LC group and a non-LC group (including PNMD patients and healthy controls). After eliminating VOCs with low occurrence frequency, we obtained 174 ubiquitous VOCs and plotted an ROC curve for each of them. Then 70 VOCs with p<0.05 were selected for the t-test, and the 31 VOCs that differed significantly between the two groups were selected (Table S1). To evaluate the discriminating performance of the selected biomarkers, we used the 31 VOCs as independent variables to establish the diagnosis model by binary logistic regression analysis; the ROC curve is shown in Figure 2a. The dataset was randomly divided into a training set (70% of the data) and a validation set (30% of the data). With the established model, the sensitivity, specificity, and overall accuracy were 80.3%, 71.0% and 75.1%, respectively. Table 3 shows the discrimination results of the model for LC, PNMD and healthy controls. The established model has high accuracy in distinguishing LC from healthy controls. However, the model is unable to effectively discriminate between LC and PNMD because of the very low specificity, which is in accordance with the previous study [9]. Owing to the interference of PNMD patients, the diagnosis model should be optimized further for accurate LC diagnosis.
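For readers who want to reproduce this kind of evaluation, the random 70/30 split and the three reported metrics can be sketched as follows. The function names and the fixed seed are hypothetical, labels are assumed to be coded 1 for LC and 0 for non-LC, and the predictions would come from whatever classifier is being assessed.

```python
import random

def split_indices(n, train_fraction=0.7, seed=0):
    """Randomly split sample indices into a training set and a validation set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity and overall accuracy for labels coded 1 (LC) / 0 (non-LC)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),  # fraction of LC cases detected
        "specificity": tn / (tn + fp),  # fraction of non-LC cases cleared
        "accuracy": (tp + tn) / len(pairs),
    }
```

Because overall accuracy is a weighted average of sensitivity and specificity, it depends on the proportion of LC and non-LC samples in the evaluated set.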
Diagnosis model optimization
Since the established diagnosis model is unable to effectively discriminate LC patients from PNMD patients, more variables that explicitly relate to LC should be introduced into the model to increase its discrimination capability. CT scanning is commonly used in clinical practice to detect changes in the lung parenchyma, and it was reported that the diagnostic accuracy of PET-CT for lung cancer was 93.5%, with a false positive rate of 6.5% [22] (the ROC curve of the CT-only model is shown in Figure 2b; its sensitivity, specificity, and overall accuracy were 85.8%, 91.1% and 88.8%, respectively). In addition, LC is closely associated with the gender [23], age [24] and smoking status [25] of patients. Therefore, we established a new 35-variable diagnosis model by combining CT, gender, age and smoking status with the previous 31 VOCs through binary logistic regression analysis. Consequently, the sensitivity, specificity, and overall accuracy were significantly improved to 89.3%, 89.5% and 89.4%, respectively. The ROC curve of the optimized LC diagnosis model is shown in Figure 2c, and the AUC was 0.957, which means this model has excellent power for discriminating the LC group from the non-LC group.
Developing a sensor for monitoring exhaled breath that can replace GCMS is of great clinical significance. However, simultaneous detection of multiple substances is a big challenge for sensor development and application. Hence, reducing the number of input variables while maintaining high accuracy was imperative in the model optimization. The previous 35 variables were used to establish a random forests LC diagnosis model with 10 nodes and 600 decision trees. Then, the top five variables, selected with the threshold MDA>10 (Table 4), were used to re-establish the diagnosis model. The ROC curve is shown in Figure 2d, and the sensitivity, specificity, and overall accuracy of the model for LC diagnosis were 88.7%, 90.1% and 89.4%, respectively. Thus, the five-variable LC diagnosis model established by RF has accuracy similar to that of the 35-variable model. The final optimized model for LC diagnosis is therefore based on three selected VOCs, CT data and age. The greatly reduced number of variables in this model enables convenient sensor development for clinical LC screening.
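The idea behind mean decrease accuracy can be illustrated with a permutation sketch: shuffle one variable at a time and record how much the accuracy drops. This is an assumption-laden illustration and not the random forest implementation used for Table 4. The function names are hypothetical, `predict` stands for any fitted classifier, and the values are plain accuracy differences on the supplied data, so their scale is not comparable with the MDA>10 threshold quoted above, which comes from the random forest software itself.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the classifier returns the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def mean_decrease_accuracy(predict, rows, labels, n_repeats=20, seed=0):
    """Average accuracy drop when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild every row with only column j replaced by its shuffled value.
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def select_variables(names, importances, threshold):
    """Names of the variables whose importance exceeds the threshold."""
    return [n for n, imp in zip(names, importances) if imp > threshold]
```

Variables that the classifier never relies on show an importance of zero, which is why a threshold on importance can shrink the variable set without changing the predictions much.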
In Table 5, we summarize the discrimination accuracy of the three lung cancer diagnosis models for LC vs. NLC. First, the 31-variable model performs worst because of the interference of PNMD patients. Second, the 35-variable model, which adds LC-relevant variables, has excellent power for discriminating the LC group from the non-LC group. Finally, the 5-variable model has accuracy similar to that of the 35-variable model.
The source of VOCs in vivo
VOCs detected in exhaled breath are generated by in vivo metabolism. Because VOCs are poorly soluble in blood, they pass into exhaled breath through gas exchange in the lungs, where they can be measured. In Table 6, the three selected VOCs are divided into two groups, and their main biological reaction processes and related enzymes are presented.
Alkylbenzenes are considered to be produced by exogenous influences, including smoking, alcohol and pollution. These exogenous substances leak into the cytoplasm and cause peroxidative damage to proteins, fatty acids and DNA. Most LC patients have a long smoking history, and toluene is increased in the breath of smoking patients compared with nonsmokers [26]. The defense mechanisms of the human body can eliminate exogenous substances through the cytochrome p450, glutathione S-transferase, sulfotransferase, and N-acetyltransferase enzyme systems [27].
Alkanes are generated by the oxidative stress response of polyunsaturated fatty acids (PUFA) in cellular membranes. This process proceeds by a free radical chain reaction mechanism. In LC patients, cytochrome p450 enzymes, which catalyze the oxidation of organic substances, are involved in the emission of VOCs via hydroxylation [28]. Pentane and ethane have been widely used as sensitive and noninvasive indicators of lipid peroxidation in vivo [29].
Identification of DEGs in LC and enrichment analysis
To explore the changes in metabolism between LC patients and healthy controls, we compared gene expression levels between 347 LC samples and 22 normal samples from TCGA. Statistical analysis in R with edgeR (the R code can be found in the supplementary materials) yielded 1159 DEGs. We set strict cutoffs, a fold-change larger than 4 and an FDR less than 0.001, to obtain highly significant genes that distinguish the two groups. Table S2 lists the genes whose expression is significantly higher or lower in LC patients than in normal samples. Up-regulated genes were about twenty times as numerous as down-regulated genes (1106 and 53, respectively). Figure S1 presents the volcano plot, which shows the distribution of logFC (fold-change) and FDR for all genes, with up-regulated genes in red and down-regulated genes in green.
To understand the roles that the DEGs play in the human body, we performed KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and gene ontology (GO) term enrichment analysis using DAVID to identify the biological processes in which the DEGs are enriched. The ten top-ranked KEGG pathways were closely associated with lung cancer, especially alcoholism, nicotine addiction, and metabolism of xenobiotics by cytochrome P450 (Table S3). For the GO analysis, the 30 significant terms are categorized in Figure 3. These GO terms were significantly involved in general cancer-related processes such as DNA replication and repair, phosphorylation, and immune response.
Verification of VOCs through biological evidence
Various genes encode the key enzymes presented in Table 6, and the genes involved in the pathways that yield specific VOCs were all differentially expressed in lung cancer samples (Table S2). For example, SULT4A1, GSTA8P, GSTA9P, NAA11 and NAALADL2-AS2 are associated with sulfotransferases, glutathione S-transferases and N-acetyltransferases, which play important roles in the production of alkylbenzenes. These genes were all significantly up-regulated, and the logCPM values were 4.6, 6.0, 5.6, 6.0 and 5.2, respectively. This suggests that the body's defense pathway is more active in LC patients, and that excess products are released into the extracellular environment and reach the lungs through the blood circulation. This may explain why the exhaled breath of LC patients contains more alkylbenzenes such as 3-ethyltoluene and 1,2,3-trimethylbenzene.
Cytochrome p450 is an enzyme superfamily that includes various protein isoforms and uses various molecules as substrates in enzymatic reactions. These enzymes are encoded by a variety of genes such as CYP1A2, CYP24A1, CYP26A1, CYP4F11, CYP4F3 and CYP2AB1P. According to our analysis, CYP1A2 was down-regulated (logCPM=-4.4) and the other five genes were up-regulated. This implies that the enzyme family catalyzes different kinds of reactions involved in LC. Cytochrome p450 oxidizes a variety of structurally unrelated compounds, including steroids, fatty acids, and xenobiotics. For instance, CYP26A1 plays a key role in retinoic acid metabolism [30], CYP4F11 plays a key role in vitamin K catabolism by mediating omega-hydroxylation of vitamin K1 and K2 [31], and CYP4F3 catalyzes the omega-hydroxylation of leukotriene-B4 [32]. Importantly, alkanes and alkylbenzenes are metabolites of these reactions. When the CYP gene family is highly up-regulated, a large amount of VOCs such as methylcyclohexane would presumably be generated in vivo. As for the down-regulated gene, CYP1A2 is involved in the metabolism of aflatoxin B1 and acetaminophen [33]. Therefore, when CYP1A2 is down-regulated in vivo, the related metabolism is likely to be slowed. In other words, aflatoxin B1 and acetaminophen would accumulate in the body or be diverted to other metabolic pathways, which may generate VOCs such as 3-ethyltoluene and 1,2,3-trimethylbenzene, as all of these compounds contain a benzene ring. Consequently, alkylbenzene VOCs can be detected in the exhaled breath.
Figure 2: The ROC curves. (a) 31-variable (Table S1) model, (b)
CT data only model, (c) 35-variable (Table S1 plus CT data, gender,
age and smoking status) model and (d) 5-variable (Table 4) model.
The AUCs are 0.824, 0.911, 0.957, and 0.936, respectively.
Figure 3: List of gene ontology terms analyzed by DAVID. The terms are ordered by fold-change, which represents the statistical
weight of DEGs in LC, and the black columns indicate the number of genes involved in each GO term.
Figure S1: The volcano plot of all DEGs. This type of graph
relates fold-change to FDR. Each point represents one gene. Red and
green points denote up-regulated and down-regulated genes,
respectively.
Table 2: VOC numbers in exhaled breath of LC, PNMD and healthy
control groups
Table 4: Top five variables with MDA>10 in RF model
1 These 31 variables are shown in Table S1
2 These 35 variables are those in Table S1 plus gender, age, smoking status and CT data
3 These 5 variables are shown in Table 4
Table 5: The discrimination accuracy of different lung cancer diagnosis models in LC vs. NLC.
Table 6: Pathways which generate the selected VOCs and their relevant enzymes
Table S1: VOCs which had significant difference between LC group and Non-LC group
Table S2: The down-regulated genes in lung cancer
1 Relevant genes are represented by Entrez gene ID
Table S3: Dysregulated pathways in LC patients and their relevant DEGs
In conclusion, our work selected a set of VOCs from exhaled
breath samples through GCMS. We established an LC
diagnosis model based on 31 selected VOCs and then improved its
sensitivity, specificity, and overall accuracy significantly to 89.3%,
89.5% and 89.4%, respectively, by adding the LC-relevant variables
of gender, age, smoking status and CT data, so that the model can distinguish
LC patients from non-LC people (including PNMD patients and healthy
controls). In order to develop a sensor that can be widely used in
clinical practice, we established an optimized five-variable model with
similar accuracy. Furthermore, to identify the source of the VOCs, LC
related metabolic pathways were obtained, and some pathways
were consistent with the biological processes that generate VOCs in
vivo. Overall, we established two optimized LC diagnosis models and
illuminated the relationship between LC and VOCs.
This research was supported by projects of the Natural Science
Foundation of China (No. 31571004, 31627801). We thank all the
volunteers who took part in this study.
The authors declare they have no conflict of interest.
All procedures performed in studies involving human
participants were in accordance with the ethical standards of the
institutional and/or national research committee and with the 1964
Helsinki Declaration and its later amendments or comparable ethical
standards. Ethical approval for the collection of human exhaled breath
was obtained (the institutional ethics review committee of Sir Run
Run Shaw Hospital, No. 20070525 and ChiCTR-DCD-15007106).
Informed consent was obtained from all individual participants
included in the study.
Copyright © 2020 Boffin Access Limited.