jonsv89/Disease_PERCEPTION
## @@ @@ @ @ @@ @@ ##
## Readme document ##
## @@ @@ @ @ @@ @@ ##

In this study, we downloaded all the human microarray samples available at Gene Expression Omnibus and ArrayExpress that were analyzed on the same platform (HG-U133Plus2). We considered only studies that analyze at least one disease and provide both case and control samples. Cell lines and treated samples were not taken into account. In summary, we analyzed 6,284 cases and 3,887 controls, amounting to 67.2 Gb of information.

Since all the data are publicly available in the above-mentioned databases, we provide here the following documents:

- "Raw_data.txt": shows the way the raw_data is organized
- "Raw_data_directories.txt": provides the complete path of the raw_data
- "New_Raw_data.txt": shows the way the new raw_data is organized
- "New_Raw_data_directories.txt": provides the complete path of the new raw_data

Together with the raw data, the following documents are also needed:

- "Data/Remove_Symbols_u133plus2.Rdata" --> Sánchez-Valle et al. 2017
- "Data/Gene_Symbols_u133plus2.Rdata" --> Sánchez-Valle et al. 2017
- "Epidemiological_networks/Disease_pairs.csv" --> Jensen et al. 2014 NatComm
- "Epidemiological_networks/PDN_3_digits.net" --> Hidalgo et al. 2009 PLoS CompBio
- "Patients_information_table.txt" --> Manually generated. Cluster information is added once the subgroups are generated.
- "Data/BIOGRID-ALL-3.4.164.tab2.txt" --> This file is too big to be included here and should be downloaded from https://downloads.thebiogrid.org/BioGRID/Release-Archive/BIOGRID-3.4.164/ (file: BIOGRID-ALL-3.4.164.tab2.zip)

## Scripts to be run ##
## @@ @@ @@ @@ @@ @@ ##

1-Generate_patient_patient_similarity_network.R

This script reads the CEL files (raw data) and generates normalized expression matrices for each disease in each study.
Differential expression analyses are conducted using limma, comparing both all the cases against all the control samples (for each study separately) and each single case against all the controls from the same study (generating patient-specific differential expression profiles).

Selecting different numbers of genes as up- and down-regulated (100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000), the script calculates the number of intra-disease interactions between patients using FDR<=0.05 as the threshold for Fisher's exact test, generating Supplementary Figure 8.

Then, 1000 "fake patients" are generated by randomly selecting 500 genes as up- and down-regulated. Patient-patient similarity networks are generated applying different thresholds. This process is repeated 100 times, recording the number of interactions identified with each of the thresholds. The threshold that yields 0 interactions between fake patients is used as the threshold for generating the real patient-patient similarity network. Supplementary Figure 10 is generated in this step.

Using the top 500 up- and down-regulated genes, disease, ICD9 and ICD10 similarity networks are generated (calculating relative risks based on patient-patient connections) with each of the tested thresholds. The overlap between each "molecular-based" similarity network and the networks generated from electronic health records is then calculated.

2-Generate_subgroups.R

This script extracts subgroups of patients for each disease separately. It calculates the number of genes deregulated in the same direction in all the patients composing the subgroup, and compares this number of "commonly deregulated genes" with the number obtained when patients from the same disease are randomly assigned to a subgroup of the same size (this process is repeated 1000 times).
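The random-assignment comparison described for 2-Generate_subgroups.R can be sketched as follows. This is a minimal illustration in Python, not the repository's R code: the profile representation (a dict mapping each deregulated gene to +1 or -1), the function names `common_genes` and `subgroup_p_value`, and the toy patients are all assumptions made for the example.

```python
import random

def common_genes(profiles):
    """Count genes deregulated in the same direction in every profile.

    Each profile is a dict mapping gene -> +1 (up) or -1 (down);
    genes that are not deregulated are simply absent.
    """
    first, rest = profiles[0], profiles[1:]
    return sum(
        1 for gene, sign in first.items()
        if all(p.get(gene) == sign for p in rest)
    )

def subgroup_p_value(disease_profiles, subgroup_ids, n_perm=1000, seed=0):
    """Empirical p-value: how often a random same-size group of patients
    from the same disease shares at least as many genes as the subgroup."""
    rng = random.Random(seed)
    observed = common_genes([disease_profiles[i] for i in subgroup_ids])
    ids = sorted(disease_profiles)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(ids, len(subgroup_ids))
        if common_genes([disease_profiles[i] for i in sample]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy disease with six hypothetical patients; p1-p3 form the candidate subgroup.
toy_profiles = {
    "p1": {"A": 1, "B": -1, "C": 1},
    "p2": {"A": 1, "B": -1, "D": -1},
    "p3": {"A": 1, "B": -1, "E": 1},
    "p4": {"A": -1, "F": 1},
    "p5": {"B": 1, "G": -1},
    "p6": {"H": 1, "C": -1},
}
observed, p_value = subgroup_p_value(toy_profiles, ["p1", "p2", "p3"])
# observed == 2 (genes A and B are shared in the same direction)
```

A subgroup would be kept when its empirical p-value falls below a chosen cutoff; the cutoff itself is not stated in this document.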
Only those subgroups with more shared genes than expected by chance are considered for the rest of the analysis.

3-Generate_networks.R

This script generates disease, ICD9 and ICD10 interaction networks, compares the ICD9 and ICD10 networks with the ones generated by Barabasi's and Brunak's groups, and calculates the significance of the overlap. It also generates Supplementary Tables 1 and 2, the overlap between our molecular-based ICD10 interaction network and the disease trajectories generated by Jensen et al. 2014 (represented in Figure 2), and the intra-disease patient-patient similarity network represented in Figure 3.

4-Analyze_networks.R

This script calculates the mean number of subgroups per disease and the mean number of patients per subgroup. It looks for the commonly deregulated genes in Alzheimer's disease and NSCLC subgroups, selects the genes potentially involved in the interactions between subgroups, retrieves the first neighbours of those genes from BioGRID, and conducts enrichment analyses using gprofiler on the expanded list of genes.

"Genes potentially involved in the interaction between subgroups" means the following: if two subgroups are positively connected, we look for the genes that are deregulated in the same direction in all the patients composing both subgroups; in the case of negative interactions, we look for the genes that are up-regulated in all the patients from one subgroup and down-regulated in all the patients from the other subgroup, and vice versa.

The script also calculates:

- the number of clusters composed of patients from different studies
- the number of diseases composed of subgroups composed of patients from different studies
- the percentage of true subgroups composed of patients from different studies
- the percentage of newly detected interactions in the size 4 subgroups scn at the disease level not detected in the DMSN

Finally, it plots the disease-disease interaction network as a heatmap (which will be used to plot Supplementary Figure 1).
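The gene-selection rule described for 4-Analyze_networks.R (same direction for positive interactions, opposite directions for negative ones, followed by a first-neighbour expansion) can be sketched as below. This is an illustrative Python sketch rather than the repository's R implementation: the profile format (dict of gene to +1/-1), the function names, the gene labels and the toy edge list are assumptions, and the edge list merely stands in for interactions that would be read from the BioGRID file.

```python
def shared_direction(profiles, sign):
    """Genes carrying `sign` (+1 up, -1 down) in every patient profile.

    Assumes `profiles` is a non-empty list of dicts gene -> +1/-1.
    """
    gene_sets = [{g for g, s in p.items() if s == sign} for p in profiles]
    return set.intersection(*gene_sets)

def interaction_genes(subgroup_a, subgroup_b, positive):
    """Genes potentially involved in a subgroup-subgroup interaction.

    positive=True : same direction in all patients of both subgroups.
    positive=False: up in all of one subgroup and down in all of the
                    other, checked in both orientations.
    """
    up_a, down_a = shared_direction(subgroup_a, 1), shared_direction(subgroup_a, -1)
    up_b, down_b = shared_direction(subgroup_b, 1), shared_direction(subgroup_b, -1)
    if positive:
        return (up_a & up_b) | (down_a & down_b)
    return (up_a & down_b) | (down_a & up_b)

def first_neighbours(genes, edges):
    """Expand a gene set with its direct interactors in an edge list."""
    expanded = set(genes)
    for a, b in edges:
        if a in genes:
            expanded.add(b)
        if b in genes:
            expanded.add(a)
    return expanded

# Two hypothetical subgroups with opposite deregulation of G1 and G2.
subgroup_x = [{"G1": 1, "G2": -1, "G3": 1}, {"G1": 1, "G2": -1}]
subgroup_y = [{"G1": -1, "G2": 1, "G3": 1}, {"G1": -1, "G2": 1, "G4": -1}]

negative_genes = interaction_genes(subgroup_x, subgroup_y, positive=False)
positive_genes = interaction_genes(subgroup_x, subgroup_y, positive=True)

# Hypothetical interaction pairs used only to show the expansion step.
toy_edges = [("G1", "N1"), ("N2", "G2"), ("N3", "N4")]
expanded = first_neighbours(negative_genes, toy_edges)
```

The expanded gene list is what would then be passed to the enrichment analysis.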
5-Enrichments.R

This script looks for the genes involved in all the subgroup interactions (as done in the previous script, but for all the diseases and not only for Alzheimer's disease and NSCLC), expands the list of genes with their first neighbours in BioGRID, and conducts enrichment analyses.

6-Generate_patient_identifiers.R & 7-Patient_specific_RR.R

These two scripts are used to calculate the relative molecular similarity of each single patient with each of the diseases under study. They return the percentage of patients with Alzheimer's disease that present a significant negative relative molecular similarity with NSCLC, and vice versa. Finally, they are used to detect the drugs that could potentially increase the risk of developing specific secondary diseases in specific patients, as in the case of cyproterone.

8-Intra-vs-Inter_interaction_percentages.R

This script represents the intra- vs. inter-disease/subgroup interaction percentages for all the diseases, neoplasms, and mental disorders and diseases of the nervous system and sense organs.

9-Patient_classification.R

This script classifies each patient into their corresponding disease and subgroup using a leave-one-out approach based on the patients' differential expression profiles.

10-New_patient_classification.R

This script calculates patient-specific differential expression profiles for new patients and classifies them into the most probable disease based on the similarity of their differential expression profiles to those of the original patients.

11-DB_generator_disease_PERCEPTION.R

This script generates all the tables needed for the Disease PERCEPTION portal.
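The leave-one-out idea behind 9-Patient_classification.R, and the new-patient classification in 10-New_patient_classification.R, can be illustrated with the sketch below. It is written in Python for illustration only and is not the repository's R code. The similarity score (same-direction genes minus opposite-direction genes), the rule of picking the disease with the highest mean similarity, the function names and the toy patients are all assumptions; the actual scripts may use a different measure.

```python
from collections import defaultdict

def similarity(p, q):
    """Same-direction genes minus opposite-direction genes (an assumed score).

    Profiles are dicts mapping gene -> +1 (up) or -1 (down).
    """
    shared = p.keys() & q.keys()
    return sum(1 if p[g] == q[g] else -1 for g in shared)

def classify(profile, reference, labels):
    """Return the label whose reference patients are most similar on average."""
    scores = defaultdict(list)
    for pid, ref in reference.items():
        scores[labels[pid]].append(similarity(profile, ref))
    return max(scores, key=lambda d: sum(scores[d]) / len(scores[d]))

def leave_one_out_accuracy(profiles, labels):
    """Classify each patient against all the others and report accuracy."""
    correct = 0
    for pid, profile in profiles.items():
        others = {k: v for k, v in profiles.items() if k != pid}
        correct += classify(profile, others, labels) == labels[pid]
    return correct / len(profiles)

# Four hypothetical patients from two hypothetical diseases.
toy_profiles = {
    "a1": {"G1": 1, "G2": -1},
    "a2": {"G1": 1, "G2": -1, "G3": 1},
    "b1": {"G1": -1, "G2": 1},
    "b2": {"G1": -1, "G2": 1, "G3": 1},
}
toy_labels = {"a1": "disease_A", "a2": "disease_A",
              "b1": "disease_B", "b2": "disease_B"}

accuracy = leave_one_out_accuracy(toy_profiles, toy_labels)

# A new patient is classified against all the original patients.
new_patient = {"G1": -1, "G3": 1}
predicted = classify(new_patient, toy_profiles, toy_labels)
```

The same `classify` function serves both cases: leave-one-out evaluation removes the patient being tested from the reference set, while a new patient is compared against the full set.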