**Readme for Q-HTCP analysis using current (2/17/20) practices**

**![][image1]**

**Figure 1\. Flow diagram for Q-HTCP analysis.** Sections describing steps in current Q-HTCP analysis are indicated numerically.

**\*\*\* WHAT FOLLOWS IS OUT OF DATE. WE ARE CURRENTLY USING A SET OF SHELL SCRIPTS THAT CALL THE Rscripts DESCRIBED BELOW AND PRODUCE THE DIRECTORIES NEEDED TO LOCATE THE INPUT FILES AND TO RECEIVE THE OUTPUTS. SEE GOOGLE DOC “Copy of Multi-experiment Study Analysis System Guide\_jh24\_0718”**

**\*\*\*Highlighted words indicate places in your code/script where you will need to modify the name to match the files, path, or description of your particular experiment\*\*\***

**It may also be possible to use Rprojects to avoid the need to change file paths. The advantage of Rprojects is that the folder containing the files is the top of the directory, so the same directory structure can be used on different computers to read and write files from the Rscripts.**

**Open questions for [seansantos18@gmail.com](mailto:seansantos18@gmail.com):**

**\-Is the description of Rprojects above correct? Could this work on the server just as well as it works from our laptops?**

**\-Is it necessary to use FileZilla to move files around on the server from a Mac (because there is no “middle click” on a Mac), whereas on a PC it is possible to use keyboard shortcuts (Ctrl-C / Ctrl-V)?**

Prior to performing interaction score analysis, use John Rodgers’ Matlab software to generate the \!\!results.txt file. Once the \!\!results.txt file is generated, view the experiment using John Rodgers’ software: open Matlab 2014 and the most current version of EZView. Run the EZView.mat program and click “Add to path” when prompted. Navigate to /media/data/ExpJobs/YourJobName/Results-date/matResults/date\_reg.mat to view your plates.
Look for plate contamination, plate effects (such as much faster or slower growth than normal, areas with no growth, or a gradient from the middle, which likely means the drug was not mixed well), missing image time points, dropped plates, darker images, and curve fits that do not match visual spot growth. When satisfied with the image results and the Matlab analysis, move on to generating the interaction scores.

**1.0 Generate interaction scores for genome wide experiments from a \!\!results.txt**

1) Create a folder on the server for the new experiment. (On the server, for most experiments, I’ve used the path /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**Folder**/.) It is also possible to create the new directory from the command line using “mkdir”. Example: ![][image2]

2) Copy \!\!results.txt (results sheet from Matlab) to the folder (add path to \!\!results.txt files).

3) Copy lm\_interaction\_drug\_all\_V5\_do\_not\_print\_avg\_Z.R (interaction score Rscript) to the folder (this is located in the Q\_HTCP\_Analysis folder).

4) Open the terminal within the /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**Folder**/ path, or use the command line to navigate to your new folder using “cd”. Example: cd /media/data/Santos\_Sean/Q\_HTCP\_Analysis/Folder/ ![][image3]

5) Run the interaction scores Rscript from the command line using the following:

Rscript lm\_interaction\_drug\_all\_V5\_do\_not\_print\_avg\_Z.R \!\!results.txt ExperimentName\_Analysis1/

![][image4]

***Modified scripts** can be used to exclude certain data types.* An example we performed used the JS\_19\_1224 P53 data, where the \!\!results.txt contained both WT and 2KR data.
We wanted to include only the 2KR data and ran the following modified Rscript: ![][image5]

In the above example we used a slightly modified version of the Rscript that selects a subset of the data from \!\!results.txt, using the modification shown below:

![][image6]

In the added line of the above R script (last line, shown as opened in the gedit text editor), we subset our data.frame using the specifics column, keeping only rows where “2KR” is in the specifics column.

The interaction scores Rscript will generate several files as seen below:

![][image7]

The **ZScores\_Interaction.csv** file is the primary output containing the interaction scores and is used as the input file for most of the downstream analysis.

![][image8]

6) After interaction scores are generated, **check the Frequency\_delta\_background.pdf file** in the QC folder to identify data that could possibly be contaminated. The default setting removes any strains where the delta background is 3 SD from the mean, but **if there is no or almost no contamination, you may need to set this threshold higher, to 5 SD or above**. Find the following lines in lm\_interaction\_drug\_all\_V5\_do\_not\_print\_avg\_Z.R and change 3\*sd to 5\*sd or a higher value. Use **Save As** (not just Save) to rename the script to something such as lm\_interaction\_drug\_all\_V5\_do\_not\_print\_avg\_Z\_deltabackground5SD.R.

![][image9]

**1.1 REMc**

**\*\*\* WHAT FOLLOWS IS OUT OF DATE. THE REMc PROCESS IS NOW INCORPORATED IN A JAR FILE AND IN A SHELL SCRIPT THAT ALSO CALLS Rscripts TO ARRANGE DATA. SEE GOOGLE DOC “Copy of Multi-experiment Study Analysis System Guide\_jh24\_0718”**

At this stage, we want to compare multiple experiments to each other to identify phenomic modules of interest. Recursive Expectation Maximization clustering (REMc) can be used to cluster interaction scores and then perform biological enrichment analysis.
\-See doi: [10.1063/1.3455188](https://doi.org/10.1063/1.3455188) for the Chaos paper and Jingyu Guo’s readme.V2 to set up Eclipse.

1) **Merge the files**. This can be done using **R or Excel**; **if you want to use R see appendix 4.1. You will need to use R if the total number of genes is not the same between experiments\!**

Using Excel, open the first **ZScores\_Interaction.csv** file of interest. \*\*\***Sort by OrfRep\*\*\***. **Copy** the **OrfRep, Gene, Z\_Shift\_L, Z\_lm\_L, Z\_Shift\_K, Z\_lm\_K** columns into a new Excel sheet. Rename the **Z\_Shift\_L, Z\_lm\_L, Z\_Shift\_K, Z\_lm\_K** columns so that a descriptive name precedes the current column name, for example **Gem\_Z\_Shift\_L, Gem\_Z\_lm\_L, Gem\_Z\_Shift\_K, Gem\_Z\_lm\_K**.

Open the next **ZScores\_Interaction.csv** file of interest, \*\*\***Sort by OrfRep (don’t forget to do this or the ORFs will not match\!)\*\*\***, and repeat the same steps to add additional experiment columns to the new Excel sheet. Give these columns different descriptive names, for example **Cyt\_Z\_Shift\_L, Cyt\_Z\_lm\_L, Cyt\_Z\_Shift\_K, Cyt\_Z\_lm\_K**. Repeat this process for as many experiments as you want to compare in REMc.

I generally reorganize the Excel sheet at this step so that the columns are in the order I want for the heatmaps later on (i.e. all K values to the left, all L values to the right, alternating Z\_Shift and Z\_lm). I also generally save this file as “Date\_Description\_ShiftInt\_WithNAs.csv” (it can also be saved in Excel (.xlsx) format; the format doesn’t matter yet). Below is an example of how this file looks (using gemcitabine (Gem) and cytarabine (Cyt) as an example).

![][image10]

2) **Substitute the NAs**. For REMc to work correctly and to allow proper heatmap display, we substitute the NA values with small, non-zero values. Open the “Date\_Description\_ShiftInt\_WithNAs.csv” file in Excel and **select only the “Shift columns”**. Perform a **find and replace** using find “NA” and replace with “0.001”.
![][image11]

Next, replace the NAs in the Z\_lm columns using **find and replace** to find “NA” and replace with “0.0001”. Note the extra decimal place here. The heatmap script cannot draw the dendrogram if all the values for a gene across a row are NAs, and it will error out. Thus, we want to be able to identify NAs in one set of columns, but also be able to draw the dendrogram by giving a small value that will cluster away from the rest of the data. As a result, the script only writes “NA” where it finds the 0.001 value (shift columns), and it shows the substituted Z\_lm values in white to indicate NAs.

![][image12]

After substitution, the example looks like the following:

![][image13]

I generally name this file “Date\_Description\_ShiftInt\_NArem.csv”.

3) **Fill empty cells in the gene column by copying over the ORF name.** *Some ORFs do not have assigned gene names.* **Fill in the ‘missing’ genes** (if there are any; this will depend on the \!\!results.txt file). If there are blanks in the gene column, this can cause issues with REMc and heatmap generation. If it is a genome wide experiment (making sure it is sorted by OrfRep), I will often copy and paste over the Gene column from a previous genome wide experiment with the gene names filled in. However, I also have an R script that fills in the missing names and updates gene names (if no name is given, it uses the OrfRep). The Rscript is named **14\_0430\_cmd\_MatchGenes.R**. See the following for usage:

Rscript 14\_0430\_cmd\_MatchGenes.R input\_file.csv SGD\_features.tab output\_file

Arg 1\) input\_file.csv \- This should be your “Date\_Description\_ShiftInt\_NArem.csv” file generated in step 2\. Make sure that OrfRep or ORF is in column 1 and Gene is in column 2\.
Arg 2\) SGD\_features.tab \- Download this from [https://downloads.yeastgenome.org/curation/chromosomal\_feature/](https://downloads.yeastgenome.org/curation/chromosomal\_feature/)

Arg 3\) output\_file \- The name you want to give the file. I generally call this file “Date\_Description\_ShiftInt.csv”.

![][image14]

We will also need to remove all unusual characters in the gene column after adding the gene names (as above, either by copy/paste or by 14\_0430\_cmd\_MatchGenes.R). Open the newly created file (“Date\_Description\_ShiftInt.csv”) in Excel and use the find function to **look for the following characters**: comma (**,**) or apostrophe (**‘**). **Replace the gene name with the OrfRep** in all instances. Also search for **YKL134C**: Excel will automatically change the corresponding gene name (OCT1) to a date (October 1). Change the gene name to YKL134C to prevent this conversion at any step.

4) **Select only the L and K z-scores with an absolute value of 2 or greater for input into REMc** (if you want to cluster all the data you can skip this step, but in general selecting only the values beyond 2 SD tends to clean up the clustering). I generally do this as an “ad-hoc” analysis in R and give example code below. Open RStudio and create a new R script. You can copy and paste the following into R; you will only need to modify the bolded parts (see below):

X \<- read.csv(file="**filePath/Date\_Description\_ShiftInt.csv**",stringsAsFactors \= FALSE)

X \<- X\[abs(X$**DescName1**\_Z\_lm\_K) \>= 2 | abs(X$**DescName1**\_Z\_lm\_L) \>= 2 | abs(X$**DescName2**\_Z\_lm\_K) \>= 2 | abs(X$**DescName2**\_Z\_lm\_L) \>= 2,\]

write.csv(X,file \= "**filepath/Date\_Description\_ShiftInt\_Above2SD.csv**",row.names \= FALSE)

![][image15]

\-Change “**filePath/Date\_Description\_ShiftInt.csv**” to the path and name of your file. Use tab within the quotation marks in RStudio to help locate files.
\-Change “**DescName1**” and “**DescName2**” to whatever descriptive names you gave your headers in Date\_Description\_ShiftInt.csv (e.g. “Gem” and “Cyt”).

\-Replace “**filepath/Date\_Description\_ShiftInt\_Above2SD.csv**” with the path to the file you want to create.

\-Save the above R script with a descriptive name such as “Date\_Get2SD\_for\_ExperimentDescription.R” and run it, so that you can go back and see what you did if needed.

5) **Generate files for REMc** and for later adding the shift values back into the heatmaps (we don’t cluster the shift values, but we use them at the heatmap stage to identify genes where deletion results in a large initial shift, indicating “sick” strains).

\-Open the “Date\_Description\_ShiftInt\_Above2SD.csv” created in step 4 (or “Date\_Description\_ShiftInt.csv” generated in step 3 if step 4 was skipped and you want to use all genes regardless of z-score for clustering) in Excel.

i) **Remove** all of the **columns with shift values** and save the file as “**Date\_Description\_REMcReady.csv**”. This file is used as input for REMc.

ii) Reopen “Date\_Description\_ShiftInt\_Above2SD.csv” and **remove** all the **columns with Z\_lm scores**. Save this file as “Date\_Description\_Shift.csv”. This file will later be used to add the shift values back when generating heatmaps associated with REMc generated clusters.

6) **Perform REMc**. Copy the “**Date\_Description\_REMcReady.csv**” file generated in step 5 to your eclipse-workspace/REMc/ directory. **Open Eclipse** by double clicking on the eclipse icon in the eclipse folder in your home folder. Once in Eclipse, **select “Run” and then** “**Run configurations**” from the top menu.
![][image16]

Next, select the Arguments tab and check that you have the following:

Program arguments: Date\_Description\_REMcReady.csv GeneByGOAttributeMatrix\_nofiltering-2009Dec07.tab ORFs\_w\_DAmP\_list.txt 1 true

VM arguments: \-Xms8000m \-Xmx8000m

![][image17]

Once set, you will only need to alter the first line in program arguments to the name of your **Date\_Description\_REMcReady.csv** file. Select **Apply** and then **Run** in the bottom right corner to perform REMc on the selected file. Check whether any errors are generated after hitting Run (common errors here are due to unexpected characters in the ORF or Gene columns, or non-numeric values in the other columns, such as “NA”). If REMc is running, you should see a file called **Date\_Description\_REMcReady.csv-WholeTree.csv** in the eclipse-workspace/REMc/ directory. This file will be updated every few minutes as new clusters are generated. After completion of REMc, you should have **Date\_Description\_REMcReady.csv-WholeTree.csv**, **Date\_Description\_REMcReady.csv-finalTable.csv**, **Date\_Description\_REMcReady.csv-summary.csv**, and **Date\_Description\_REMcReady.csv.arff**. Copy these files to a new directory with a descriptive name for the REMc results (such as **Date\_Description\_Clustering/**).

**1.1.1 GTF (Gene Ontology Term Finder)**

GTF looks for enriched gene ontology terms in the REMc clusters. Several files are required to run GTF, and they must be copied to the working folder (I usually create a new folder to contain all of the REMc/GTF files).

1\) **Make a new folder, e.g., ‘Clustering’, in the project folder and copy the REMc and GTF files to the new folder**. (Most of Sean’s Clustering results are on Data2/Santos\_Sean/Documents/Hartman\_Lab/ACS\_project/).
Once REMc is finished, as described in the previous section, copy the following files to your working folder: **Date\_ResultsDescription\_REMcReady.csv-finalTable.csv**, **Date\_ResultsDescription\_REMcReady.csv-WholeTree.csv**, and **Date\_ResultsDescription\_REMcReady.csv-summary.csv**. Next, copy to your working folder the files located in the following directory on the server: **/media/data/Santos\_Sean/GTF\_Files/**

![][image18]

If the copy/paste function doesn’t work for you on the server from XQuartz (we’ve had issues with this on Mac OS), use FileZilla to copy the files in **/media/data/Santos\_Sean/GTF\_Files/** to your computer, and then copy them from wherever you saved them on your local computer to your working folder on the server. It’s good practice to make a backup of the files off of the server, but one can skip this step by copying them to the server desktop (which exists on a different drive) and then copying them back to the desired folder (dragging between folders on the same drive moves rather than copies the files).

In your working directory, create folders named “Process”, “Function”, and “Component”; we will perform GTF for each of these ontologies in its own folder. Copy the GTF files into these folders. (The GTF files can be removed from these folders after GTF is complete to save space.) Copy **DconJG.py and AddShiftVals.R** into the parent folder as well (they need to be in the same directory as the **.csv-finalTable.csv** file). In the example of P53\_NoDamps (involving HLD, HLEG, WT and 2KR), there were 5 different comparisons, with Component, Function and Process GTF for each. Also note that the ‘Pairwise\_Comparisons’ and ‘Multiple\_Comparisons’ folders are GTA only.

2\) **Run the DconJG.py script**. GTF requires files to have a specific format; to generate text files for each cluster in this format, we can apply the DconJG.py script.
To run this script, open the command line in your working folder and apply the script in the following format:

**python DconJG.py Date\_Results\_REMcReady.csv-finalTable.csv cluster\_origin\_column\_num output\_path\_name**

\-The cluster\_origin\_column\_num argument tells the script where the header/column containing “cluster\_origin” is located in the \-finalTable.csv file (i.e. is it in the 8th column, 10th column, etc.?).

![][image19]

\-I usually put “./” as the argument for “output\_path\_name”, which makes a folder in my working directory with the same name as my input \-finalTable.csv file.

Copy the resulting folder containing the .txt files for each cluster into the “Process”, “Function” and “Component” directories.

3\) **Perform GTF**. In the Process folder in your working directory, run the “**Process\_example\_v4.sh**” script:

**./Process\_example\_v4.sh ORFs\_w\_DAmP\_list.txt Date\_Results\_REMcReady/\*.txt**

\-If necessary, substitute “ORFs\_w\_DAmP\_list.txt” with the proper background file containing a list of ORFs to include as the background for GTF. For example, if DAmPs were excluded, use the 17\_0503\_ORF\_list\_without\_DAmPs.txt file.

![][image20]

In the Function and Component folders, run the “**Function\_example\_v2.sh**” and “**Component\_example\_v2.sh**” scripts, respectively, using the same arguments as for Process\_example\_v4.sh.

\*\*\* If you do not run these scripts in separate folders, you will overwrite the results from one ontology with another \*\*\*

New .terms.txt and .tsv text files should be generated in your **Date\_DescriptionOfResults\_REMcReady/** folder as GTF runs through the .txt files for all clusters (this can take 10 minutes to an hour depending on the total number of clusters). The example below shows the files created after running GTF.
![][image21]

4\) **Concatenate the GTF results.** Use the Concatenate\_GTF\_results.py script to concatenate the GTF results for each ontology by running it in the folder for each respective ontology. The script can be used in the following way:

python Concatenate\_GTF\_results.py Date\_DescriptionOfResults\_REMcReady/ Date\_GTF\_Results\_**Ontology**\_DescriptionOfResults.txt

![][image22]

\-Make sure to change “Ontology” in the above line to either “Process”, “Function”, or “Component”.

Lastly, I prefer to compile the concatenated results sheets into one Excel document. To do this, I copy the files to my computer using FileZilla, copy and paste the tables into separate worksheets in the same Excel document, and save this file as Date\_GTF\_Results\_DescriptionOfResults.xlsx

**1.1.2 Heatmap generation for REMc**. Heatmaps can be generated automatically from REMc clusters using the Rscript “**18\_0205\_heatmaps\_zscores\_2SD\_color\_NARem\_Z\_lm.R**”. This file can be found in **/media/data/Santos\_Sean/Rscripts/REMc\_Heatmaps/**. Before we make heatmaps, we prefer to add the shift values back with the interaction scores in Date\_Description\_REMcReady.csv-finalTable.csv.

1\) **Add back the shift values**. Use “**AddShiftVals.R**” to add the shift values back to the Date\_Description\_REMcReady.csv-finalTable.csv. Use the script in the following format:

Rscript AddShiftVals.R Date\_Description\_REMcReady.csv-finalTable.csv Date\_Description\_Shift.csv Date\_Description\_REMcReady.csv-finalTable**WithShift**.csv

Note: ‘Date\_Description\_Shift.csv’ was saved when the shift values were removed just prior to REMc (see section 1.1, step 5).

This will generate a file with both the interaction scores and the shift values in the finalTable.csv format. However, we will need to **reorganize** the Date\_Description\_REMcReady.csv-finalTable**WithShift**.csv so that each shift value is correctly ordered to the left of its associated Z\_lm interaction score.
Open the Date\_Description\_REMcReady.csv-finalTable**WithShift**.csv file in Excel and manually move the shift columns to their correct locations.

2\) **Run the heatmap script**. Use **18\_0205\_heatmaps\_zscores\_2SD\_color\_NARem\_Z\_lm.R** to generate the heatmaps. Use this script as described below:

Rscript 18\_0205\_heatmaps\_zscores\_2SD\_color\_NARem\_Z\_lm.R Date\_Description\_REMcReady.csv-finalTable**WithShift**.csv Heatmaps/

\-This will create a folder called “Heatmaps” in your directory with PDFs for all REMc clusters.

3\) **Concatenate the heatmaps into one PDF**. Use the PDF toolkit (pdftk) to concatenate the heatmaps by running the following line on the command line in your working folder:

pdftk Heatmaps/\*.pdf output Date\_Description\_Heatmaps.pdf

**1.2** **GTA Analysis**. Gene Ontology (GO) term averaging (GTA) is performed first on a single experiment (1.2.1), and the results can later be compared pairwise to other GTA experiments (1.2.2) by generating interactive plots. GTA assigns each GO term an average Z score based on the interaction z-scores (L- or K-based) for all genes in that GO term, and then assigns significance to GO terms whose average Z-score is still above 2 in absolute value after subtracting the standard deviation. In general, we have observed that GTA tends to identify smaller GO terms that may not have been identified by REMc/GTF.

**1.2.1 Generate GTA results.** Open a terminal in, or navigate in the terminal to, /media/data/Santos\_Sean/Q\_HTCP\_Analysis/. In this folder we will use the ScoreAllGOTerms\_From\_Z\_lm\_V2.R script to generate GTA scores. Run this script on your experiment using the following arguments/files:

Rscript ScoreAllGOTerms\_From\_Z\_lm\_V2.R Exp\_Name/Analysis1/ZScores\_Interaction.csv go\_terms.tab gene\_association.sgd Exp\_Name/GTA\_Results/

\-Several files will be generated in the specified directory, but the file with all of the GTA results is named “**Average\_GOTerms\_All.csv**” and will be used in the next step of pairwise analysis of experiments.
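
To make the GTA scoring rule from section 1.2 concrete, here is a minimal Python sketch that averages interaction z-scores per GO term and flags terms whose mean is still at or beyond 2 in absolute value after subtracting one standard deviation. It is an illustration under assumptions, not the code in ScoreAllGOTerms\_From\_Z\_lm\_V2.R: the function name `score_go_terms`, the toy ORF names and GO labels, the use of the sample standard deviation, and the treatment of single-gene terms (sd taken as 0) are all hypothetical choices, and the real script may handle these cases differently.

```python
from statistics import mean, stdev


def score_go_terms(z_by_orf, orfs_by_term, threshold=2.0):
    """Average z-scores per GO term and flag terms that pass the cutoff.

    Returns {term: (mean_z, sd_z, significant)}. A term is flagged when
    |mean_z| - sd_z >= threshold (an assumed reading of the GTA rule).
    """
    results = {}
    for term, orfs in orfs_by_term.items():
        # Keep only ORFs that actually have a score in this experiment.
        zs = [z_by_orf[orf] for orf in orfs if orf in z_by_orf]
        if not zs:
            continue  # no scored genes for this term, so nothing to average
        mean_z = mean(zs)
        # stdev() needs at least two values; single-gene terms get sd = 0 here.
        sd_z = stdev(zs) if len(zs) > 1 else 0.0
        results[term] = (mean_z, sd_z, abs(mean_z) - sd_z >= threshold)
    return results


# Toy data: made-up ORF names, GO labels, and z-scores (not real results).
z_scores = {"YXX001C": 4.0, "YXX002W": 5.0, "YXX003W": 0.5, "YXX004C": -0.5}
go_terms = {
    "GO:term_A": ["YXX001C", "YXX002W"],
    "GO:term_B": ["YXX003W", "YXX004C"],
    "GO:term_C": ["YXX999W"],  # no scored genes, so it is skipped
}
scores = score_go_terms(z_scores, go_terms)
for term, (mean_z, sd_z, significant) in scores.items():
    print(f"{term}\tmean={mean_z:.2f}\tsd={sd_z:.2f}\tsignificant={significant}")
```

In this toy example only the first term is flagged, because its mean stays well above 2 after the standard deviation is subtracted, while the second term averages to zero.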
![][image23]

**1.2.2 Pairwise GTA analysis.** Comparing GTA between two experiments creates the interactive plots and associated tables for GTA identified GO terms. This analysis can be performed in the /media/data/Santos\_Sean/Q\_HTCP\_Analysis/ directory using the following scripts: **Compare\_GTF\_Averages\_BetweenScreens\_lm\_v2.R** for L values and **Compare\_GTF\_Averages\_BetweenScreens\_lm\_Kvals\_v2.R** for K values. The two scripts take the same arguments (you only need to choose which script, L or K, to name after Rscript on the command line) and can be run using the following:

Rscript Compare\_GTF\_Averages\_BetweenScreens\_lm\_v2.R Exp1/GTA\_Results/Average\_GOTerms\_All.csv Exp1\_Name Exp2/GTA\_Results/Average\_GOTerms\_All.csv Exp2\_Name Pairwise\_Comparisons/Exp1\_vs\_Exp2/GTA\_L/

\-In the Pairwise\_Comparisons directory, you will need to create the folder for your comparison first, because the Rscript can only create one new directory at a time.

**1.3 Term Specific Heatmaps**

Open a terminal in, or navigate in the terminal to, /media/data/Santos\_Sean/Q\_HTCP\_Analysis/; the term specific heatmap Rscripts are in this directory.

\-Depending on the number of experiments you want to compare, select one of the following scripts: **GO\_list\_All\_ChildTerms\_lmZscore\_max100child\_Heatmaps\_V2.R** to compare two experiments, **GO\_list\_All\_ChildTerms\_lmZscore\_max100child\_Heatmaps\_3terms\_V2.R** for 3 experiments, **GO\_list\_All\_ChildTerms\_lmZscore\_max100child\_Heatmaps\_4terms\_V2.R** for 4 experiments, and **GO\_list\_All\_ChildTerms\_lmZscore\_max100child\_Heatmaps\_5terms\_V2.R** for 5 experiments.

![][image24]

Follow the directions for the Rscript appropriate for the number of experiments you are comparing by opening the Rscript in either RStudio or gedit and looking at what arguments to use. The number of arguments increases with the number of experiments being compared.
The GO\_list\_All\_ChildTerms\_lmZscore\_max100child\_Heatmaps\_V2.R Rscript can be run from the command line using the following:

Rscript GO\_list\_All\_ChildTerms\_lmZscore\_max100child\_Heatmaps\_V2.R Exp1\_ZScores\_Interaction.csv Exp1\_Name Exp2\_ZScores\_Interaction.csv Exp2\_Name gene\_ontology\_edit.obo go\_terms.tab All\_SGD\_GOTerms\_for\_QHTCPtk.csv Pairwise\_Comparisons/Exp1\_vs\_Exp2/TermSpecificHeatmaps/

\-The All\_SGD\_GOTerms\_for\_QHTCPtk.csv file can be replaced by another file with a list of GO\_IDs if you only want to run the script on certain GO terms. I’ve created a few examples in the following path: /media/data/Santos\_Sean/Q\_HTCP\_Analysis/Pairwise\_Comparisons/Query\_Term\_Lists/

\-For more than 2 experiments, there will be additional arguments for each Exp\#\_ZScores\_Interaction.csv and Exp\#\_Name.

**3\. Other Rscripts**

**3.1 Pairwise Venn Diagrams and CPP Correlation plots to compare two experiments.** I generally perform this analysis right after performing GTA and creating a folder for the two experiments I am comparing in the **Pairwise\_Comparisons** folder. Use the Rscript at the following file path: /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**Compare\_Pairwise\_Overlap\_VennDiagrams\_confined\_to\_matches\_V2.R**

Usage (shown as run from /media/data/Santos\_Sean/Q\_HTCP\_Analysis/):

Rscript Compare\_Pairwise\_Overlap\_VennDiagrams\_confined\_to\_matches\_V2.R Exp1/Exp1\_Analysis1/Exp1\_InteractionScores.csv Exp1\_Name Exp2/Exp2\_Analysis1/Exp2\_InteractionScores.csv Exp2\_Name Pairwise\_Comparisons/Exp1\_vs\_Exp2/VennDiagrams\_and\_Correlation/

This script outputs Venn diagrams for the overlap between the two experiments when comparing enhancers and suppressors (defined as |Z| \> 2). It also writes .csv tables listing the intersecting genes and the genes that are experiment-specific deletion enhancers or suppressors.

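
The overlap logic behind these Venn diagrams can be sketched in a few lines of Python. This is a hypothetical illustration, not the code in Compare\_Pairwise\_Overlap\_VennDiagrams\_confined\_to\_matches\_V2.R. It assumes that scores are supplied as dictionaries keyed by ORF, that only ORFs present in both experiments are compared (suggested by “confined\_to\_matches” in the script name), and that hits are split by sign into a positive group (Z above the cutoff) and a negative group (Z below minus the cutoff). Which sign corresponds to enhancers or suppressors is not stated here, so the sketch uses neutral labels; the function names and toy values are made up.

```python
def classify(z_by_orf, cutoff=2.0):
    """Split ORFs into positive (z > cutoff) and negative (z < -cutoff) hits."""
    positive = {orf for orf, z in z_by_orf.items() if z > cutoff}
    negative = {orf for orf, z in z_by_orf.items() if z < -cutoff}
    return positive, negative


def venn_sets(exp1, exp2, cutoff=2.0):
    """Return shared and experiment-specific hit sets for two experiments."""
    # Restrict the comparison to ORFs scored in both experiments.
    shared = exp1.keys() & exp2.keys()
    hits1 = classify({orf: exp1[orf] for orf in shared}, cutoff)
    hits2 = classify({orf: exp2[orf] for orf in shared}, cutoff)
    out = {}
    for label, a, b in zip(("positive", "negative"), hits1, hits2):
        out[label] = {"both": a & b, "exp1_only": a - b, "exp2_only": b - a}
    return out


# Toy z-scores with made-up ORF labels.
exp1 = {"ORF_A": 3.1, "ORF_B": -2.5, "ORF_C": 0.4, "ORF_D": 2.2}
exp2 = {"ORF_A": 2.6, "ORF_B": -0.3, "ORF_C": 2.9, "ORF_E": 5.0}
overlap = venn_sets(exp1, exp2)
for label, groups in overlap.items():
    for group, orfs in groups.items():
        print(label, group, sorted(orfs))
```

ORF\_D and ORF\_E are ignored in this toy run because each is scored in only one experiment, which is the behavior assumed for a comparison confined to matching ORFs.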
The script also performs a CPP comparison by plotting the L, K, R, and AUC values across experiments and prints a correlation coefficient (R²) for a linear regression fit. It also creates a graph plotting the ranked CPP scores against each other. There will be four output folders, but using the updated model we only want to look at the “**lm\_both**” results. **Ignore** the “Avg\_Zscore\_Both (alex’s method)”, “lm\_exp1\_only”, and “lm\_exp2\_only” folders (these compare Alex’s method and the updated linear model method). I can remove these extra comparisons if they are too confusing.

**3.2 CPP analysis for YKO/YKD/RF for one experiment (all CPPs by all CPPs).** Use the Rscript at the following file path: /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**CPP\_Comaprison\_with\_DAmPs\_and\_RF.R**

Usage (shown as run from /media/data/Santos\_Sean/Q\_HTCP\_Analysis/):

Rscript CPP\_Comaprison\_with\_DAmPs\_and\_RF.R ExpName/ExpName\_Analysis1/ExpName\_InteractionScores.csv ExpName/ExpName\_Analysis1/ExpName\_**RF\_InteractionScores**.csv 16\_0531\_DAmPs\_Only.csv ExpName/ExpName\_Analysis1/CPP\_Compare/

\-Three PDFs will be created after this analysis: 1\) CPPs for the YKO only. 2\) CPPs from YKO and YKD as different colors. 3\) YKO, YKD and RF as different colors.

\-After generating these files, I check that the RF interaction scores generally fall within the range of \-2 to \+2. I also look to see if the DAmPs have a different distribution than the YKO.

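
The check that RF interaction scores stay within \-2 to \+2 can also be done numerically rather than by eye. The Python sketch below counts how many scores in one column fall outside that range. It is an illustration only: the helper name `fraction_outside`, the column name `Z_lm_L`, the toy table, and the assumption that missing values appear as empty cells or “NA” are not taken from the RF\_InteractionScores file, so adjust them to match your data.

```python
import csv
import io


def fraction_outside(rows, column, limit=2.0):
    """Count scores in `column` whose absolute value exceeds `limit`.

    Returns (n_outside, n_scored, fraction). Empty cells and "NA" are skipped.
    """
    values = []
    for row in rows:
        cell = row.get(column)
        if not cell or cell == "NA":
            continue  # skip missing values
        values.append(float(cell))
    if not values:
        return 0, 0, 0.0
    n_outside = sum(1 for v in values if abs(v) > limit)
    return n_outside, len(values), n_outside / len(values)


# Toy table standing in for an RF interaction score file (made-up values).
toy_csv = "OrfRep,Z_lm_L\nRF_1,0.4\nRF_2,-1.2\nRF_3,2.7\nRF_4,NA\n"
rows = list(csv.DictReader(io.StringIO(toy_csv)))
n_out, n_scored, frac = fraction_outside(rows, "Z_lm_L")
print(f"{n_out} of {n_scored} RF scores fall outside +/-2 ({frac:.0%})")
```

For a real file, replace the `io.StringIO(toy_csv)` source with an opened file handle and repeat the call for each score column of interest.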
**3.3 Heatmaps with homologs.** Use the Rscript at the following file path: /media/data/Santos\_Sean/GTF\_files/**20\_0328\_heatmaps\_Z\_lm\_wDAmPs\_andHomology.R**

It uses the following arguments:

1) EXP\_REMcReady.csv-finalTableWithShift.csv

2) Output folder

3) 17\_0503\_DAmPs\_Only.txt (see path below)

4) Yeast\_Human\_Homology\_Mapping\_biomaRt\_18\_0920.csv (see path below)

Usage:

Rscript 20\_0328\_heatmaps\_Z\_lm\_wDAmPs\_andHomology.R EXP\_REMcReady.csv-finalTableWithShift.csv Heatmaps\_Homologs/ /media/data/Santos\_Sean/GTF\_files/17\_0503\_DAmPs\_Only.txt /media/data/Santos\_Sean/GTF\_files/Yeast\_Human\_Homology\_Mapping\_biomaRt\_18\_0920.csv

\-The script will generate a folder with heatmaps including homology info for the genes in the finalTable file, plus two .csv files: one with all the yeast genes from the original finalTable (even those without homologs) and one with only the homologs.

**4.0 Appendix.**

**4.1** **Using R to merge files** (see REMc 1.1 step 1). R can be used to merge two tables using the join function from the plyr package. If you have not installed this package (you only need to do this once), open RStudio, type install.packages("plyr") into the console, and follow the instructions to install the package. In RStudio, modify the following lines by changing the highlighted portion to the path to your files:

\#open required library for the join function

library(plyr)

\#read in the files for your experiment

X1 \<- read.csv(file="FilePath/Exp1\_ZScores\_Interaction.csv",stringsAsFactors \= FALSE)

X2 \<- read.csv(file="FilePath/Exp2\_ZScores\_Interaction.csv",stringsAsFactors \= FALSE)

\#join the two files, **list the larger file first** – in this example X2 has the larger number of genes.
\#if X1 has a larger number of genes, switch the order of X1 and X2

X \<- join(X2,X1,by="OrfRep")

\#write new file

write.csv(X,file \= "FilePath/DescriptiveName\_withNAs.csv",row.names=F)

\-If you need to join more than two tables, you will have to write more than one join. First join two of the tables to create an object, then join that object with the next file, and so on. See below for an example with multiple files:

\#open required library for the join function

library(plyr)

\#read in the files for your experiment

X1 \<- read.csv(file="FilePath/Exp1\_ZScores\_Interaction.csv",stringsAsFactors \= FALSE)

X2 \<- read.csv(file="FilePath/Exp2\_ZScores\_Interaction.csv",stringsAsFactors \= FALSE)

X3 \<- read.csv(file="FilePath/Exp3\_ZScores\_Interaction.csv",stringsAsFactors \= FALSE)

X4 \<- read.csv(file="FilePath/Exp4\_ZScores\_Interaction.csv",stringsAsFactors \= FALSE)

\#join the files, list the largest file first – in this example X2 has the largest number of genes.

\#if X1 has a larger number of genes, switch the order of X1 and X2

X \<- join(X2,X1,by="OrfRep")

X \<- join(X,X3,by="OrfRep")

X \<- join(X,X4,by="OrfRep")

\#write new file

write.csv(X,file \= "FilePath/DescriptiveName\_withNAs.csv",row.names=F)

**4.2 Removing DAmPs** Use the R script at the following path: /media/data/Santos\_Sean/Q\_HTCP\_Analysis/Exclude\_DAmPs.R

Rscript Exclude\_DAmPs.R Input\_File.csv 17\_0503\_DAmPs\_Only.txt output\_file.csv

Arg 1 – Any file with an OrfRep column that you want to remove the DAmPs from, for example ZScores\_Interaction.csv, REMcReady.csv, or \-finalTable.csv files

Arg 2 \- /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**17\_0503\_DAmPs\_Only.txt**

Arg 3 – Output file name; make sure to include the .csv extension.
**\-Alternatively, create a new script by copying and pasting the following lines into R and modifying the highlighted sections to match your files:**

X \<- read.csv(file="path/FileToRemoveDAmPsFrom.csv",stringsAsFactors \= FALSE)

Damps \<- read.delim("filepath/**17\_0503\_DAmPs\_Only.txt**",header=F)

\#create a column in X called ORF so we can remove OrfRep numbers and find all the DAmPs

X$ORF \<- X$OrfRep

\#remove \_1-4 from newly created ORF column

X$ORF \<- gsub("\_1","",x=X$ORF)

X$ORF \<- gsub("\_2","",x=X$ORF)

X$ORF \<- gsub("\_3","",x=X$ORF)

X$ORF \<- gsub("\_4","",x=X$ORF)

X \<- X\[\!(X$ORF %in% Damps$V1),\]

write.csv(X,file \= "path/output\_file\_noDAmPs.csv",row.names \= FALSE)

\*The **17\_0503\_DAmPs\_Only.txt** file is at the following path on the server: /media/data/Santos\_Sean/GTF\_Analysis/**17\_0503\_DAmPs\_Only.txt**

\-The above script could also be used to remove any set of genes from another, but you would substitute the 17\_0503\_DAmPs\_Only.txt with a set of ORFs (without the \_1-4 suffix) saved as a tab delimited file with one ORF per line and no header.

**4.3 Adjust ZScores for YKO only and remove DAmPs**

\-We discussed a script that would remove the DAmPs and also adjust the ZScores for CPPs to only consider the YKO strains in the z-score calculation. Use the file at the following path: /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**Adjust\_YKO\_Zscores\_RemoveDAmPs.R**

Usage:

Rscript Adjust\_YKO\_Zscores\_RemoveDAmPs.R ZScores\_Interaction.csv 17\_0503\_DAmPs\_Only.txt AdjustedZScores\_noDAmPs/

Arg 1 – The ZScores\_Interaction.csv file in which you want to adjust the YKO ZScores and remove the DAmPs

Arg 2 \- /media/data/Santos\_Sean/Q\_HTCP\_Analysis/**17\_0503\_DAmPs\_Only.txt**

Arg 3 – A directory path for the files that will be created (a new ZScores\_Interaction.csv with the adjusted scores in the Z\_lm\_ columns, a scatterplot of the initial vs adjusted Z scores, and new rank plots for the adjusted scores).

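
The ORF matching used when removing DAmPs (strip the replicate suffix from OrfRep, then drop rows whose ORF is in the exclusion list) can also be sketched in Python. This is a hypothetical illustration, not the contents of Exclude\_DAmPs.R or Adjust\_YKO\_Zscores\_RemoveDAmPs.R, and it does not cover the z-score adjustment. The function names, the toy rows, and the made-up ORF names are assumptions. One deliberate difference from the gsub calls in the R snippet: the regular expression here only removes a \_1 to \_4 suffix at the end of the name, whereas gsub removes the pattern wherever it occurs.

```python
import re

# Replicate suffix (_1 to _4) anchored to the end of the OrfRep name.
SUFFIX = re.compile(r"_[1-4]$")


def base_orf(orf_rep):
    """Return the ORF name with any trailing _1.._4 replicate suffix removed."""
    return SUFFIX.sub("", orf_rep)


def exclude_orfs(rows, excluded):
    """Keep only rows whose base ORF is not in the exclusion list."""
    excluded = set(excluded)
    return [row for row in rows if base_orf(row["OrfRep"]) not in excluded]


# Toy rows with made-up ORF names; a real table would come from csv.DictReader.
rows = [
    {"OrfRep": "YXX001C", "Z_lm_L": "2.4"},
    {"OrfRep": "YXX002W_1", "Z_lm_L": "-0.3"},
    {"OrfRep": "YXX003W_2", "Z_lm_L": "1.1"},
]
damps = ["YXX002W"]  # one ORF per entry, no suffix, as in the exclusion file
kept = exclude_orfs(rows, damps)
print([row["OrfRep"] for row in kept])
```

As with the R version, any list of ORFs can be passed as the exclusion list, so the same pattern works for removing gene sets other than the DAmPs.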
**5.4 Files that are updated from databases.**

gene\_ontology\_edit.obo \- [www.geneontology.org/ontology/gene\_ontology\_edit.obo](http://www.geneontology.org/ontology/gene\_ontology\_edit.obo)

SGD\_features.tab \- [https://downloads.yeastgenome.org/curation/chromosomal\_feature/](https://downloads.yeastgenome.org/curation/chromosomal\_feature/)

go\_terms.tab \- [https://downloads.yeastgenome.org/curation/literature/](https://downloads.yeastgenome.org/curation/literature/)

\*\*\*\*\*\*\*\*\*\*Adding information about how JH updated files in summer of ‘23:

Updating Q-HTCP Source Files:

**gene\_ontology\_edit.obo**

Direct link to the latest file: [https://purl.obolibrary.org/obo/go.obo](https://purl.obolibrary.org/obo/go.obo)

^^Copy this into a new text file and give it the same name (gene\_ontology\_edit.obo).
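If you would rather not copy and paste by hand, the download can be scripted. This is only a sketch: the helper name `update_obo` is hypothetical, and the check on the first line assumes the file starts with the usual OBO `format-version:` header.

```r
# Hypothetical helper: download the ontology and save it under the file name
# used in this README (gene_ontology_edit.obo).
update_obo <- function(url = "https://purl.obolibrary.org/obo/go.obo",
                       dest = "gene_ontology_edit.obo") {
  status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  # Sanity check (assumption: OBO files begin with a 'format-version:' line).
  first <- readLines(dest, n = 1, warn = FALSE)
  if (length(first) == 0 || !grepl("^format-version:", first)) {
    warning("downloaded file does not look like an OBO file")
  }
  invisible(dest)
}

# Needs network access, so it is left commented out here:
# update_obo()
```

Consider keeping a dated copy of the previous file before overwriting it, so an analysis can be rerun against the older ontology if needed.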
More info about the file, where it comes from, and the Gene Ontology consortium: [https://geneontology.org/docs/download-ontology/](https://geneontology.org/docs/download-ontology/)

You can also use SGD's GO Term Finder, which always has the latest ontology and annotations: [https://www.yeastgenome.org/goTermFinder](https://www.yeastgenome.org/goTermFinder)

**SGD\_features.tab**

Best to use YeastMine – see notes and files from work/yeast strains/ **’23\_0914\_NewMPFile\_Construction.xlsx’**

Need to use the new, updated KO ORF list (made compatible with the final SGD edition of the genome) to get the new gene names, and just replace them in an existing file to update.

**go\_terms.tab**

The go\_terms.tab file can be generated using YeastMine with a single click.
From the ‘Retrieve GO Terms’ template ([https://yeastmine.yeastgenome.org/yeastmine/template.do?name=GO\_Terms\_Tab\&scope=all](https://yeastmine.yeastgenome.org/yeastmine/template.do?name=GO\_Terms\_Tab\&scope=all)), click the green ’Show Results’ button, voila. You can download the file via the ‘Export’ button on the results page.

The updated go\_terms.tab file was problematic \- it differed from the original in the style of the (col 1\) GO\_ID entries (e.g., ‘GO:0000001’ instead of ‘1’) and the (col 3\) entries (‘biological\_process’ instead of ‘P’). It also had 1,122 fewer rows/entries (42,887 instead of 44,009). The above columns can be fixed by using “text to columns” with ‘:’ as the delimiter, and then doing find/replace for P, F, and C.

File updated 2024\_0125 (42,442 rows). Original had 44,009 and last update 42,887… so numbers are getting smaller. One can find updates at:

Gene Ontology FAQ: [https://geneontology.org/docs/faq/\#ontology](https://geneontology.org/docs/faq/\#ontology)

**gene\_association.sgd**

The gene\_association.sgd file (‘gaf’) is still served from the Downloads site and can also be accessed even more easily via SGD search.
Downloads site: [http://sgd-archive.yeastgenome.org/curation/literature/](http://sgd-archive.yeastgenome.org/curation/literature/)

From SGD Search (search for ‘gaf’, then click category ‘Downloads’): [https://www.yeastgenome.org/search?q=gaf\&category=download\&status=Active](https://www.yeastgenome.org/search?q=gaf\&category=download\&status=Active)

\*\*\*\*\*\*\*\*\*\*\*\*

[image1]:
[image2]:
[image3]:
[image4]:
[image5]:
[image6]:
[image7]:
[image8]:
[image9]:
[image10]:
[image11]:
[image12]:
[image13]:
[image14]:
[image15]:
[image16]:
[image17]:
[image18]:
[image19]:
[image20]:
[image21]:
[image22]:
[image23]:
[image24]:
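Note on the go\_terms.tab column fixes described under 5.4: the “text to columns” and find/replace steps could also be done in R. The sketch below is an illustration under assumptions, not a tested part of the pipeline: the helper name `normalize_go_terms` is hypothetical, the toy table has only three columns, and it assumes column 1 holds the GO ID and column 3 holds the namespace, as described in that note.

```r
# Hypothetical helper: convert YeastMine-style values back to the older style
# ('GO:0000001' -> 1 in column 1, 'biological_process' -> 'P' in column 3).
normalize_go_terms <- function(go) {
  # Column 1: drop the 'GO:' prefix; as.integer() also drops leading zeros.
  go[[1]] <- as.integer(sub("^GO:", "", go[[1]]))
  # Column 3: full namespace name -> single-letter code.
  codes <- c(biological_process = "P",
             molecular_function = "F",
             cellular_component = "C")
  ns <- as.character(go[[3]])
  known <- ns %in% names(codes)
  ns[known] <- unname(codes[ns[known]])  # unknown values are left untouched
  go[[3]] <- ns
  go
}

# Toy rows for illustration; a real go_terms.tab has many more rows.
example <- data.frame(
  V1 = c("GO:0000001", "GO:0003674", "GO:0005575"),
  V2 = c("mitochondrion inheritance", "molecular_function", "cellular_component"),
  V3 = c("biological_process", "molecular_function", "cellular_component"),
  stringsAsFactors = FALSE
)
fixed <- normalize_go_terms(example)
print(fixed[, c(1, 3)])
```

Compare a few rows of the result against the original file before using it, since the downstream scripts may depend on details of the old format that are not covered here.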