EASY rewrite

This commit is contained in:
2024-07-23 10:54:58 -04:00
parent 0440912859
commit 811a1c2dc4
26 changed files with 5472 additions and 167 deletions


@@ -247,7 +247,8 @@ install_dependencies() {
depends_brew=(graphviz pandoc gd pdftk-java shdoc nano rsync)
depends_perl=(File::Map ExtUtils::PkgConfig GD GO::TermFinder)
depends_r=(BiocManager ontologyIndex ggrepel tidyverse sos openxlsx ggplot2
plyr extrafont gridExtra gplots stringr plotly ggthemes pandoc rmarkdown)
plyr extrafont gridExtra gplots stringr plotly ggthemes pandoc rmarkdown
plotly htmlwidgets)
depends_bioc=(org.Sc.sgd.db)
[[ $1 == "--get-depends" ]] && return 0 # if we just want to read the depends vars
@@ -341,6 +342,10 @@ init_project() {
module easy
# @section EASY
# @description Start an EASY analysis
# Eliminated EstartConsole.m
# TODO Don't create output in the scans folder, put it in an output directory
# TODO The !!Results output files need standardized naming
# TODO Don't perform directory operations in EASY
# * The QHTCPImageFolders and 'MasterPlateFiles' folder are the inputs for image analysis with EASY software.
# * EASY will automatically generate a 'Results' directory (within the ExpJobs/'ExperimentJob' folder) w/ timestamp and an optional short description provided by the user (Fig.2).
# * The 'Results' directory is created and entered, using the "File >> New Experiment" dropdown in EASY.
@@ -400,9 +405,34 @@ module easy
# * When finished, the '!!ResultsStd_.txt' will be about the same file size and it should be used in the following StudiesQHTCP analysis.
# * 'NoGrowth_.txt', and 'GrowthOnly_.txt' files will be generated in the 'PrintResults' folder.
#
# Issues:
# * We need full documentation for all of the current workflow. There are different documents that need to be integrated. This will need to be updated as we make improvements to the system.
# * MasterPlate_ file must have ydl227c in the orf column, or else Z_interaction.R will fail because it cannot calculate shift values.
# * Make sure there are no special characters; e.g., (), “, , ?, etc.; dash and underscore are ok as delimiters
# * Drug_Media_ file must have letter character to be read as text.
# * MasterPlate_ file and DrugMedia_ are .xlsx or .xls, but !!Results_ is .txt.
# * Does Z_interactions.R require a zero concentration/perturbation (should we use zero for the low conc, even if it's not zero), e.g., in order to do the shift correctly?
# * Need to enable all file types (not only .xls) as the default for GenerateResults (to select MP and DM files as .xlsx).
# * Explore differences between the ELR and STD files - 24_0414; John R modified Z script to format ELR file for Z_interactions.R analysis.
# * To keep time stamps when transferring with FileZilla, go to the transfer drop down and turn it on, see https://filezillapro.com/docs/v3/advanced/preserve-timestamps/
# * Could we change the MasterPlateFiles folder label in EASY to MasterPlate_DrugMedia (since there should be only one MP and a DM file is also required)?
# * I was also thinking of adding a MasterPlateFilesOnly folder to the QHTCP directory template where one could house different MPFiles (e.g., with and without damps, with and without Refs on all MPs, etc; other custom MPFiles, updated versions, etc)
# * Currently updated files are in 23_1011_NewUpdatedMasterPlate_Files on Mac (yeast strains/23_0914…/)
# * For EASY to report cell array positions (plate_row_column) to facilitate analyzing plate artifacts. The MP File in Col 3 is called LibraryLocation and is reported after Specifics in the !!Results.
# * Can EASY/StudiesQ-HTCP be updated at any time by rerunning with updated MP file (new information for gene, desc, etc)- or maybe better to always start with a new template?
# * Need to be aware of file formatting to avoid dates (e.g., with gene names like MAY24, OCT1, etc, and with plate locations 1E1, 1E2, etc)- this has been less of a problem.
# * In StudiesQHTCP folders, remember to annotate Exp1, Exp2, in the StudyInfo.csv file.
# * Where are gene names called from for labeling REMc heatmaps, TSHeatmaps, Z-interaction graphs, etc.? Is this file in the QHTCP code folder, or is it in the results file (and thus ultimately the MP file)?
# * Is it ok for a MasterPlate_ file to have multiple sheets (e.g., readme tab- is only the first tab read in)?
# * What are the rules for pulling information from the MasterPlateFile to the !!Results_ (e.g., is it the column or the Header Name, etc that is searched? Particular cells in the DrugMedia file?).
# * Modifier, Conc are from DM sheet, and refer to the agar media arrays. OrfRep is from MasterPlate_ File. Specifics (Last Column) is experiment specific and accommodate designs involving differences across the multi-well liquid arrays. StrainBkGrd (now Library location) is in the 3rd column and reported after Specifics at the last col of the !!Results.. file.
# * Do we have / could we make an indicator- work in progress or idle/complete with MP/DM and after gen-report. Now, we can check for the MPDMmat.mat file, or we can look in PrintResults, but would be nice to know without looking there.
# * File>>Load Experiment wasn't working (no popup to redirect). Check this again.
easy() {
debug "Running: ${FUNCNAME[0]}"
EASY="/mnt/data/EASY/EasyDev2024/BU/EASY240430AppExported/EstartConsole.m"
#EASY="/mnt/data/EASY/EasyDev2024/BU/EASY240430AppExported/EstartConsole.m"
EASY="/mnt/data/EASY/EasyDev2024/BU/EASY240505AppExported/EASYConsole.m"
pushd "$SCAN_DIR" || return 1
@@ -447,6 +477,8 @@ easy() {
! ((YES)) && ask "Start EASY in MATLAB? This requires a GUI." && matlab -nosplash -r "$EASY"
# glob EASY output and make sure it exists
# currently this is just for informative purposes of how to glob some of the EASY output files
# The EASY output files need to be standardized
shopt -s nullglob
EASY_RESULTS_DIRS=( Results* )
shopt -u nullglob
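
The glob above only collects candidates. A minimal sketch of the existence check that the comment mentions could look like the following. The function name `find_results_dirs` and its return convention are hypothetical and not part of this commit; only the `Results*` pattern and the `nullglob` toggling are taken from the code above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): print every Results* directory
# found directly under $1, one per line, or return 1 if there are none.
find_results_dirs() {
  local dir=$1
  local -a matches=()
  # With nullglob set, an unmatched pattern expands to nothing
  # instead of staying in the array as the literal string "Results*/".
  shopt -s nullglob
  matches=("$dir"/Results*/) # trailing slash restricts matches to directories
  shopt -u nullglob
  # An empty array means EASY has not produced any output here yet
  ((${#matches[@]})) || return 1
  # Strip the trailing slash added by the pattern before printing
  printf '%s\n' "${matches[@]%/}"
}
```

A caller could then write `find_results_dirs "$SCAN_DIR" || echo "No EASY output found"`, which keeps the check separate from the interactive MATLAB step.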
@@ -508,7 +540,11 @@ module qhtcp
# * Do not double-click on the file from the directory.
# * When prompted, navigate to the ExpJobs folder and the PrintResults folder within the correct job folder.
# * Repeat this for every Exp# folder depending on how many experiments are being performed.
# * Note: Before doing this, it's a good idea to compare the ref and non-ref CPP average and median values. If they are not approximately equal, then it may be helpful to standardize Ref values to the measures of central tendency of the Non-refs, because the Ref CPPs are used for the z-scores, which should be centered around zero.
# * This script will copy the chosen !!ResultsStd file (located in /PrintResults in the relevant job folder in /ExpJobs) to the Exp# directory, as can be seen in the “Current Folder” column in MATLAB, and it updates the StudiesDataArchive.txt file that resides in the /StudiesQHTCP folder. **Rename this !!Results file before running the front end; we normally use the STD (not the ELR) file. StudiesDataArchive.txt is a log of file paths used for different studies, including timestamps.
#
# Do this to document the names, dates and paths of all the studies and experiment data used in each study. Note, one should only have a single !!Results… file for each /Exp_ to prevent ambiguity and confusion. If you decide to use a new or different !!Results… sheet from what was used in a previous “QHTCP Study”, remove the one not being used. NOTE: if you copy a !!Results… file in by hand, it will not be recorded in the StudiesDataArchive.txt file and so will not be documented for future reference. If you use the ExpFrontend.m utility it will append the new source for the raw !!Results… to the StudiesDataArchive.txt file.
# As stated above, it is advantageous to think about the comparisons one wishes to make so as to order the experiments in a rational way as it relates to the presentation of plots. That is, which results from sheets and selected interaction … .R, user modified script, is used in /Exp1, Exp2, Exp3 and Exp4 as explained in the following section.
# TODO MUST CLEAN UP QHTCP TEMPLATE DIRECTORY
#
# Code/Directory Structure:
@@ -827,6 +863,7 @@ qhtcp() {
QHTCP_TEMPLATE_DIR="$SCRIPT_DIR/templates/qhtcp"
STUDY_TEMPLATE_DIR="$QHTCP_TEMPLATE_DIR/ExpTemplate"
OUTPUT_DIR="/mnt/data/StudiesQHTCP"
STUDIES_ARCHIVE="$OUTPUT_DIR/StudiesDataArchive.txt"
QHTCP_DIR="$OUTPUT_DIR/$PROJECT"
if [[ -d $QHTCP_DIR ]]; then
@@ -843,23 +880,22 @@ qhtcp() {
fi
# Print current studies
STUDY_INFO_FILE="$QHTCP_DIR/Code/StudyInfo.csv"
[[ -f $STUDY_INFO_FILE ]] &&
echo "Current studies from $STUDY_INFO_FILE" &&
cat "$STUDY_INFO_FILE"
STUDY_INFO="$QHTCP_DIR/Code/StudyInfo.csv"
[[ -f $STUDY_INFO ]] &&
echo "Current studies from $STUDY_INFO" &&
cat "$STUDY_INFO"
# Ask user edit STUDY_INFO_FILE
if ! ((YES)) && ask "Would you like to edit $STUDY_INFO_FILE to add or modify studies?"; then
# Ask user to edit STUDY_INFO
if ! ((YES)) && ask "Would you like to edit $STUDY_INFO to add or modify studies?"; then
cat <<-EOF
Give each experiment the labels you wish to be used for the plots and specific files.
Enter the desired Experiment names and order them in the way you want them to appear in the REMc heatmaps
EOF
nano "$STUDY_INFO_FILE"
nano "$STUDY_INFO"
fi
# Sets STUDIES_NUM
get_studies "$STUDY_INFO_FILE"
get_studies "$STUDY_INFO"
# Initialize missing dirs
for s in "${STUDIES_NUM[@]}"; do
@@ -872,43 +908,26 @@ qhtcp() {
fi
done
# MATLAB stuff
cat <<-EOF
ExpFrontend.m was made for recording into a spreadsheet
('StudiesDataArchive.txt') the date and files used (i.e., directory paths to the
!!Results files used as input for Z-interaction script) for each multi-experiment study.
Run the front end MATLAB programs in the correct order (e.g., run front end in "exp1"
folder to call the !!Results file for the experiment you named as exp1 in the StudyInfo.csv file)
The GTA and pairwise, TSHeatmaps, JoinInteractions and GTF Heatmap scripts use this table
to label results and heatmaps in a meaningful way for the user and others.
The BackgroundSD and ZscoreJoinSD fields will be filled automatically according to user
specifications, at a later step in the QHTCP study process.
Open MATLAB and in the application navigate to each specific /Exp folder,
call and execute ExpFrontend.m by clicking the play icon.
Use the "Open file" function from within Matlab.
Do not double-click on the file from the directory.
When prompted, navigate to the ExpJobs folder and the PrintResults folder within the correct job folder.
Repeat this for every Exp# folder depending on how many experiments are being performed.
The Exp# folder must correspond to the StudyInfo.csv created above.
EOF
if ! ((YES)) &&
ask "Start MATLAB to run ExpFrontend.m? This requires a GUI."; then
matlab -nosplash
fi
mat_exp_frontend
# Enter REMc directory to run the scripts there
pushd "$QHTCP_DIR/REMc" || return 1
# Run modules
r_join_interact
java_extract
r_add_shift_values
r_heat_maps_zscores
r_heat_maps_homology
popd || return 1
for s in "${STUDIES_NUM[@]}"; do
study_dir="$QHTCP_DIR/Exp$s"
# Z_InteractionTemplate.R
r_interactions "$study_dir" "!!Results"*
done
# Enter REMc directory to run the scripts there
pushd "$QHTCP_DIR/REMc" || return 1
# Run modules
r_join_interact
java_extract
r_add_shift_values
r_heat_maps_zscores
r_heat_maps_homology
popd || return 1
}
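
The module documentation asks for a single !!Results file per Exp folder to prevent ambiguity, while the loop in `qhtcp` passes whatever the `"!!Results"*` glob expands to in the current directory. A minimal sketch of a stricter lookup is shown here. The helper name `find_results_file` is hypothetical, and treating zero or multiple matches as an error is an assumption about the desired behavior, not something this commit implements.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): print the single !!Results* file
# inside the study directory $1, or fail if there are zero or several.
find_results_file() {
  local dir=$1
  local -a files=()
  shopt -s nullglob
  # Single quotes keep "!!" literal even in shells with history expansion
  files=("$dir"/'!!Results'*)
  shopt -u nullglob
  if ((${#files[@]} != 1)); then
    printf 'Expected exactly one !!Results file in %s, found %d\n' \
      "$dir" "${#files[@]}" >&2
    return 1
  fi
  printf '%s\n' "${files[0]}"
}
```

Resolving the file relative to the study directory, rather than the caller's working directory, also removes any dependence on where `qhtcp` happens to be invoked.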
@@ -961,6 +980,95 @@ gta() {
# * Functions you do not want to perform by default (submodules should be called modules)
# * Should not call cd or pushd (let module dictate)
submodule mat_exp_frontend
# @description Run the ExpFrontend.m program
mat_exp_frontend() {
debug "Running: ${FUNCNAME[0]}"
# MATLAB stuff
cat <<-EOF
ExpFrontend.m was made for recording into a spreadsheet
('StudiesDataArchive.txt') the date and files used (i.e., directory paths to the
!!Results files used as input for Z-interaction script) for each multi-experiment study.
Run the front end MATLAB programs in the correct order (e.g., run front end in "exp1"
folder to call the !!Results file for the experiment you named as exp1 in the StudyInfo.csv file)
The GTA and pairwise, TSHeatmaps, JoinInteractions and GTF Heatmap scripts use this table
to label results and heatmaps in a meaningful way for the user and others.
The BackgroundSD and ZscoreJoinSD fields will be filled automatically according to user
specifications, at a later step in the QHTCP study process.
Open MATLAB and in the application navigate to each specific /Exp folder,
call and execute ExpFrontend.m by clicking the play icon.
Use the "Open file" function from within Matlab.
Do not double-click on the file from the directory.
When prompted, navigate to the ExpJobs folder and the PrintResults folder within the correct job folder.
Repeat this for every Exp# folder depending on how many experiments are being performed.
The Exp# folder must correspond to the StudyInfo.csv created above.
EOF
if ! ((YES)) &&
ask "Start MATLAB to run ExpFrontend.m? This requires a GUI."; then
matlab -nosplash
fi
# Use a brace group so that return exits the function itself, not a subshell
[[ -f $STUDIES_ARCHIVE ]] || { err "$STUDIES_ARCHIVE missing"; return 1; }
}
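
A note on guards like the missing-file check at the end of `mat_exp_frontend`: the kind of grouping used after `||` decides whether `return` leaves the function. Parentheses start a subshell, so `return` only ends that subshell; braces run in the current shell, so `return` ends the function. The sketch below illustrates the difference. The `err` stub and both function names are stand-ins for illustration, not the script's real helpers.

```shell
#!/usr/bin/env bash
# Stand-in for the script's err helper (assumption: it prints to stderr)
err() { printf 'ERROR: %s\n' "$*" >&2; }

# Parentheses: return only exits the subshell, so execution continues
guard_subshell() {
  [[ -f $1 ]] || (err "$1 missing"; return 1)
  echo "continued"
}

# Braces: return exits the function, so nothing after the guard runs
guard_group() {
  [[ -f $1 ]] || { err "$1 missing"; return 1; }
  echo "continued"
}
```

When the guard is the last command in a function, both forms yield the same exit status, which is why the subshell form can appear to work until more code is added below it.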
submodule r_interactions
# @description Run the R interactions analysis
# TODO don't want to rename Z_InteractionTemplate.R because that will break logic, just edit in place instead
# Is this script interactive?
# @arg $1 string The current working directory
r_interactions() {
debug "Running: ${FUNCNAME[0]}"
cat <<-EOF
In each /Exp# folder, rename the Z_InteractionTemplate.R script according to the experiment focus
Example: Interaction, Experimenter Initials, Experiment Focus --> int_RM_2PE.R
5. Open the renamed interaction script, and edit each one beginning at the ++BEGIN USER DATA SELECTION++
This is designed so that the data of interest for each experiment is appropriately selected from the !!Results…txt file
The user can edit, step through, and test the R script without running through the whole routine by observing the resultant data table created in RStudio.
The Z_InteractionTemplate.R script has a collection of code lines that have been used for prior analyses (generally to select data from various !!Results…txt files), which may be commented out (if not relevant), reused as needed, and/or modified for a new study. These include lines associated with the removal of dAmps, specific concentrations, and items described in the Specifics and Media, i.e., information specific to a particular experiment design. There are also code lines that replace gene names that Excel converts to date format (e.g., OCT1/YKL134C and MAY24/YPR153W) by using only the ORF name, and lines that remove data rows with Blank listed; these lines of code are convenient to reuse. Hopefully, these code lines can be used, commented out, or adapted to aid the user in modifying this section to the specific data requirements of the study. As new user data filter code is developed for each Study (and vetted), those lines can be added to the InteractionTemplate230119.R code in the /StudyTemplate folders to aid in future studies.
6. Open a terminal, navigate to each /Exp# folder, and execute the (customized) “Z_InteractionTemplate_…” script by using the command line below:
Rscript RenamedInteractionTemplate.R \!\!Results… .txt
**need to change wording to choose SD of Delta_Background to exclude Data from analysis.
[1] "Be sure to enter Background noise filter standard deviation i.e., 3 or 5 per Sean"
Enter a Standard Deviation value to noise filter >>
[1] Enter Standard deviation value for removing data for cultures due to high background (e.g., contaminated cultures). Generally set this very high (e.g., 20) on the first run in order NOT to remove data, e.g. 20. Review QC data and inspect raw image data to decide if it is desirable to remove data, and then rerun analysis.
Enter a Background SD threshold for EXCLUDING culture data from further analysis:
The script will ask the user to input a Background Standard Deviation value. This Background value removes data where there is high pixel intensity in the background regions of a spot culture (i.e., suspected contamination). 5 is a minimum recommended value, because lower values result in more data being removed, and this is often undesirable if contamination occurs late, after the carrying capacity of the yeast culture is reached. Choosing the value is most often “trial and error”: there is a Frequency_Delta_Background.pdf report in the /Exp_/ZScores/QC/ folder to evaluate whether the chosen value was suitable (and if not, the analysis can simply be rerun with a more optimal choice). In general, err on the high side, with a BSD of 10 or 12. One can also use EZview to examine the raw images and individual cultures potentially included/excluded as a consequence of the selected value. Background values are reported in the results sheet and so could also be analyzed there.
(For new terminal users, directory navigation tips are described below)
To navigate to the directory, one can use the directory GUI (in X2Go, use the GUI to navigate to the desired operating directory and then, from the File menu, choose “Open in Terminal”).
Alternatively, navigate there through the terminal window: pwd “prints the current working directory”, ls “lists” the subfolders in the current directory. cd followed by the name of the subdirectory will move down into it. “cd ..” changes to the parent directory.
The tab key can be used to autofill unique characters after typing the initial letters of a folder or file you wish to call.
The template structure above assists the user with organization and management of Q-HTCP files and provides a uniform directory structure to streamline reference across different users and experiments.
Since we are systematically comparing perturbations, most Q-HTCP studies will consist of either 2 or 4 experiment subfolders.
The Zscores files are used for subsequent analyses, including REMc, GTA and Term Specific Heatmaps. These further analyses are described below and can be completed in any order and/or concurrently from separate terminals.
**Annotate Files produced and comment out code that produces files that are obsolete or clutter.
EOF
script="$1/Z_InteractionTemplate.R"
debug "$RSCRIPT $script"
"$RSCRIPT" "$script"
# 1. Path to input file
# 2. /output/ directory
# 3. Path to StudyInfo.csv
# 4. Standard deviation value
}
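
The trailing comments in `r_interactions` list a standard deviation value as a planned fourth argument, and the help text recommends 5 as a minimum Background SD. If that value is ever passed on the command line rather than typed at the R prompt, it could be validated first. The following is a minimal sketch under that assumption. The function name `check_background_sd`, its messages, and the choice to warn rather than fail below 5 are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical validator (not in the commit): accept a Background SD
# threshold only if it is a non-negative number, and warn below 5.
check_background_sd() {
  local sd=$1
  # Accept integers or decimals such as 5, 12, or 7.5
  if ! [[ $sd =~ ^[0-9]+([.][0-9]+)?$ ]]; then
    printf 'Background SD must be a number, got: %s\n' "$sd" >&2
    return 1
  fi
  # Compare the integer part in base 10 so a leading zero is not read as octal
  if ((10#${sd%%.*} < 5)); then
    printf 'Warning: Background SD %s is below the recommended minimum of 5\n' "$sd" >&2
  fi
  return 0
}
```

A call such as `check_background_sd "$sd" || return 1` before invoking `"$RSCRIPT"` would reject typos early, while still allowing the very high first-run values (e.g., 20) that the help text suggests.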
submodule r_join_interact
# @description JoinInteractExps3dev.R creates REMcRdy_lm_only.csv and Shift_only.csv
r_join_interact() {