diff --git a/documentation/DCP-documentation/AWS_hygiene_scripts.md b/documentation/DCP-documentation/AWS_hygiene_scripts.md index 31c777c..49a50a5 100644 --- a/documentation/DCP-documentation/AWS_hygiene_scripts.md +++ b/documentation/DCP-documentation/AWS_hygiene_scripts.md @@ -23,7 +23,8 @@ while True: alarms = client.describe_alarms(AlarmTypes=['MetricAlarm'],StateValue='INSUFFICIENT_DATA',NextToken=token) ``` -# Clean out old log groups +## Clean out old log groups + Bash: ```sh diff --git a/documentation/DCP-documentation/advanced_configuration.md b/documentation/DCP-documentation/advanced_configuration.md index 9824f13..384fd0a 100644 --- a/documentation/DCP-documentation/advanced_configuration.md +++ b/documentation/DCP-documentation/advanced_configuration.md @@ -4,6 +4,7 @@ We've tried very hard to make Distributed-CellProfiler light and adaptable, but Below is a non-comprehensive list of places where you can adapt the code to your own purposes. *** + ## Changes you can make to Distributed-CellProfiler outside of the Docker container * **Location of ECS configuration files:** By default these are placed into your bucket with a prefix of 'ecsconfigs/'. @@ -29,14 +30,16 @@ This value can be modified in run.py . * **Distributed-CellProfiler version:** At least CellProfiler version 4.2.4, and use the DOCKERHUB_TAG in config.py as `bethcimini/distributed-cellprofiler:2.1.0_4.2.4_plugins`. * **Custom model: If using a [custom User-trained model](https://cellpose.readthedocs.io/en/latest/models.html) generated using Cellpose, add the model file to S3. We use the following structure to organize our files on S3. -``` + +```text └──    └── workspace      └── model       └── custom_model_filename ``` -* **RunCellpose module:** - * Inside RunCellpose, select the "custom" Detection mode. - In "Location of the pre-trained model file", enter the mounted bucket path to your model. + +* **RunCellpose module:** + * Inside RunCellpose, select the "custom" Detection mode. 
+ In "Location of the pre-trained model file", enter the mounted bucket path to your model. e.g. **/home/ubuntu/bucket/projects//workspace/model/** - * In "Pre-trained model file name", enter your custom_model_filename + * In "Pre-trained model file name", enter your custom_model_filename diff --git a/documentation/DCP-documentation/external_buckets.md b/documentation/DCP-documentation/external_buckets.md index c39e497..e1056e5 100644 --- a/documentation/DCP-documentation/external_buckets.md +++ b/documentation/DCP-documentation/external_buckets.md @@ -1,5 +1,6 @@ # Using External Buckets -Distributed-CellProfiler can read and/or write to/from an external S3 bucket (i.e. a bucket not in the same account as you are running DCP). + +Distributed-CellProfiler can read and/or write to/from an external S3 bucket (i.e. a bucket not in the same account as you are running DCP). To do so, you will need to appropriately set your configuration in run.py. You may need additional configuration in AWS Identity and Access Management (IAM). @@ -21,42 +22,50 @@ If you don't need to add UPLOAD_FLAGS, keep it as the default ''. ## Example configs ### Reading from the Cell Painting Gallery -``` + +```python AWS_REGION = 'your-region' # e.g. 'us-east-1' AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh AWS_BUCKET = 'bucket-name' # Your bucket SOURCE_BUCKET = 'cellpainting-gallery' +WORKSPACE_BUCKET = 'bucket-name' # Likely your bucket DESTINATION_BUCKET = 'bucket-name' # Your bucket UPLOAD_FLAGS = '' ``` ### Read/Write to a collaborator's bucket -``` + +```python AWS_REGION = 'your-region' # e.g. 
'us-east-1' AWS_PROFILE = 'role-permissions' # A profile with the permissions setup described above SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh AWS_BUCKET = 'bucket-name' # Your bucket SOURCE_BUCKET = 'collaborator-bucket' +WORKSPACE_BUCKET = 'collaborator-bucket' DESTINATION_BUCKET = 'collaborator-bucket' -UPLOAD_FLAGS = '--acl bucket-owner-full-control --metadata-directive REPLACE' +UPLOAD_FLAGS = '--acl bucket-owner-full-control --metadata-directive REPLACE' # Examples of flags that may be necessary ``` ## Permissions setup + If you are reading from a public bucket, no additional setup is necessary. +Note that, depending on the configuration of that bucket, you may not be able to mount the public bucket so you will need to set `DOWNLOAD_FILES='True'`. -If you are reading from a non-public bucket or writing to a bucket, you wil need further permissions setup. +If you are reading from a non-public bucket or writing to a bucket that is not yours, you will need further permissions setup. Often, access to someone else's AWS account is handled through a role that can be assumed. Learn more about AWS IAM roles [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html). Your collaborator will define the access limits of the role within their AWS IAM. You will also need to define role limits within your AWS IAM so that when you assume the role (giving you access to your collaborator's resource), that role also has the appropriate permissions to run DCP. ### In your AWS account + In AWS IAM, for the role that has external bucket access, you will need to add all of the DCP permissions described in [Step 0](step_0_prep.md). -You will also need to edit the trust relationship for the role so that ECS and EC2 can assume the role. +You will also need to edit the trust relationship for the role so that ECS and EC2 can assume the role. 
A template is as follows: -``` + +```json { "Version": "2012-10-17", "Statement": [ @@ -80,6 +89,7 @@ A template is as follows: ``` ### In your DCP instance + DCP reads your AWS_PROFILE from your [control node](step_0_prep.md#the-control-node). Edit your AWS CLI configuration files for assuming that role in your control node as follows: @@ -95,4 +105,4 @@ In `~/.aws/credentials`, copy in the following text block at the bottom of the f [my-account-profile] aws_access_key_id = ACCESS_KEY -aws_secret_access_key = SECRET_ACCESS_KEY \ No newline at end of file +aws_secret_access_key = SECRET_ACCESS_KEY diff --git a/documentation/DCP-documentation/overview.md b/documentation/DCP-documentation/overview.md index 547a26f..fc9fd10 100644 --- a/documentation/DCP-documentation/overview.md +++ b/documentation/DCP-documentation/overview.md @@ -3,6 +3,7 @@ **How do I run CellProfiler on Amazon?** Use Distributed-CellProfiler! Distributed-CellProfiler is a series of scripts designed to help you run a Dockerized version of CellProfiler on [Amazon Web Services](https://aws.amazon.com/) (AWS) using AWS's file storage and computing systems. + * Data is stored in S3 buckets. * Software is run on "Spot Fleets" of computers (or instances) in the cloud. @@ -12,6 +13,7 @@ Docker is a software platform that packages software into containers. In a container is the software that you want to run as well as everything needed to run it (e.g. your software source code, operating system libraries, and dependencies). Dockerizing a workflow has many benefits including + * Ease of use: Dockerized software doesn't require the user to install anything themselves. * Reproducibility: You don't need to worry about results being affected by the version of your software or its dependencies being used as those are fixed. 
diff --git a/documentation/DCP-documentation/overview_2.md b/documentation/DCP-documentation/overview_2.md index 6f1c8a1..6d27842 100644 --- a/documentation/DCP-documentation/overview_2.md +++ b/documentation/DCP-documentation/overview_2.md @@ -1,4 +1,4 @@ -## What happens in AWS when I run Distributed-CellProfiler? +# What happens in AWS when I run Distributed-CellProfiler? The steps for actually running the Distributed-CellProfiler code are outlined in the repository [README](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/README.md), and details of the parameters you set in each step are on their respective Documentation pages ([Step 1: Config](step_1_configuration.md), [Step 2: Jobs](step_2_submit_jobs.md), [Step 3: Fleet](step_3_start_cluster.md), and optional [Step 4: Monitor](step_4_monitor.md)). We'll give an overview of what happens in AWS at each step here and explain what AWS does automatically once you have it set up. @@ -8,6 +8,7 @@ We'll give an overview of what happens in AWS at each step here and explain what **Step 1**: In the Config file you set quite a number of specifics that are used by EC2, ECS, SQS, and in making Dockers. When you run `$ python3 run.py setup` to execute the Config, it does three major things: + * Creates task definitions. These are found in ECS. They define the configuration of the Dockers and include the settings you gave for **CHECK_IF_DONE_BOOL**, **DOCKER_CORES**, **EXPECTED_NUMBER_FILES**, and **MEMORY**. @@ -25,6 +26,7 @@ In the Config file you set the number and size of the EC2 instances you want. This information, along with account-specific configuration in the Fleet file is used to start the fleet with `$ python3 run.py startCluster`. **After these steps are complete, a number of things happen automatically**: + * ECS puts Docker containers onto EC2 instances. If there is a mismatch within your Config file and the Docker is larger than the instance it will not be placed. 
ECS will keep placing Dockers onto an instance until it is full, so if you accidentally create instances that are too large you may end up with more Dockers placed on it than intended. @@ -59,6 +61,7 @@ Read more about this and other configurations in [Step 1: Configuration](step_1_ ## How do I determine my configuration? To some degree, you determine the best configuration for your needs through trial and error. + * Looking at the resources your software uses on your local computer when it runs your jobs can give you a sense of roughly how much hard drive and memory space each job requires, which can help you determine your group size and what machines to use. * Prices of different machine sizes fluctuate, so the choice of which type of machines to use in your spot fleet is best determined at the time you run it. How long a job takes to run and how quickly you need the data may also affect how much you're willing to bid for any given machine. @@ -67,12 +70,14 @@ However, you're also at a greater risk of running out of hard disk space. Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking. - ## What does this look like on AWS? +## What does this look like on AWS? + The following five are the primary resources that Distributed-CellProfiler interacts with. After you have finished [preparing for Distributed-CellProfiler](step_0_prep), you do not need to directly interact with any of these services outside of Distributed-CellProfiler. If you would like a granular view of what Distributed-CellProfiler is doing while it runs, you can open each console in a separate tab in your browser and watch their individual behaviors, though this is not necessary, especially if you run the [monitor command](step_4_monitor.md) and/or have DS automatically create a Dashboard for you (see [Configuration](step_1_configuration.md)). 
- * [S3 Console](https://console.aws.amazon.com/s3) - * [EC2 Console](https://console.aws.amazon.com/ec2/) - * [ECS Console](https://console.aws.amazon.com/ecs/) - * [SQS Console](https://console.aws.amazon.com/sqs/) - * [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/) \ No newline at end of file + +* [S3 Console](https://console.aws.amazon.com/s3) +* [EC2 Console](https://console.aws.amazon.com/ec2/) +* [ECS Console](https://console.aws.amazon.com/ecs/) +* [SQS Console](https://console.aws.amazon.com/sqs/) +* [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/) diff --git a/documentation/DCP-documentation/passing_files_to_DCP.md b/documentation/DCP-documentation/passing_files_to_DCP.md index a82b815..c4ae6a1 100644 --- a/documentation/DCP-documentation/passing_files_to_DCP.md +++ b/documentation/DCP-documentation/passing_files_to_DCP.md @@ -4,12 +4,13 @@ Distributed-CellProfiler can be told what files to use through LoadData.csv, Bat ## Metadata use in DCP -Distributed-CellProfiler requires metadata and grouping in order to split jobs. -This means that, unlikely a generic CellProfiler workflow, the inclusion of metadata and grouping are NOT optional for pipelines you wish to use in Distributed-CellProfiler. -- If using LoadData, this means ensuring that your input CSV has some metadata to use for grouping and "Group images by metdata?" is set to "Yes". -- If using batch files or file lists, this means ensuring that the Metadata and Groups modules are enabled, and that you are extracting metadata from file and folder names _that will also be present in your remote system_ in the Metadata module in your CellProfiler pipeline. -You can pass additional metadata to CellProfiler by `Add another extraction method`, setting the method to `Import from file` and setting Metadata file location to `Default Input Folder`. -Metadata of either type can be used for grouping. 
+Distributed-CellProfiler requires metadata and grouping in order to split jobs. +This means that, unlike a generic CellProfiler workflow, the inclusion of metadata and grouping are NOT optional for pipelines you wish to use in Distributed-CellProfiler. + +- If using LoadData, this means ensuring that your input CSV has some metadata to use for grouping and "Group images by metadata?" is set to "Yes". +- If using batch files or file lists, this means ensuring that the Metadata and Groups modules are enabled, and that you are extracting metadata from file and folder names _that will also be present in your remote system_ in the Metadata module in your CellProfiler pipeline. +You can pass additional metadata to CellProfiler by `Add another extraction method`, setting the method to `Import from file` and setting Metadata file location to `Default Input Folder`. +Metadata of either type can be used for grouping. ## Load Data @@ -25,14 +26,14 @@ Some users have reported issues with using relative paths in the PathName column You can create this CSV yourself via your favorite scripting language. We maintain a script for creating LoadData.csv from Phenix metadata XML files called [pe2loaddata](https://github.com/broadinstitute/pe2loaddata). -You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups. +You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups. More written and video information about using the input modules can be found [here](broad.io/CellProfilerInput). After loading in your images, use the `Export`->`Image Set Listing` command. You will then need to replace the local paths with the paths where the files can be found in S3 which is hardcoded to `/home/ubuntu/bucket`. 
If your files are nested in the same structure, this can be done with a simple find and replace in any text editing software. (e.g. Find '/Users/eweisbar/Desktop' and replace with '/home/ubuntu/bucket') -More detail: The [Dockerfile](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/worker/Dockerfile) is the first script to execute in the Docker. +More detail: The [Dockerfile](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/worker/Dockerfile) is the first script to execute in the Docker. It creates the `/home/ubuntu/` folder and then executes [run_worker.sh](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/worker/run-worker.sh) from that point. run_worker.sh makes `/home/ubuntu/bucket/` and uses S3FS to mount your S3 bucket at that location. (If you set `DOWNLOAD_FILES='True'` in your [config](step_1_configuration.md), then the S3FS mount is bypassed but files are downloaded locally to the `/home/ubuntu/bucket` path so that the paths are the same as if it was S3FS mounted.) @@ -53,7 +54,7 @@ To use a batch file, your data needs to have the same structure in the cloud as ### Creating batch files -To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups. +To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups. More written and video information about using the input modules can be found [here](broad.io/CellProfilerInput). Put the `CreateBatchFiles` module at the end of your pipeline and ensure that it is selected. Add a path mapping and edit the `Local root path` and `Cluster root path`. 
@@ -71,8 +72,8 @@ Note that if you do not follow our standard file organization, under **#not proj ## File lists -You can also simply pass a list of absolute file paths (not relative paths) with one file per row in `.txt` format. -These must be the absolute paths that Distributed-CellProfiler will see, aka relative to the root of your bucket (which will be mounted as `/bucket`. +You can also simply pass a list of absolute file paths (not relative paths) with one file per row in `.txt` format. +These must be the absolute paths that Distributed-CellProfiler will see, aka relative to the root of your bucket (which will be mounted as `/bucket`). ### Creating File Lists diff --git a/documentation/DCP-documentation/step_0_prep.md b/documentation/DCP-documentation/step_0_prep.md index 378603c..29b218e 100644 --- a/documentation/DCP-documentation/step_0_prep.md +++ b/documentation/DCP-documentation/step_0_prep.md @@ -1,22 +1,26 @@ # Step 0: Prep + There are two classes of AWS resources that Distributed-CellProfiler interacts with: 1) infrastructure that is made once per AWS account to enable any Distributed-CellProfiler implementation to run and 2) infrastructure that is made and destroyed with every run. -This section describes the creation of the first class of AWS infrastructure and only needs to be followed once per account. +This section describes the creation of the first class of AWS infrastructure and only needs to be followed once per account. ## AWS Configuration + The AWS resources involved in running Distributed-CellProfiler are configured using the [AWS Web Console](https://aws.amazon.com/console/) and a setup script we provide ([setup_AWS.py](../../setup_AWS.py)). -You need an active AWS account configured to proceed. +You need an active AWS account configured to proceed. 
Login into your AWS account, and make sure the following list of resources is created: ### 1.1 Manually created resources + * **Security Credentials**: Get [security credentials](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) for your account. Store your credentials in a safe place that you can access later. * **SSH Key**: You will probably need an ssh key to login into your EC2 instances (control or worker nodes). [Generate an SSH key](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) and store it in a safe place for later use. If you'd rather, you can generate a new key pair to use for this during creation of the control node; make sure to `chmod 600` the private key when you download it. -* **SSH Connection**: You can use your default AWS account VPC, subnet, and security groups. +* **SSH Connection**: You can use your default AWS account VPC, subnet, and security groups. You should add an inbound SSH connection from your IP address to your security group. ### 1.2 Automatically created resources + * BEFORE running setup_AWS, you need to open `lambda_function.py` and edit the `BUCKET_NAME` (keeping the quotes around the name) at the top of the file to be the name of your bucket. After editing, Line 12 of `lambda_function.py` should look like `bucket = "my-bucket-name"`. * Run setup_AWS by entering `python setup_AWS.py` from your command line. @@ -29,15 +33,19 @@ It will automatically create: * a Monitor lambda function that is used for auto-monitoring of your runs (see [Step 4: Monitor](step_4_monitor.md) for more information). ### 1.3 Auxiliary Resources + *You can certainly configure Distributed-CellProfiler for use without S3, but most DS implementations use S3 for storage.* + * [Create an S3 bucket](http://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) and upload your data to it. 
Add permissions to your bucket so that [logs can be exported to it](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html) (Step 3, first code block). ### 1.4 Increase Spot Limits + AWS initially [limits the number of spot instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html) you can use at one time; you can request more through a process in the linked documentation. Depending on your workflow (your scale and how you group your jobs), this may not be necessary. ## The Control Node + The control node is a machine that is used for running the Distributed-CellProfiler scripts. It can be your local machine, if it is configured properly, or it can also be a small instance in AWS. We prefer to have a small EC2 instance dedicated to controlling our Distributed-CellProfiler workflows for simplicity of access and configuration. @@ -50,12 +58,15 @@ The control node needs the following tools to successfully run Distributed-CellP These instructions assume you are using the command line in a Linux machine, but you are free to try other operating systems too. ### Create Control Node from Scratch + #### 2.1 Install Python 3.8 or higher and pip + Most scripts are written in Python and support Python 3.8 and 3.9. Follow installation instructions for your platform to install Python. pip should be included with the installation of Python 3.8 or 3.9, but if you do not have it installed, install pip. #### 2.2 Clone this repository and install requirements + You will need the scripts in Distributed-CellProfiler locally available in your control node.
     sudo apt-get install git
@@ -68,6 +79,7 @@ You will need the scripts in Distributed-CellProfiler locally available in your
 
#### 2.3 Install AWS CLI + The command line interface is the main mode of interaction between the local node and the resources in AWS. You need to install [awscli](http://docs.aws.amazon.com/cli/latest/userguide/installing.html) for Distributed-CellProfiler to work properly: @@ -81,22 +93,27 @@ When running the last step (`aws configure`), you will need to enter your AWS cr Make sure to set the region correctly (i.e. us-west-1 or eu-east-1, not eu-west-2a), and set the default file type to json. #### 2.1.4 s3fs-fuse (optional) + [s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse) allows you to mount your s3 bucket as a pseudo-file system. It does not have all the performance of a real file system, but allows you to easily access all the files in your s3 bucket. Follow the instructions at the link to mount your bucket. ### Create Control Node from AMI (optional) + Once you've set up the other software (and gotten a job running, so you know everything is set up correctly), you can use Amazon's web console to set this up as an Amazon Machine Instance, or AMI, to replicate the current state of the hard drive. Create future control nodes using this AMI so that you don't need to repeat the above installation. ## Removing long-term infrastructure + If you decide that you never want to run Distributed-CellProfiler again and would like to remove the long-term infrastructure, follow these steps. ### Remove Roles, Lambda Monitor, and Monitor SNS +
 python setup_AWS.py destroy
 
### Remove EC2 Control node -If you made your control node as an EC2 instance, while in the AWS console, select the instance. -Select `Instance state` => `Terminate instance`. \ No newline at end of file + +If you made your control node as an EC2 instance, while in the AWS console, select the instance. +Select `Instance state` => `Terminate instance`. diff --git a/documentation/DCP-documentation/step_3_start_cluster.md b/documentation/DCP-documentation/step_3_start_cluster.md index 2e2d9b2..77f269a 100644 --- a/documentation/DCP-documentation/step_3_start_cluster.md +++ b/documentation/DCP-documentation/step_3_start_cluster.md @@ -18,7 +18,9 @@ Once the spot fleet is ready: Your job will begin shortly! *** + ## Configuring your spot fleet request + Definition of many of these terms and explanations of many of the individual configuration parameters of spot fleets are covered in AWS documentation [here](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html) and [here](http://docs.aws.amazon.com/cli/latest/reference/ec2/request-spot-fleet.html). You may also configure your spot fleet request through Amazon's web interface and simply download the JSON file at the "review" page to generate the configuration file you want, though we do not recommend this as Distributed-CellProfiler assumes a certain fleet request structure and has only been tested on certain Amazon AMI's. Looking at the output of this automatically generated spot fleet request can be useful though for obtaining values like your VPC's subnet and security groups, as well the ARN ID's of your roles. @@ -36,6 +38,7 @@ If there is no template fleet file for your region, or the one here is too out-o If you have a good working configuration for a region that isn't represented or for a more up-to-date version of the AMI than we've had time to test, please feel free to create a pull request and we'll include it in the repo! 
## Parameters that must be configured in the spot fleet in DCP 1 (but not current versions) + These parameters were present in the spot fleet request in first version of DCP but not subsequent versions. We provide the information here because we have not officially deprecated DCP 1, however we strongly encourage you to use a more updated version. @@ -49,7 +52,7 @@ AWS has a handy [price history tracker](https://console.aws.amazon.com/ec2sp/v1/ If your jobs complete quickly and/or you don't need the data immediately you can reduce your bid accordingly. Jobs that may take many hours to finish or that you need results from immediately may justify a higher bid. -## To run in a region where a spot fleet config isn't available or is out of date: +## To run in a region where a spot fleet config isn't available or is out of date * Under EC2 -> Instances select "Launch Instance" diff --git a/documentation/DCP-documentation/step_4_monitor.md b/documentation/DCP-documentation/step_4_monitor.md index fcd78e2..fa154a9 100644 --- a/documentation/DCP-documentation/step_4_monitor.md +++ b/documentation/DCP-documentation/step_4_monitor.md @@ -2,6 +2,7 @@ Your workflow is now submitted. Distributed-CellProfiler will keep an eye on a few things for you at this point without you having to do anything else. + * Each instance is labeled with your APP_NAME, so that you can easily find your instances if you want to look at the instance metrics on the Running Instances section of the [EC2 web interface](https://console.aws.amazon.com/ec2/v2/home) to monitor performance. * You can also look at the whole-cluster CPU and memory usage statistics related to your APP_NAME in the [ECS web interface](https://console.aws.amazon.com/ecs/home). * Each instance will have an alarm placed on it so that if CPU usage dips below 1% for 15 consecutive minutes (almost always the result of a crashed machine), the instance will be automatically terminated and a new one will take its place. 
@@ -9,28 +10,32 @@ Distributed-CellProfiler will keep an eye on a few things for you at this point If you choose to run the Monitor script, Distributed-CellProfiler can be even more helpful. -## Running Monitor +## Running Monitor ### Manually running Monitor + Monitor can be run by entering `python run.py monitor files/APP_NAMESpotFleetRequestId.json`. While the optimal time to initiate Monitor is as soon as you have triggered a run as it downscales infrastructure as necessary, you can run Monitor at any point in time and it will clean up whatever infrastructure remains. **Note:** You should run the monitor inside [Screen](https://www.gnu.org/software/screen/), [tmux](https://tmux.github.io/), or another comparable service to keep a network disconnection from killing your monitor; this is particularly critical the longer your run takes. ### Using Auto-Monitor -Instead of manually triggering Monitor, you can have a version of Monitor automatically initiate after you [start your cluster](step_3_start_cluster.md) by setting `AUTO_MONITOR = 'True'` in your [config file](step_1_configuration.md). -Auto-Monitor is an AWS Lambda function that is triggered by alarms placed on the SQS queue. + +Instead of manually triggering Monitor, you can have a version of Monitor automatically initiate after you [start your cluster](step_3_start_cluster.md) by setting `AUTO_MONITOR = 'True'` in your [config file](step_1_configuration.md). +Auto-Monitor is an AWS Lambda function that is triggered by alarms placed on the SQS queue. Read more about the [SQS Queue](SQS_QUEUE_information.md) to better understand the alarm metrics. ## Monitor functions ### While your analysis is running + * Scales down the spot fleet request to match the number of remaining jobs WITHOUT force terminating them. This happens every 10 minutes with manual Monitor or when the are no Visible Messages in your queue for Auto-Monitor. 
* Deletes the alarms for any instances that have been terminated in the last 24 hours (because of spot prices rising above your maximum bid, machine crashes, etc). This happens every hour with manual Monitor or when the are no Visible Messages in your queue for Auto-Monitor. ### When your queue is totally empty (there are no Visible or Not Visible messages) + * Downscales the ECS service associated with your APP_NAME. * Deletes all the alarms associated with your spot fleet (both the currently running and the previously terminated instances). * Shuts down your spot fleet to keep you from incurring charges after your analysis is over. @@ -43,8 +48,8 @@ This happens every hour with manual Monitor or when the are no Visible Messages If you are manually triggering Monitor, you can run the monitor in an optional "cheapest" mode, which will downscale the number of requested machines (but not RUNNING machines) to one machine 15 minutes after the monitor is engaged. You can engage cheapest mode by adding `True` as a final configurable parameter when starting the monitor, aka `python run.py monitor files/APP_NAMESpotFleetRequestId.json True` -Cheapest mode is cheapest because it will remove all but 1 machine as soon as that machine crashes and/or runs out of jobs to do; this can save you money, particularly in multi-CPU Dockers running long jobs. -This mode is optional because running this way involves some inherent risks. +Cheapest mode is cheapest because it will remove all but 1 machine as soon as that machine crashes and/or runs out of jobs to do; this can save you money, particularly in multi-CPU Dockers running long jobs. +This mode is optional because running this way involves some inherent risks. If machines stall out due to processing errors, they will not be replaced, meaning your job will take overall longer. Additionally, if there is limited capacity for your requested configuration when you first start (e.g. 
you want 200 machines but AWS says it can currently only allocate you 50), more machines will not be added if and when they become available in cheapest mode as they would in normal mode. @@ -55,7 +60,7 @@ Additionally, if there is limited capacity for your requested configuration when The JSON monitor file containing all the information Distributed-CellProfiler needs will have been automatically created when you sent the instructions to start your cluster in the [previous step](step_3_start_cluster). The file itself is quite simple and contains the following information: -``` +```json {"MONITOR_FLEET_ID" : "sfr-9999ef99-99fc-9d9d-9999-9999999e99ab", "MONITOR_APP_NAME" : "2021_12_13_Project_Analysis", "MONITOR_ECS_CLUSTER" : "default", @@ -66,4 +71,4 @@ The file itself is quite simple and contains the following information: ``` For any Distributed-CellProfiler run where you have run [`startCluster`](step_3_start_cluster) more than once, the most recent values will overwrite the older values in the monitor file. -Therefore, if you have started multiple spot fleets (which you might do in different subnets if you are having trouble getting enough capacity in your spot fleet, for example), Monitor will only clean up the latest request unless you manually edit the `MONITOR_FLEET_ID` to match the spot fleet you have kept. \ No newline at end of file +Therefore, if you have started multiple spot fleets (which you might do in different subnets if you are having trouble getting enough capacity in your spot fleet, for example), Monitor will only clean up the latest request unless you manually edit the `MONITOR_FLEET_ID` to match the spot fleet you have kept. 
diff --git a/documentation/DCP-documentation/troubleshooting_runs.md b/documentation/DCP-documentation/troubleshooting_runs.md index 6c9d287..322e858 100644 --- a/documentation/DCP-documentation/troubleshooting_runs.md +++ b/documentation/DCP-documentation/troubleshooting_runs.md @@ -29,6 +29,7 @@ Services/behaviors that are as expected and/or not relevant for diagnosing a pro | Jobs moving to dead messages | "CP PROBLEM: Done file reports failure." | No files are output to S3 | N/A | Something went wrong in your CellProfiler pipeline. | Read the logs above the CP PROBLEM message to see what the specific CellProfiler error is and fix that error in your pipeline. | Further hints: + - The SSH_KEY_NAME in the config.py file contains the name of the key pair used to access AWS. This field is the name of the file with the .pem extension (SSH_KEY_NAME = "MyKeyPair.pem"). The same name is used in the fleet configuration file (e.g. exampleFleet.json) but without using the .pem extension ("KeyName": "MyKeyPair"). diff --git a/documentation/DCP-documentation/versions.md b/documentation/DCP-documentation/versions.md index 3755e75..d0f7066 100644 --- a/documentation/DCP-documentation/versions.md +++ b/documentation/DCP-documentation/versions.md @@ -1,25 +1,38 @@ # Versions The most current release can always be found on DockerHub at `cellprofiler/distributed-cellprofiler`. -Current version is 2.0.0. -Our current tag system is dcpversion_CellProfilerversion, e.g. 1.2.0_3.1.8 +Current version is 2.2.0. +Our current tag system is DCPversion_CellProfilerversion, e.g. 
1.2.0_3.1.8 Previous release versions can be accessed at `bethcimini/distributed-cellprofiler:versionnumber` --- -# Version History +## Version History -## 2.1.0 (forthcoming) -* Increase support for using plugins -* Improve role handling and allow file download/upload from/to S3 buckets in different accounts -* Improve file download to S3FS can be completely bypassed -* Name EBS volumes +### 2.2.0 - Released 20241105 + +* run_batch_general overhauled to be a CLI tool with support for Cell Painting Gallery structure +* Support for AWS IAM assumed roles +* Improved handling of CellProfiler-plugins and update to current CellProfiler-plugins organization +* Adds WORKSPACE_BUCKET to the config so that image files and metadata files can be read off different buckets +* Adds JOB_RETRIES to the config so that the number of retries before sending a job to DeadMessages is configurable +* Adds ALWAYS_CONTINUE to the config so that the flag can be passed to CellProfiler +* Adds ASSIGN_IP to the config and defaults to false so that EC2 spot fleet instances do not automatically get assigned a private IP address + +### 2.1.0 - Released 20230518 + +* Addition of setup_AWS.py to automate AWS infrastructure setup +* Addition of optional auto-monitor +* Addition of auto-dashboard creation +* Addition of auto-Deadletter queue creation +* Improved handling of AWS credentials + +### 2.0.0rc2 - Released 20201110 -## 2.0.0rc2 - Released 20201110 * Add optional ability to download files to EBS rather than reading from S3 (helpful for pipelines that access many files/image sets) -## 2.0.0rc1 - Released 20201105 +### 2.0.0rc1 - Released 20201105 * Remove requirement for boto and fabric, using only boto3 * Add support for Python 3 and CellProfiler 4 @@ -28,28 +41,28 @@ Previous release versions can be accessed at `bethcimini/distributed-cellprofile * Don't cancel a fleet over capacity errors * Add "cheapest" mode to the monitor, allowing you to run more cheaply (at possible expense of 
running more slowly) -## 1.2.2 - Released 20201103 +### 1.2.2 - Released 20201103 * Allow pipelines using batch files to also designate an input output_top_directory * Add support for multiple LaunchData specifications * Add CellProfiler-plugins * Additional way to create job submissions with run_batch_general.py -## 1.2.1 - Released 20200109, Updated through 20191002 +### 1.2.1 - Released 20200109, Updated through 20191002 * Allow monitor to downscale machines when number of jobs < number of machines * Add a parameter to discount files when running CHECK_IF_DONE checks if less than a certain size -## 1.2.0 - Released 20181108, Updated through 20200109 +### 1.2.0 - Released 20181108, Updated through 20200109 * Improved compatibility with CellProfiler 2 and 3 * Better handling of logging when using output_structure -## 1.1.0 - Released 20170217, Updated 20170221 (bugfixes) - 20181018 +### 1.1.0 - Released 20170217, Updated 20170221 (bugfixes) - 20181018 * Changes in this release: - * Added the `output_structure` variable to the job file, which allows you to structure the output folders created by DCP (ie `Plate/Well-Site` rather than `Plate-Well-Site`). Job files lacking this variable will still default to the previous settings (hyphenating all Metadata items in order they are presented in the Metadata grouping). - * Added support for creating the list of groups via `cellprofiler --print-groups`- see [this issue](https://github.com/CellProfiler/Distributed-CellProfiler/issues/52) for example and discussion. Groups listed in this way MUST use the `output_structure` variable to state their desired folder structure or an error will be returned. + * Added the `output_structure` variable to the job file, which allows you to structure the output folders created by DCP (ie `Plate/Well-Site` rather than `Plate-Well-Site`). 
Job files lacking this variable will still default to the previous settings (hyphenating all Metadata items in order they are presented in the Metadata grouping). + * Added support for creating the list of groups via `cellprofiler --print-groups`- see [this issue](https://github.com/CellProfiler/Distributed-CellProfiler/issues/52) for example and discussion. Groups listed in this way MUST use the `output_structure` variable to state their desired folder structure or an error will be returned. -## 1.0.0 - Version as of 20170213 +### 1.0.0 - Version as of 20170213