Rosetta modeling of cryo-EM data on the cloud

This is a quick and easy guide to get started with atomic model building into EM maps using Rosetta. All the required files and scripts are in the rosetta_scripts directory in the cryoem-cloud-tools folder.

This is by no means an exhaustive protocol for using Rosetta. For a more detailed description of Rosetta, please see the following web pages:

If you find Rosetta on AWS useful please cite the following papers:

Part I: Getting set up for Rosetta

In general, Rosetta expects the following inputs:

  • Cryo-EM map with a resolution of 2 - 10 Angstroms
    • Note: Rosetta runs faster on smaller 3D volumes, so you can speed up your calculations by including only the voxels that contain protein density. For example, the original Beta-Galactosidase map was 384x384x384 voxels, but we were able to window it down to 334x249x222 voxels.
  • FASTA file of amino acid sequence for protein to be modeled
  • PDB coordinate file(s) for all chains to be modeled
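To put the windowing note above in numbers, here is a small worked calculation using the two Beta-Galactosidase box sizes quoted there. It only compares voxel counts. How Rosetta's run time scales with voxel count is not stated in this guide, so treat the ratio as a rough indication rather than a predicted speed-up.

```python
from math import prod

# Box sizes quoted in the note above (voxels along each axis)
original_box = (384, 384, 384)
windowed_box = (334, 249, 222)

n_original = prod(original_box)   # total voxels before windowing
n_windowed = prod(windowed_box)   # total voxels after windowing
fraction_kept = n_windowed / n_original

print(f"original: {n_original:,} voxels")
print(f"windowed: {n_windowed:,} voxels")
print(f"fraction kept: {fraction_kept:.2f}")
```

The windowed map keeps roughly a third of the original voxels.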

The modeling workflow below will NOT work for the following complexes:

  • Nucleic acid complexes: If DNA/RNA is present, the nucleic acid atomic positions will not be refined.

Obtain Amazon Machine Image ID for Rosetta on AWS

Before getting started, you need to request access to use the Amazon Machine Image (AMI) for Rosetta on AWS. NOTE: We can only offer access to academic researchers, so you can only gain access to this AMI if you have .edu in your email address. Furthermore, Rosetta's license applies to use and distribution of Rosetta software.

To obtain the AMI ID number for this software environment, using your academic (.edu) email address, email the following:

FASTA formatted amino acid sequence file

Have all the sequences in FASTA format. We need a FASTA entry for each chain that we want to build in the map.

Important notes regarding formatting of FASTA files for Rosetta:

  • The FASTA file should include chains separated by '/', where the next chain starts on the next line.
  • All disordered regions need to be removed. If a disordered region is at the beginning or end of a chain, simply delete it. If it is in the middle of a chain, remove the region and replace it with '/'.
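The two formatting rules above can be sketched in Python as follows. This is a minimal illustration, not part of cryoem-cloud-tools: the function names, the 1-based inclusive residue ranges, and the single header line are our assumptions, and you need to supply the disordered ranges yourself.

```python
def strip_disordered(seq, disordered):
    """Remove disordered regions from one chain.

    `disordered` is a list of (start, end) residue ranges, 1-based and
    inclusive (an assumption of this sketch). Regions at either end of
    the chain are deleted; internal regions are replaced by '/'.
    """
    pieces = []
    pos = 0  # 0-based index of the next residue to keep
    for start, end in sorted(disordered):
        pieces.append(seq[pos:start - 1])
        pos = max(pos, end)
    pieces.append(seq[pos:])
    # Empty pieces come from terminal regions; dropping them deletes the region
    return "/".join(p for p in pieces if p)


def rosetta_fasta(name, chains):
    """Join chains with '/', starting each new chain on the next line."""
    return ">" + name + "\n" + "/\n".join(chains) + "\n"


# Hypothetical two-chain example: residues 1-2 and 5-6 of chain 1 are disordered
chain1 = strip_disordered("MKTAYIAKQR", [(1, 2), (5, 6)])
chain2 = strip_disordered("GSHMLEDP", [])
print(rosetta_fasta("my_complex", [chain1, chain2]))
```

Compare the result with the example files for this section before using it, since the exact layout Rosetta expects is only described in words here.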

Example files:

Run HHpred sequence alignment

Use HHpred to find homologous sequences that align to your PDB chains:

  1. Navigate to
  2. Input the sequence of your protein (FASTA format) to be modeled by Rosetta and submit to the server with default parameters.
  3. When the job finishes, select 'Save' on the summary results webpage, which is located directly above the aligned sequences (Click here to see where 'Save' is located on summary screen). This will save a file in .hhr format.
  4. Open this .hhr file using a text editor (e.g. TextEdit on Mac) and select the top 5 alignments. This number is approximate and can be fewer than 5, but more than one sequence alignment is required.
  5. Save this file with a new name to indicate that it is the selected-alignment .hhr file.
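If you prefer not to trim the .hhr file by hand, steps 4 and 5 can be approximated with a short script. This is a sketch under assumptions: it relies on the usual HHpred result layout (a summary table introduced by a "No Hit" header line, followed by alignment blocks that each begin with a line such as "No 1"), and the function name is ours. Check the trimmed file in a text editor before using it.

```python
import re


def keep_top_hits(hhr_text, n=5):
    """Return the .hhr text trimmed to the first `n` hits.

    Assumes the summary table starts with a 'No Hit' header line and ends
    at the next blank line, and that alignment blocks start with 'No <rank>'.
    """
    out = []
    in_table = False
    for line in hhr_text.splitlines():
        block = re.match(r"No (\d+)\s*$", line)
        if block and int(block.group(1)) > n:
            break  # alignment blocks are listed in rank order, so stop here
        if line.lstrip().startswith("No Hit"):
            in_table = True
            out.append(line)
            continue
        if in_table:
            if not line.strip():
                in_table = False          # blank line closes the summary table
            elif int(line.split()[0]) > n:
                continue                  # drop summary rows beyond the top n
        out.append(line)
    return "\n".join(out) + "\n"


# Example usage with hypothetical file names:
# with open("results.hhr") as fh:
#     trimmed = keep_top_hits(fh.read(), n=5)
# with open("results_selected.hhr", "w") as fh:
#     fh.write(trimmed)
```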

Part II: Format HHpred sequence alignment top hits for Rosetta

Running time: minutes

We now need to format the .hhr file (with selected alignments) for Rosetta.

To do this, run

$ /path/to/cryoem-cloud-tools/rosetta_scripts/ --fasta=[fasta].fasta --hhr=[hhrfile].hhr --AMI=[rosetta AMI] 

This will output the following:

Formatting input .hrr and .fasta files to create PDB files for docking into density...

...booting up instance to format input files...

...running Rosetta file preparation on t2.micro instance...

...finished with file preparation, shutting down instance


This script will boot up a single instance on AWS to format the .hhr file for Rosetta. It trims any sidechains not found in the sequence alignment and changes any residue that differs between the reference models and the experimental sequences.

The input files for this script are the FASTA file and .hhr file from above.

The output PDB files are named <pdb name>_20*, where 2 denotes the alignment type that was used for sequence alignment and * is the chain number of the sequence in the FASTA input file.

If you have a model with multiple chains:

  • Make one large PDB file containing a model for each of the chains that we want to build. The ordering of the chains in the file should match the order of the chains in the large FASTA file (step 4). This will have a weight of 1 in hybridize.xml (step 7). Once this PDB file is made, go through it manually, remove the inappropriate “TER” records (i.e. a TER placed where there is no chain end), and replace “END” with “TER”. Once that is done run Other pdb models can be included with either
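The manual TER/END cleanup described above could be scripted along these lines. This is a minimal sketch, not a tool shipped with cryoem-cloud-tools. It assumes standard fixed-column PDB records and that every chain in the merged file has its own chain ID (column 22), so a TER is treated as misplaced when the atoms before and after it share a chain ID. The function name is ours.

```python
ATOM_RECORDS = ("ATOM", "HETATM")


def clean_ter_records(pdb_text):
    """Replace END with TER and drop TER records that do not end a chain."""
    # An END record left over from concatenation becomes a TER
    lines = ["TER" if l.strip() == "END" else l for l in pdb_text.splitlines()]
    out = []
    for i, line in enumerate(lines):
        if line.startswith("TER"):
            prev_atom = next(
                (l for l in reversed(out) if l.startswith(ATOM_RECORDS)), None)
            next_atom = next(
                (l for l in lines[i + 1:] if l.startswith(ATOM_RECORDS)), None)
            if prev_atom is None:
                continue  # nothing to terminate yet
            if next_atom is not None and prev_atom[21] == next_atom[21]:
                continue  # same chain continues, so this TER is misplaced
            if out and out[-1].startswith("TER"):
                continue  # avoid two TER records in a row
        out.append(line)
    return "\n".join(out) + "\n"
```

If your chains reuse the same chain ID, this heuristic will remove genuine chain ends, so in that case edit the file by hand as described above.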

Format output PDB files into list format

Make a text file (.txt extension) that contains a list of all PDB input files and their respective weights. Weight 1 models will be sampled during the initial stages of refinement, whereas weight 0 models will be sampled once the initial structure is built. Note: All weight 1 models need to have the same number of chains as the large FASTA file while the weight 0 models can be fragments of the entire PDB.

Example file:








Part III: Running RosettaCM

Running time: Hours

Now we are ready to run CM. This routine adjusts the gross position of the backbone into the EM map:

$ /path/to/cryoem-cloud-tools/rosetta_scripts/ --em_map=[3Dvolume].mrc --fasta=[fasta].fasta --AMI=[rosetta AMI] --pdb_list=[pdb_list].txt

  • Output directory: If no output directory is specified, it will create an automatically named directory.

While the .fasta sequence file and 3D map are required, an important input for this step is the PDB list generated in Part II. These PDB files have been prepared specifically and cannot be created in any other way. Therefore, if you don't have these at this point, go back to Part II and generate them.

When running this command, you will see the following output to your terminal: 

Starting Rosetta model refinement in the cloud ...

Starting Rosetta job on 8 x c4.8xlarge virtual machines on AWS in region us-east-2a (initialization will take a few minutes)

...uploading files to AWS ...

Rosetta job submitted on AWS! Monitor output file: 2017-09-20-123836-Rosetta-CM/rosetta.out to check status of job

At this point, you can follow the output log file to monitor the status of the Rosetta job. The file is updated every 5 minutes, and the output PDB files are ultimately returned to the output directory. The output file will contain information like the following:

Rosetta model refinement started at 2017-09-20T12:48:00UTC

Checking job completion status (updates every 5 minutes)

Rosetta refinements typically take 1 - 6 hours

...running...   (2017-09-20T12:59:00UTC)

...running...   (2017-09-20T13:04:00UTC)

...running...   (2017-09-20T13:09:00UTC)

...running...   (2017-09-20T13:14:00UTC)

Job finished on #1 at 2017-09-20T13:19:00UTC


Rosetta refinement finished. Shutting down instances at 2017-09-20T13:38:00UTC
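Instead of re-opening the log by hand, you could poll it with a few lines of Python. This is a sketch only: it assumes the log ends with the "Rosetta refinement finished" line shown above, and the function names and the `max_polls` parameter are ours. The default interval matches the 5-minute update cycle of the log.

```python
import time
from pathlib import Path

# Text taken from the final line of the example log above
DONE_MARKER = "Rosetta refinement finished"


def is_finished(log_text):
    """True once the completion line has been written to the log."""
    return DONE_MARKER in log_text


def wait_for_rosetta(log_path, poll_seconds=300, max_polls=None):
    """Poll the log until the job finishes; return False if we give up first."""
    path = Path(log_path)
    polls = 0
    while max_polls is None or polls < max_polls:
        if path.exists() and is_finished(path.read_text()):
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

For the run shown above, the call would be `wait_for_rosetta("2017-09-20-123836-Rosetta-CM/rosetta.out")`.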

Output files

When finished, all generated PDB files will be found in the folder:

[output directory]/output/

Expert-level option: Modeling DNA (or any other kind of 'ligand')

Rosetta treats nucleic acid like a ligand, so RosettaCM (see above) does not really change the backbone conformation of the DNA/RNA. However, to model protein-DNA contacts properly, the nucleic acids need to be included while running RosettaCM.

  1. Add "add_hetatm=1" to hybridize.xml (we might just have this flag for all runs).
  2. Have the DNA in the weight 1 models, with the DNA chain starting after the protein chains.
  3. Do not include the DNA in the FASTA file.
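The chain-ordering requirement in step 2 can be checked before launching a run. This is a rough sketch with names of our own choosing. It assumes standard fixed-column PDB records and classifies a chain as nucleic acid when its first residue is one of the common DNA/RNA residue names. Everything else, including other ligands, is counted as protein, which is a simplification.

```python
# Common DNA and RNA residue names in PDB files
NUCLEIC = {"DA", "DC", "DG", "DT", "A", "C", "G", "U"}


def chain_order(pdb_text):
    """Return [(chain_id, kind)] in file order; kind is 'protein' or 'nucleic'."""
    chains = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain_id = line[21]
        if not chains or chains[-1][0] != chain_id:
            kind = "nucleic" if line[17:20].strip() in NUCLEIC else "protein"
            chains.append((chain_id, kind))
    return chains


def dna_after_protein(pdb_text):
    """True if no protein chain appears after the first nucleic acid chain."""
    kinds = [kind for _, kind in chain_order(pdb_text)]
    first = kinds.index("nucleic") if "nucleic" in kinds else len(kinds)
    return all(kind == "nucleic" for kind in kinds[first:])
```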

Expert-level option: How to deal with poor backbone placement

  1. Segment that region from the EM map (e.g. using UCSF Chimera)
  2. Make a separate input FASTA and pdb for that region
  3. Run RosettaCM for that small region, using only the segmented map and the shortened amino acid sequence.

Expert-level option: De novo chain building

Rosetta can build a short stretch of amino acids de novo. To use this feature, simply delete that region from the input PDBs and keep it in the input FASTA. Note: De novo building works best for small regions. When running CM on a large system (>800 residues), try not to build more than 8 amino acids de novo at a stretch.
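The rule of thumb above can be turned into a quick pre-flight check. This sketch assumes that residue numbers in your input PDBs correspond to positions in the FASTA sequence, which you should verify for your own files. The function names are ours, and the thresholds simply restate the numbers given in the note.

```python
def missing_stretches(n_residues, modeled):
    """Return (start, end) runs, 1-based inclusive, of residues not in `modeled`."""
    runs = []
    start = None
    for i in range(1, n_residues + 1):
        if i not in modeled:
            if start is None:
                start = i          # a new missing stretch begins here
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, n_residues))
    return runs


def too_long_for_de_novo(n_residues, modeled, max_len=8, large_system=800):
    """For systems above `large_system` residues, list stretches longer than `max_len`."""
    if n_residues <= large_system:
        return []
    return [(s, e) for s, e in missing_stretches(n_residues, modeled)
            if e - s + 1 > max_len]


# Hypothetical 900-residue system missing residues 100-109 and 200-204
modeled = set(range(1, 901)) - set(range(100, 110)) - set(range(200, 205))
print(too_long_for_de_novo(900, modeled))
```

Any stretch the check reports is a candidate for the segmented-map approach described earlier, or for adding a template that covers the region.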