Rosetta modeling of cryo-EM data on the cloud

This is a quick and easy guide to get started with atomic model building into EM maps using Rosetta. All the required files and scripts are in the rosetta_scripts directory in the cryoem-cloud-tools folder.

This is by no means an exhaustive protocol for using Rosetta. For a more detailed description of Rosetta, please see the following web pages:

If you find Rosetta on AWS useful please cite the following papers:


Part I: Getting set up for Rosetta

In general, Rosetta is expecting the following inputs:

  • Cryo-EM map with a resolution of 2 - 10 Angstroms
    • Note: Rosetta runs faster on smaller 3D volumes. Therefore, you can speed up your calculations by only including voxels that have protein density. For example, for Beta-Galactosidase the original size was 384x384x384 but we were able to window it smaller to 334x249x222 pixels.
  • FASTA file of amino acid sequence for protein to be modeled
  • PDB coordinate file(s) for all chains to be modeled
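The windowing mentioned in the note above can be sketched with NumPy as follows. This is a minimal illustration and not part of cryoem-cloud-tools: the function name crop_to_density, the threshold, and the padding are assumptions, and reading or writing the MRC file (and keeping its origin consistent with any docked models) is left out.

```python
import numpy as np


def crop_to_density(volume, threshold, padding=0):
    """Return the smallest box that contains every voxel above `threshold`.

    `padding` adds that many voxels on each side, clipped to the volume edges.
    """
    mask = volume > threshold
    if not mask.any():
        raise ValueError("no voxels above threshold")
    coords = np.argwhere(mask)  # one (z, y, x) row per voxel above threshold
    lo = np.maximum(coords.min(axis=0) - padding, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + padding, volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]


# Synthetic 20x20x20 volume with a small block of "density"
vol = np.zeros((20, 20, 20), dtype=np.float32)
vol[5:9, 6:12, 7:10] = 1.0
cropped = crop_to_density(vol, threshold=0.5, padding=1)
print(cropped.shape)  # (6, 8, 5)
```

In practice the threshold would be chosen by inspecting the map (for example, the contour level used for viewing), so that only protein density is kept.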

The modeling workflow below will NOT work for the following complexes:

  • Nucleic acid complexes: If DNA/RNA present, the nucleic acid atomic positions will not be refined.

Obtain Amazon Machine Image ID for Rosetta on AWS

Before getting started, you need to request access to use the Amazon Machine Image (AMI) for Rosetta on AWS. NOTE: We can only offer access to academic researchers, so you can only gain access to this AMI if you have .edu in your email address. Furthermore, Rosetta's license applies to use and distribution of Rosetta software.

To obtain the AMI ID number for this software environment, using your academic (.edu) email address, email rosettainthecloud at gmail.com with the subject line 'Rosetta_AMI'.

FASTA formatted amino acid sequence file

Have all the sequences in FASTA format. We need a FASTA entry for each chain that we want to build in the map. 

Important: We recommend that you copy the FASTA sequence directly from UniProt. If you copy the FASTA sequence from a previously published PDB structure, the sequence might be missing residues, and this can trip up Rosetta.

Additional notes regarding formatting of FASTA files for Rosetta:

  • The FASTA file should include chains separated by '/', where the next chain starts on the next line.
  • All disordered regions need to be removed. If a disordered region is at the beginning or end of a chain, simply delete it. If it is in the middle of a chain, remove the region and replace it with '/'.
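The two formatting rules above can be sketched in Python as follows. This is only our reading of those rules, not a script shipped with cryoem-cloud-tools: the function names, the header name, and the example sequences are hypothetical, and disordered regions are given as 1-based inclusive (start, end) residue ranges.

```python
def trim_disordered(sequence, disordered):
    """Remove disordered ranges from one chain.

    Ranges at either terminus are deleted; ranges in the middle of the
    chain are replaced by '/'.
    """
    drop = [False] * len(sequence)
    for start, end in disordered:
        for i in range(start - 1, end):
            drop[i] = True

    pieces, current = [], []
    for residue, dropped in zip(sequence, drop):
        if dropped:
            if current:  # close the ordered stretch we were collecting
                pieces.append("".join(current))
                current = []
        else:
            current.append(residue)
    if current:
        pieces.append("".join(current))
    return "/".join(pieces)


def rosetta_fasta(name, chains):
    """Build the FASTA text; `chains` is a list of (sequence, disordered_ranges)."""
    body = "/\n".join(trim_disordered(seq, dis) for seq, dis in chains)
    return f">{name}\n{body}\n"


fasta_text = rosetta_fasta(
    "example",
    [
        ("MKTAYIAKQR", [(1, 2), (6, 7)]),  # N-terminal and internal disorder
        ("GSHMLEDP", [(7, 8)]),            # C-terminal disorder
    ],
)
print(fasta_text)
```

For the example input this prints a header line followed by TAY/KQR/ and GSHMLE on separate lines.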

Example file

Run HHpred sequence alignment

Use HHpred to find homologous sequences that align to your PDB chains:

  1. Navigate to https://toolkit.tuebingen.mpg.de/#/tools/hhpred
  2. Input the sequence of your protein (FASTA format) to be modeled by Rosetta and submit to the server with default parameters.
  3. When the job finishes, select 'Save' on the summary results webpage, which is located directly above the aligned sequences. This will save a file in .hhr format.
  4. Open this .hhr file in a text editor (e.g. TextEdit on Mac) and select the top 5-10 alignments. This number is approximate and can be fewer than 5, but more than one sequence alignment is required. The more alignments, the better the model.
  5. Save this file under a new name to indicate that it is the selected-alignment .hhr file.
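Step 4 can also be done with a short script instead of a text editor. The sketch below assumes the usual .hhr layout (a summary table introduced by a "No Hit" header line, followed by alignment blocks that each start with a line "No <number>"); the function name select_top_hits and the heavily simplified example text are ours. Whether the downstream formatting step needs the summary table trimmed as well is an assumption, so compare the result with a hand-edited file before relying on it.

```python
import re


def select_top_hits(hhr_text, n_hits):
    """Keep the header, the first n_hits summary rows and the first n_hits blocks."""
    out = []
    in_table = False    # inside the "No Hit ..." summary table
    in_blocks = False   # inside the per-hit alignment blocks
    keep_block = True
    for line in hhr_text.splitlines():
        block_start = re.match(r"^No (\d+)\s*$", line)
        if block_start:
            in_blocks, in_table = True, False
            keep_block = int(block_start.group(1)) <= n_hits
        if in_blocks:
            if keep_block:
                out.append(line)
            continue
        if line.lstrip().startswith("No Hit"):
            in_table = True
        elif in_table:
            fields = line.split()
            if fields and fields[0].isdigit() and int(fields[0]) > n_hits:
                continue  # summary row of a hit we are not keeping
        out.append(line)
    return "\n".join(out) + "\n"


# Simplified, synthetic .hhr content (real files have more columns and lines)
example = """Query         my_protein
Match_columns 120

 No Hit                             Prob E-value
  1 3tnp_B template one             99.9 1.0E-30
  2 4ye4_A template two             99.5 2.0E-20
  3 1de8_A template three           80.0 1.0E-05

No 1
>3tnp_B template one
Probab=99.9

No 2
>4ye4_A template two
Probab=99.5

No 3
>1de8_A template three
Probab=80.0
"""
trimmed = select_top_hits(example, 2)
print(trimmed)
```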

Part II: Format HHpred sequence alignment & dock top hits for Rosetta

Running time: minutes

We now need to format the .hhr file (with selected alignments) using Rosetta.

To do this, run rosetta_refinement_on_aws.py

$ /path/to/cryoem-cloud-tools/rosetta_scripts/rosetta_refinement_on_aws.py --fasta=[fasta].fasta --hhr=[hhrfile].hhr --AMI=[rosetta AMI] 

This will output the following:

Formatting input .hrr and .fasta files to create PDB files for docking into density...

...booting up instance to format input files...

...running Rosetta file preparation on t2.micro instance...

...finished with file preparation, shutting down instance

 

This script will boot up a single instance on AWS to format the .hhr file for Rosetta, trimming any side chains not found in the sequence alignment and changing any residue that differs between the reference models and the experimental sequences.

The input files for this script are the FASTA file and .hhr file from above.

The output PDB files are named <pdb name>_20*.pdb, where 2 denotes the alignment type that was used for sequence alignment and * is the chain number of the sequence in the FASTA input file.

Dock files <pdb name>_20*.pdb into cryo-EM density

Dock output models named <pdb name>_20*.pdb (e.g. 3tnp_201.pdb) into your cryo-EM map.

  1. Open your cryo-EM density in Chimera
  2. Open each file numbered 201, 202, 203, etc. individually and in order
  3. Dock these models into the density
  4. Save each file (e.g. 201.pdb, 202.pdb, etc.) using Chimera: File > Save PDB > Select PDBs and save relative to cryo-EM density
  5. Open each new docked PDB file and replace “END” or "ENDMDL" with “TER”
  6. If you notice any 'TER' in these newly saved PDB files, this means that there was missing sequence in original FASTA sequence. Perhaps this is OK, but it will mean that part of your structure won't be built if you have these 'TER' present.
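Steps 5 and 6 can be scripted. The following is a minimal sketch with names of our choosing (end_to_ter is not part of cryoem-cloud-tools): it replaces END/ENDMDL records with TER and reports how many TER records were already present, which, as noted in step 6, points to missing sequence.

```python
def end_to_ter(pdb_text):
    """Return (new_text, n_existing_ter) for the text of one docked PDB file."""
    out = []
    n_existing_ter = 0
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # PDB record name lives in columns 1-6
        if record == "TER":
            n_existing_ter += 1
            out.append(line)
        elif record in ("END", "ENDMDL"):
            out.append("TER")
        else:
            out.append(line)
    return "\n".join(out) + "\n", n_existing_ter


# Tiny synthetic example: one pre-existing TER and a final END record
docked = (
    "ATOM      1  CA  ALA A   1\n"
    "TER\n"
    "ATOM      2  CA  GLY A  10\n"
    "END\n"
)
fixed, n_ter = end_to_ter(docked)
print(fixed)
print("TER records already present:", n_ter)
```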

If you have a model with multiple chains:

  1. Open your cryo-EM density in Chimera
  2. Open each file numbered 201, 202, 203, etc. individually and in order for EACH chain. This means for a two chain model (chain 'A' and 'B'), you would open chain A PDBs in order (201, 202, etc.) THEN open chain B PDBs in order (201, 202, etc.)
  3. Dock all models for all chains into the density
  4. When finished, save all chains for a given numbered type (e.g. 201, 202, 203, etc.) into a single file together using Chimera: File > Save PDB > Select PDBs and save relative to cryo-EM density
    1. For example, if you have files chainA_pdb1_201.pdb, chainA_pdb2_202.pdb, chainB_pdb1_201.pdb, chainB_pdb2_202.pdb, you would save chainA_pdb1_201.pdb and chainB_pdb1_201.pdb into a new file (e.g. docked_201.pdb) and then save chainA_pdb2_202.pdb and chainB_pdb2_202.pdb into a new file (e.g. docked_202.pdb).
  5. Open each new docked PDB file and replace “END” or "ENDMDL" with “TER”
  6. If you notice any 'TER' in these newly saved PDB files, this means that there was missing sequence in original FASTA sequence. Perhaps this is OK, but it will mean that part of your structure won't be built if you have these 'TER' present.
  7. Run relabechain.pl to label chains in order
  8. Check: you should see the same number of chains in this relabeled chain file as you used originally. For example, for four chains, this relabeled pdb file will have chains labeled A, B, C, D.
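The check in step 8 can be automated by listing the chain IDs in the relabeled file. This is a small sketch under the assumption that the file follows the standard fixed-column PDB layout, where the chain ID is in column 22; the helper names are ours.

```python
def chain_ids(pdb_text):
    """Return the chain IDs in order of first appearance."""
    seen = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain = line[21]  # column 22 of the PDB format
            if chain not in seen:
                seen.append(chain)
    return seen


def atom_line(serial, chain, resname="ALA", resseq=1):
    """Build a minimal, column-aligned ATOM record for the example."""
    return f"ATOM  {serial:5d}  CA  {resname:>3s} {chain}{resseq:4d}"


# Synthetic four-chain file
relabeled = "\n".join(
    [atom_line(i + 1, chain) for i, chain in enumerate("AABBCD")] + ["TER"]
)
print(chain_ids(relabeled))  # ['A', 'B', 'C', 'D']
```

If the number of chain IDs differs from the number of chains you started with, revisit the docking and relabeling steps before continuing.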

Example files: 

Format output PDB files into list format

Make a text file (.txt extension) that contains a list of all PDB input files and their respective weights. Weight 1 models will be sampled during the initial stages of refinement, whereas weight 0 models will be sampled once the initial structure is built. Note: All weight 1 models need to have the same number of chains as the large FASTA file while the weight 0 models can be fragments of the entire PDB.

Example file for a single chain:

pdb_list.txt: 

4ye4_201.pdb 1

1de8_202.pdb 1

3rt9_203.pdb 1

4thu_204.pdb 1

1frt_205.pdb 1

Example file for multiple chains:

pdb_list.txt: 

merge_201.pdb 1

merge_202.pdb 1

merge_203.pdb 1

merge_204.pdb 1

merge_205.pdb 1

Here, merge_201.pdb comes from concatenating all _201.pdb files generated for each chain in your model.
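For many models, the list file can be generated rather than typed. A minimal sketch, assuming one "<file name> <weight>" entry per line as in the examples above; the function name pdb_list_text and the fragment file name are hypothetical.

```python
def pdb_list_text(full_models, fragments=()):
    """Weight 1 for complete models, weight 0 for fragment models."""
    lines = [f"{name} 1" for name in full_models]
    lines += [f"{name} 0" for name in fragments]
    return "\n".join(lines) + "\n"


text = pdb_list_text(
    ["merge_201.pdb", "merge_202.pdb"],
    fragments=["loop_fragment.pdb"],  # hypothetical weight 0 model
)
print(text)

# To write it next to the PDB files:
# with open("pdb_list.txt", "w") as handle:
#     handle.write(text)
```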

 


Part III: Running RosettaCM

Running time: Hours

Now we are ready to run CM. This routine adjusts the gross position of the backbone into the EM map:

$ /path/to/cryoem-cloud-tools/rosetta_scripts/rosetta_refinement_on_aws.py --em_map=[3Dvolume].mrc --fasta=[fasta].fasta --AMI=[rosetta AMI] --pdb_list=[pdb_list].txt

  • Output directory: If no output directory is specified, it will create an automatically named directory.

While the .fasta sequence file and 3D map are required, an important input for this step is the PDB list generated in Part II. These PDB files have been prepared specifically and cannot be created in any other way. Therefore, if you don't have these at this point, go back to Part II and generate them.
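Because a missing input is only discovered after instances have been requested, it can help to check the inputs locally first. The sketch below is an assumption-laden convenience wrapper, not part of cryoem-cloud-tools: it only uses the four flags shown in the command above, it assumes the PDB files named in the list sit in the same directory as the list file, and it returns the command without running it.

```python
import os


def build_cm_command(script, em_map, fasta, ami, pdb_list):
    """Check that all inputs exist, then return the command as an argument list."""
    # Collect the PDB file names from the first column of the list file
    listed = []
    with open(pdb_list) as handle:
        for line in handle:
            parts = line.split()
            if parts:
                listed.append(parts[0])

    base = os.path.dirname(os.path.abspath(pdb_list))
    required = [em_map, fasta, pdb_list] + [os.path.join(base, n) for n in listed]
    missing = [path for path in required if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))

    return [
        script,
        f"--em_map={em_map}",
        f"--fasta={fasta}",
        f"--AMI={ami}",
        f"--pdb_list={pdb_list}",
    ]
```

The returned list can be passed to subprocess.run() once you are satisfied with it, with the script path, map, FASTA file, AMI ID, and list file replaced by your own values.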

When running this command, you will see the following output to your terminal: 

Starting Rosetta model refinement in the cloud ...

Starting Rosetta job on 8 x c4.8xlarge virtual machines on AWS in region us-east-2a (initialization will take a few minutes)

...uploading files to AWS ...

Rosetta job submitted on AWS! Monitor output file: 2017-09-20-123836-Rosetta-CM/rosetta.out to check status of job

At this point, you can follow the output log file to monitor the status of the Rosetta job. The file is updated every 5 minutes, and the output PDB files are ultimately returned to the output directory. The log file will contain information of the following type:

Rosetta model refinement started at 2017-09-20T12:48:00UTC

Checking job completion status (updates every 5 minutes)

Rosetta refinements typically take 1 - 6 hours

...running...   (2017-09-20T12:59:00UTC)

...running...   (2017-09-20T13:04:00UTC)

...running...   (2017-09-20T13:09:00UTC)

...running...   (2017-09-20T13:14:00UTC)

Job finished on #1 at 2017-09-20T13:19:00UTC

...

Rosetta refinement finished. Shutting down instances at 2017-09-20T13:38:00UTC
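Instead of re-reading the log by hand, its status can be summarized with a few lines of Python. This sketch relies only on the message wording shown in the sample output above ("Job finished on #N at ..." and "Rosetta refinement finished"); if your version of the script words these lines differently, the patterns need adjusting. The function name job_status is ours.

```python
import re


def job_status(log_text):
    """Return (all_done, finished_job_numbers) parsed from rosetta.out text."""
    finished = re.findall(r"Job finished on #(\d+) at \S+", log_text)
    all_done = "Rosetta refinement finished" in log_text
    return all_done, [int(n) for n in finished]


sample = """Rosetta model refinement started at 2017-09-20T12:48:00UTC
...running...   (2017-09-20T12:59:00UTC)
Job finished on #1 at 2017-09-20T13:19:00UTC
"""
print(job_status(sample))  # (False, [1])

# With a real run, read the log named in the terminal output, e.g.:
# with open("2017-09-20-123836-Rosetta-CM/rosetta.out") as handle:
#     print(job_status(handle.read()))
```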

Output files

When finished, all generated PDB files will be found in the folder:

[output directory]/output/

Expert-level option: Modeling DNA (or any other kind of 'ligand')

Nucleic acid is treated like a ligand by Rosetta, and RosettaCM (see above) does not really change the backbone conformation of the DNA/RNA. However, to model protein-DNA contacts properly, the nucleic acids need to be included while running RosettaCM.

  1. Add "add_hetatm=1" to hybridize.xml (we might just have this flag for all runs)
  2. Have the DNA in weight 1 models; the DNA chain should start after the protein chains
  3. Don’t have the DNA in the FASTA file
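The chain-order requirement in the second point can be checked before submitting a job. The sketch below is our own helper, not part of Rosetta or cryoem-cloud-tools. It assumes standard fixed-column PDB records and treats a chain as nucleic acid only if all of its residues carry the standard DNA/RNA residue names; modified nucleotides would need to be added to the set.

```python
NUCLEIC = {"DA", "DC", "DG", "DT", "A", "C", "G", "U"}


def nucleic_chains_last(pdb_text):
    """True if no protein chain appears after a nucleic acid chain."""
    order = []        # chain IDs in order of first appearance
    is_nucleic = {}   # chain ID -> True while only nucleic residues were seen
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 22:
            continue
        chain = line[21]               # column 22: chain ID
        resname = line[17:20].strip()  # columns 18-20: residue name
        if chain not in is_nucleic:
            order.append(chain)
            is_nucleic[chain] = True
        if resname not in NUCLEIC:
            is_nucleic[chain] = False
    flags = [is_nucleic[c] for c in order]
    return flags == sorted(flags)  # all False (protein) before all True (nucleic)


def atom_line(serial, chain, resname):
    """Build a minimal, column-aligned ATOM record for the example."""
    return f"ATOM  {serial:5d}  CA  {resname:>3s} {chain}   1"


good = "\n".join([atom_line(1, "A", "ALA"), atom_line(2, "B", "GLY"), atom_line(3, "C", "DA")])
bad = "\n".join([atom_line(1, "A", "DA"), atom_line(2, "B", "GLY")])
print(nucleic_chains_last(good), nucleic_chains_last(bad))  # True False
```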

Expert-level option: How to deal with poor backbone placement

  1. Segment that region from the EM map (e.g. using UCSF Chimera)
  2. Make a separate input FASTA and pdb for that region
  3. Run RosettaCM for that small region, using only the segmented map and the shortened amino acid sequence.

Expert-level option: De novo chain building

Rosetta can build short stretches of amino acids de novo. To use this feature, simply delete that region from the input PDBs and keep it in the input FASTA. Note: De novo building works best for small regions. When running CM on a large system (>800 residues), try not to build more than 8 amino acids de novo at a stretch.

Part IV: Running Rosetta Relax

After running RosettaCM, you MUST run Rosetta-Relax to optimize the orientation of the side chains.

Input files needed:

  • Half map from RELION refinement filtered to same resolution with the same B-factor applied as the value calculated by RELION for combined structure
    • Example command: $ relion_image_handler --i run_half1_class001_unfil.mrc --o run_half1_class001_unfil_filtered.mrc --angpix 0.637 --lowpass 2.2 --bfactor 50
  • Lowest energy PDB model from Rosetta-CM
    • The PDB list file (.txt) will include this as a single entry
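Picking the lowest-energy model can be scripted if the RosettaCM output PDB files carry Rosetta's score table at the end. The sketch below assumes that table contains a "label" row naming the score terms (including "total") and a "pose" row with the whole-model values; please confirm this against your own output files, since we have not verified the exact layout written by this pipeline. The function names and the synthetic models are ours.

```python
def total_score(pdb_text):
    """Return the 'total' value from the pose row of the score table, or None."""
    labels = None
    for line in pdb_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "label":
            labels = fields
        elif fields[0] == "pose" and labels and "total" in labels:
            return float(fields[labels.index("total")])
    return None


def lowest_energy(models):
    """`models` maps file name -> file text; returns the name with the lowest total."""
    scored = {name: total_score(text) for name, text in models.items()}
    scored = {name: s for name, s in scored.items() if s is not None}
    return min(scored, key=scored.get)  # raises ValueError if nothing was scored


def fake_model(total):
    """Synthetic stand-in for the tail of a Rosetta output PDB."""
    return (
        "#BEGIN_POSE_ENERGIES_TABLE model\n"
        "label fa_atr fa_rep total\n"
        "weights 1 0.55 NA\n"
        f"pose -300.0 20.0 {total}\n"
        "#END_POSE_ENERGIES_TABLE model\n"
    )


models = {"S_0001.pdb": fake_model(-250.5), "S_0002.pdb": fake_model(-310.2)}
print(lowest_energy(models))  # S_0002.pdb
```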

 

Example command:

cryoem-cloud-tools/rosetta/rosetta_refinement_on_aws.py --em_map=postprocess_job090_wn.mrc --hhr=hhpred_5556431_editedTop5.hhr --AMI=[ami] --pdb_list=relax_inputlist.txt --sym=betagal.symm --num=36 --num_per_VM=20 -r

 


Part V: Validating your results
