Usage

This section handles how to use StrainChoosr. While it goes over command line options, the API for the run_strainchoosr function is a mirror of the command line, so it should be applicable to both.

StrainChoosr is built to find the most diverse subset of strains from a phylogenetic tree. To do this, it uses a greedy algorithm described in Pardi2005 and Steel2005. Briefly, the algorithm starts by finding the two leaves of a tree with the greatest distance between them, and then iteratively adds the leaf that adds the most diversity possible to that tree until the desired number of strains has been found.

The basic usage is simple - from the command line the following will choose the 5 most diverse strains from the newick-formatted treefile provided and print them to screen. Additionally, an HTML report (default name strainchoosr_output.html) will be created that visualizes where on the tree your selected strains fall will be created.

strainchoosr --treefile /path/to/tree.nwk --number 5

It may be that you know for sure that you want certain strains selected, and go for the most diverse strains after that. To do that, you can specify starting strains from the command line. For example, the following will start with Strain1 and Strain3, and add the 3 strains to those that create the most diversity.

strainchoosr --treefile /path/to/tree.nwk --number 5 --starting_strains Strain1 Strain3

Another possibility is that you have preferences for strains you want selected, but they aren’t absolute - for this case, you can make strains more or less likely to be chosen by modifying their weights. To do this, create a tab-separated file with strain names in column one and the weight to modify by in column two - numbers greater than one increase the chances a strain will be chosen, and numbers less than one decrease chances. This is done by multiplying branch length, and so won’t have any effect of 0 length branches.

The following weight file would make Strain1 more likely to be chosen and Strain3 less likely.:

Strain1 2
Strain3 0.2

You would then include this weight file (pretend you named it weights.tsv) in your comma would then include this weight file (pretend you named it weights.tsv) in your command line call.

strainchoosr --treefile /path/to/tree.nwk --number 5 --weight_file weights.tsv

A few other options that provide minor tweaks are available - full usage is below:

usage: strainchoosr [-h] -t TREEFILE -n NUMBER [NUMBER ...] [-o OUTPUT_NAME]
                    [--tree_mode {r,c}] [--weight_file WEIGHT_FILE]
                    [--starting_strains STARTING_STRAINS [STARTING_STRAINS ...]]
                    [--color COLOR] [--verbosity {debug,info,warning}] [-v]

StrainChoosr uses the greedy algorithm described in Pardi 2005/Steel 2005 to
find the most diverse subset of strains from a phylogenetic tree.

optional arguments:
  -h, --help            show this help message and exit
  -t TREEFILE, --treefile TREEFILE
                        Path to treefile, in newick format.
  -n NUMBER [NUMBER ...], --number NUMBER [NUMBER ...]
                        Number of representatives wanted. More than one can be
                        specified, separated by spaces.
  -o OUTPUT_NAME, --output_name OUTPUT_NAME
                        Base output name for file. PUT MORE INFO HERE.
  --tree_mode {r,c}     Mode to display output trees in - choose from r for
                        rectangular or c for circular. Defaults to
                        rectangular.
  --weight_file WEIGHT_FILE
                        Path to file specifying weights for leaves in tree.
                        File must be tab-separated, with leaf names in the
                        first column and weights in the second. Leaves not
                        listed will be assigned a weight of 1.
  --starting_strains STARTING_STRAINS [STARTING_STRAINS ...]
                        Names of strains that must be included in your set of
                        diverse strains, separated by spaces.
  --color COLOR         Color you want to have selected strains shown as. List
                        of available colors is available at http://etetoolkit.
                        org/docs/latest/reference/reference_treeview.html#ete3
                        .SVG_COLORS Defaults to red.
  --verbosity {debug,info,warning}
                        Choice of how much information you want printed to the
                        terminal. Set debug to see a ridiculous amount of
                        stuff, info for a normal amount, and warning for very
                        minimal output.
  -v, --version         show program's version number and exit