# goldilocks package

## goldilocks.cmd module

goldilocks.cmd.main()

## goldilocks.goldilocks module

class goldilocks.goldilocks.Goldilocks(strategy, data, length, stride, is_pos=False, is_faidx=False, is_pos_file=False, ignore_len_mismatch=False, processes=2)

Bases: object

Facade class responsible for conducting a census of genomic regions.

Given sequence data and a goldilocks.strategies search strategy, Goldilocks can census regions of a desired length and overlap along multiple genomes, and provides an interface to query the results against given criteria.

Parameters:
• strategy (Strategy object) – An instantiated goldilocks.strategies search strategy.

• data (dict{str, dict{[str|int], str}}) – Data on which to conduct a census, in a nested dict.

Top level keys represent individual samples whose value is another dict using str or int chromosome keys and sequence data as values. As an example:

"my_sample": {
"chrom_name_or_number": "SEQUENCE",
},
"my_other_sample": {
"chrom_name_or_number": "SEQUENCE",
}


When submitting with FASTA, the index for each sample must be provided in the following format:

"my_sample": {
"file": "/path/to/my_sample.fa.fai"
}

• length (int) – Desired region length; all censused regions will be exactly this many bases long. A region that would extend beyond the highest position seen on a chromosome is not added to the census.

• stride (int) – Number of bases to add to the start of the last region to obtain the start of the next. If LENGTH==STRIDE, there is no overlap and each region begins on the base following the end of the previous region.

• is_pos (boolean, optional(default=False)) – Whether the data stored in data holds base positions rather than sequence data. If is_pos is True, Goldilocks will expect a list of base positions.

• is_faidx (boolean, optional(default=False)) – Whether or not the data in data refers to the locations of FAIDX files. If is_faidx is True, Goldilocks will expect a path to be provided under a file key for each sample in the data dict.

• processes (int, optional(default=2)) – The number of additional processes to spawn to perform the census.
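The way length and stride interact can be sketched in a few lines of plain Python. This is an illustration only, not Goldilocks' own code: `tile_regions` is a hypothetical helper, and it assumes the 1-indexed, inclusive coordinates that this reference uses for pos_start and pos_end.

```python
def tile_regions(chrom_len, length, stride):
    """Yield (pos_start, pos_end) pairs, 1-indexed and inclusive.

    Hypothetical helper written for illustration only.
    """
    if length < 1 or stride < 1:
        raise ValueError("length and stride must be at least 1")
    start = 1
    # Stop as soon as a region would run past the last known base.
    while start + length - 1 <= chrom_len:
        yield (start, start + length - 1)
        start += stride


# length == stride: regions abut with no overlap.
no_overlap = list(tile_regions(16, length=4, stride=4))
# stride < length: consecutive regions share (length - stride) bases.
overlapping = list(tile_regions(10, length=4, stride=2))
```

With a chromosome of 10 bases, a length of 4 and a stride of 2, a fifth region would have to span bases 9 to 12, so it is dropped rather than truncated.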

Variables:
• strategy (Strategy object) – The desired goldilocks.strategies search strategy as selected by the user upon initialisation. Goldilocks will expect the necessary census function to be implemented in the strategy class.
• data (dict{str, dict{[str|int], str}}) – Data on which to conduct a census, in a nested dict.
• LENGTH (int) – Desired region length, as provided by the user.
• STRIDE (int) – Desired region stride, as provided by the user.
• PROCESSES (int) – Number of processes to spawn and administer during census.
• IS_POS (boolean) – Whether or not data is expected to contain base-position information rather than sequence data. This will force use of PositionCounterStrategy.
• IS_FAI (boolean) – Whether or not data is expected to contain references to FASTA index files.
• chr_max_len (dict{str, int}) – Maps names of chromosomes to the largest size encountered for that chromosome across all samples.
• groups (dict{str, dict{[str|int], str}}) – A copy of the input data dict. If is_faidx, the groups dictionary will be modified to contain information loaded from the FASTA index for each sample.
• num_expected_regions (int) – The total number of regions anticipated to require census.
• counter_matrix (Unlocked 3D numpy array buffer) –

Stores the value returned by the census for a group-track-region triplet. Dimensions are thus:

counter_matrix[group][track][region_i]

• The “Total” group is group 0.
• The “Default” track is track 0.
• Thus, counter_matrix[0][0] holds, for each region i, the aggregate over all groups and tracks.

Once the census is complete the data stored in these counters are used for calculating a target value such as the maximum, minimum, mean or median.

• group_buckets (dict{str, dict{str, dict{[int|float], list{int}}}}) –

Each group contains a dictionary of track-bucket dicts. For each group-track, a dict maps each value returned from strategy evaluation to a list of the ids of the regions that evaluated to that value.

In a very basic example where a census is conducted for ‘A’ nucleotides over one sample (group) which features one chromosome of length 16:

1|AAAA..AA.AA.AAAA|16


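With a length of 4 and a stride of 4 (i.e. an overlap of 0):

| ID | Start | Sequence | End | Value |
|----|-------|----------|-----|-------|
| 0 | 1 | AAAA | 4 | 4 |
| 1 | 5 | ..AA | 8 | 2 |
| 2 | 9 | .AA. | 12 | 2 |
| 3 | 13 | AAAA | 16 | 4 |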

The buckets would be organised as thus:

\ 2 /\ 4 /
|    |
|    > [0,3]
|
> [1,2]


Once the desired ‘target’ value has been calculated (max, min, mean or median), these buckets are used to select regions (by their ID) that fall inside the desired distance from the target, without requiring iteration over all censused regions again.

• regions (dict{int, dict{str, int}}) –

A dict mapping automatically assigned ascending (from 0) integer ids to censused region metadata including the following keys:

| Key | Value |
|-----|-------|
| id | The i’th region to be censused. |
| chr | Chromosome on which this region appears. |
| ichr | Region is the i’th to appear on this chromosome. |
| pos_start | 1-indexed base this region starts on (incl.) |
| pos_end | 1-indexed base this region ends on (incl.) |

The ichr can be used to access corresponding counter information from counter_matrix[group][track][ichr].

These ids are also the same ids saved in relevant group_buckets.

• selected_regions (list{int}) – A list of region ids representing the result of a sorting operation after a call to query.
• selected_count (int) – The size of selected_regions, otherwise -1. This is used to decide whether the Goldilocks object should return regions from selected_regions or all regions, as stored in regions.
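The group_buckets example above (one chromosome of length 16, censused for ‘A’ with a length of 4 and a stride of 4) can be reproduced with a plain dictionary. This is a minimal sketch: `build_buckets` is a hypothetical name, and the real attribute is additionally keyed by group and track.

```python
from collections import defaultdict


def build_buckets(sequence, length, stride, base="A"):
    """Map each census value to the ids of the regions that produced it."""
    buckets = defaultdict(list)
    region_id = 0
    # range() works on 0-indexed offsets; the reference reports 1-indexed bases.
    for offset in range(0, len(sequence) - length + 1, stride):
        value = sequence[offset:offset + length].count(base)
        buckets[value].append(region_id)
        region_id += 1
    return dict(buckets)


buckets = build_buckets("AAAA..AA.AA.AAAA", length=4, stride=4)
# Regions at a given value can be looked up directly, without
# scanning every censused region again.
at_max = buckets[max(buckets)]
```

Here `buckets` maps 4 to region ids 0 and 3, and 2 to region ids 1 and 2, matching the table in the example.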
Raises:

ValueError – If either length or stride is less than one.
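The group-track-region layout of counter_matrix can be pictured with nested lists. The sketch below is a stand-in, not the numpy buffer the class actually holds: the group and track names are made up, `record` is a hypothetical helper, and rolling values up by summation is an assumption that suits counting strategies (the aggregate is not necessarily a sum for other strategies).

```python
groups = ["total", "my_sample", "my_other_sample"]   # index 0 is "total"
tracks = ["default", "A", "N"]                       # index 0 is "default"
num_regions = 4

# counter[group][track][region_i], initialised to zero.
counter = [[[0] * num_regions for _ in tracks] for _ in groups]


def record(group, track, region_i, value):
    """Store a census value and roll it up into the aggregate slots."""
    g, t = groups.index(group), tracks.index(track)
    # Sets avoid double counting if the group or track is already index 0.
    for gi in {g, 0}:
        for ti in {t, 0}:
            counter[gi][ti][region_i] += value


record("my_sample", "A", 0, 4)
record("my_other_sample", "A", 0, 2)
record("my_sample", "N", 0, 1)

total_default = counter[0][0][0]   # all groups, all tracks, region 0
```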

candidates

Returns: Candidate List – If a query has been performed with goldilocks.goldilocks.Goldilocks.query(), returns a list of the candidates found in regions, ordered and filtered by their presence in selected_regions (that is, sorted according to the func used by that query). Otherwise, returns regions as a list.

Return type: list{dict{str, dict{str, int}}}
census()

Conduct a census of genomic subregions of a given size and overlap over chromosomes identified in each submitted sample. For each chromosome, each sample is loaded and split into regions of the correct size and overlap, which are then processed and evaluated by the desired strategy.
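As a rough illustration of the loop described here, the sketch below walks each chromosome, slices every sample into regions and hands each slice to a strategy. `CountA` and `run_census` are hypothetical stand-ins written for this example; only the census(sequence, track, **kwargs) method shape is taken from the strategies module. It also simplifies by using each sample's own sequence length and a single process.

```python
class CountA:
    """Stand-in strategy that counts one base per track."""
    TRACKS = ["A"]

    def census(self, sequence, track, **kwargs):
        return sequence.count(track)


def run_census(strategy, data, length, stride):
    """Return {sample: {chrom: {track: [value per region]}}}."""
    results = {}
    # Collect every chromosome key seen in any sample.
    chroms = sorted({c for sample in data.values() for c in sample}, key=str)
    for chrom in chroms:
        for sample, sequences in data.items():
            seq = sequences.get(chrom, "")
            per_track = {t: [] for t in strategy.TRACKS}
            for offset in range(0, len(seq) - length + 1, stride):
                region = seq[offset:offset + length]
                for track in strategy.TRACKS:
                    per_track[track].append(strategy.census(region, track))
            results.setdefault(sample, {})[chrom] = per_track
    return results


data = {
    "my_sample": {1: "AAAA..AA.AA.AAAA"},
    "my_other_sample": {1: "AA..AA..AA..AA.."},
}
values = run_census(CountA(), data, length=4, stride=4)
```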

export_fasta(groups=None, track='default', to=None, divide=False)

Export all regions held in FASTA format.

export_meta(group=None, track=None, to=<open file '<stdout>', mode 'w'>, fmt='table', sep='\t', overlaps=True, header=True, ignore_query=False, value_bool=False, divisible=None, chr_prefix='')
plot(group=None, tracks=['default'], bins=None, ylim=None, save_to=None, annotation=None, title=None, ignore_query=False, chrom=None, prop=False, bin_max=None)

Represent censused regions in a plot using matplotlib.

query(func='median', track='default', actual_distance=None, percentile_distance=None, direction=0, group='total', limit=0, exclusions=None, use_and=False, use_chrom=False, gmin=None, gmax=None)

Query the Goldilocks census to retrieve regions that meet given criteria.

Parameters:
• func (str, options={“max”, “min”, “mean”, “median”}) – Sorting function to be applied when returning regions which meet the input criteria and from which to calculate distances for use with actual_distance or percentile_distance.

• group (str, optional(default=”total”)) – Sort and filter using only values evaluated by the strategy from data in the given sample group. Whilst it is possible to census data from many different sample groups in the same census, you may only be interested in isolating regions on a particular sample.

If a group is not provided, by default, the “total” group is used, which represents the aggregate (but not necessarily sum) of values seen at a region site across all samples.

For example, the “total” group for a nucleotide counting strategy would contain the sum of all counted bases for all genomic sub-sequences that lie on the given region for each track in the strategy.

• track (str, optional(default=”default”)) – Sort and filter using only values evaluated by the strategy from data in the given track. In a simple nucleotide counting example, whilst you can count for multiple bases in the same census, you may wish to query based on data from just ‘N’ bases.

If a track is not provided, by default, the “default” track is used, which represents the aggregate (but not necessarily sum) of values seen at a region site across all tracks.

For example, the “default” track for a nucleotide counting strategy would contain the sum of all counted bases over a given region for a given group.

Note

The “total” group contains a “default” track which, for a simple nucleotide counting strategy, would hold the sum of all bases of interest seen across sub-sequences on all groups, over all tracks that lie on the given region.

Note

Ratio-based strategies that do not simply count instances of given bases, motifs and so on will be correctly weighted if the RATIO flag is set in the appropriate strategy class. No other special handling is required for this sort of strategy.

• actual_distance (float, optional(default=None))

• percentile_distance (float, optional(default=None)) – Filter regions whose value as returned from the selected strategy falls outside the absolute or percentile distance from the target as calculated by func.

actual_distance will filter regions whose difference from the target falls outside the given value.

percentile_distance will filter regions whose value falls outside the given number of percentiles.

When used with direction, one may decide whether to look above, below or around the target.

Note

actual_distance and percentile_distance are mutually exclusive.

• direction (int, optional(default=0)) – When using actual_distance or percentile_distance, one may choose whether to select regions that appear within the desired distance above, below or around the target, as calculated by func.

Any positive value will set the direction to “upper”; any negative value will set the direction to “lower”. By default direction is 0, which will search around the target.

For example, to find the 25% of values that appear above the mean, set percentile_distance to 25, func to “mean” and direction to 1. To find the 10% of values below the median, set percentile_distance to 10, func to “median” and direction to -1.

To find regions within plus/minus 5.0 of the mean, set func to “mean”, direction to 0 and actual_distance to 10.

Note

If func is max or min, the direction will automatically be changed to +1 or -1, respectively - as it doesn’t make sense to search “around” the maximum or minimum value.

• gmin (int, optional(default=None)) – Filter any candidates whose value is below (but not equal to) gmin.

• gmax (int, optional(default=None)) – Filter any candidates whose value is above (but not equal to) gmax.

• limit (int, optional(default=0)) – Maximum number of regions to return. By default, all regions that meet the specified criteria will be returned.

• exclusions ([dict{str, dict{str, [int|str|list]}} | dict{[int|str], dict{str, [int|str|list|boolean]}}], optional(default=None)) – A dict defining criteria with which to filter regions.

The dict may be specified in two ways: keys can either be exclusion properties as found in the table below, or they can match the names or numbers of chromosomes provided to the Goldilocks constructor, with each value being a dict that maps exclusion properties to values.

The former method applies the specified exclusions to all regions, whereas the latter, when use_chrom is set to True, applies exclusions to particular chromosomes.

Currently the following exclusion criteria are available:

| Criterion | Purpose |
|-----------|---------|
| start_lte | Region starts on 1-indexed base less than or equal to value |
| start_gte | Region starts on 1-indexed base greater than or equal to value |
| end_lte | Region ends on 1-indexed base less than or equal to value |
| end_gte | Region ends on 1-indexed base greater than or equal to value |
| chr | Region appears on chr in given list |
| region_group_lte | Ignore candidates whose value for the provided group is lte the value of interest. |
| region_group_gte | Ignore candidates whose value for the provided group is gte the value of interest. |

Further information and examples on using these effectively can be found in the documentation on sorting and filtering.

• use_and (boolean, optional(default=False)) – A flag to indicate whether a region must meet all exclusion criteria defined in exclusions to be excluded. By default this is False and a region will be excluded if it meets one or more exclusion criteria.

• use_chrom (boolean, optional(default=False)) – A flag to indicate that the keys of the exclusions dict are chromosome identifiers and exclusion criteria within should be applied to particular chromosomes. By default it is assumed that keys in the exclusions dict are to be applied to all regions, regardless of the chromosome on which they appear.

Note

use_chrom can be used with use_and; in that case, all criteria in each block of chromosome-specific exclusions must be met for a region on that chromosome to be excluded.

Note

Exclusions that are not inside a key that matches a chromosome will apply to all chromosomes. However, exclusions defined in a sub-dict with a chromosome key, will override the global exclusions that have been set.

Note

Goldilocks will print a warning to stdout if it encounters the name of a chromosome in the exclusions dict without use_chrom being set to True, but will continue to complete the query anyway.
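The interplay of func, actual_distance and direction can be sketched without the library. This is a hypothetical re-implementation, not Goldilocks' code. It follows the plus/minus 5.0 example in treating actual_distance as the total width of the window when direction is 0. The one-sided windows and the inclusive bounds are assumptions. Only “mean” and “median” are handled; “max”, “min” and percentile_distance are left out.

```python
from statistics import mean, median

TARGETS = {"mean": mean, "median": median}


def within_distance(values, func="median", actual_distance=None, direction=0):
    """Return ids of regions whose value lies inside the window at the target."""
    if func not in TARGETS:
        raise TypeError("unsupported func: %r" % (func,))
    target = TARGETS[func](values)
    if actual_distance is None:
        return list(range(len(values)))
    if direction > 0:       # "upper": look above the target only (assumed)
        low, high = target, target + actual_distance
    elif direction < 0:     # "lower": look below the target only (assumed)
        low, high = target - actual_distance, target
    else:                   # around: half the distance on either side
        half = actual_distance / 2
        low, high = target - half, target + half
    return [i for i, v in enumerate(values) if low <= v <= high]


values = [4, 2, 2, 4, 9, 15]   # one census value per region id; mean is 6
around = within_distance(values, "mean", actual_distance=10, direction=0)
above = within_distance(values, "mean", actual_distance=10, direction=1)
```

With a mean of 6, the first call keeps values between 1 and 11 and the second keeps values between 6 and 16.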

Returns:

Goldilocks object – Returns the current Goldilocks, setting selected_regions to the list of candidates found and selected_count to the length of selected_regions. Sorts are always descending from absolute distance to the target as calculated by func.

Return type:

goldilocks.goldilocks.Goldilocks

Raises:
• TypeError – When attempting to sort by an invalid func.
• ValueError – If attempting to filter both by actual_distance and percentile_distance.
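The difference between the default exclusion behaviour and use_and can be pictured as any() versus all() over the exclusion criteria. The sketch below is a hypothetical re-implementation covering only start_lte, start_gte, end_lte, end_gte and chr, applied to a region dict shaped like the entries in regions; it is not the library's filtering code and ignores use_chrom.

```python
CHECKS = {
    "start_lte": lambda region, v: region["pos_start"] <= v,
    "start_gte": lambda region, v: region["pos_start"] >= v,
    "end_lte": lambda region, v: region["pos_end"] <= v,
    "end_gte": lambda region, v: region["pos_end"] >= v,
    "chr": lambda region, v: region["chr"] in v,
}


def is_excluded(region, exclusions, use_and=False):
    """Decide whether a region dict matches the exclusion criteria."""
    if not exclusions:
        return False
    hits = [CHECKS[name](region, value) for name, value in exclusions.items()]
    # use_and=True: every criterion must match; otherwise one is enough.
    return all(hits) if use_and else any(hits)


region = {"id": 0, "chr": 1, "ichr": 0, "pos_start": 1, "pos_end": 4}
exclusions = {"start_lte": 3, "chr": [2, "X"]}

excluded_any = is_excluded(region, exclusions)                # start_lte matches
excluded_all = is_excluded(region, exclusions, use_and=True)  # chr does not
```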
reset_candidates()

## goldilocks.strategies module

class goldilocks.strategies.BaseStrategy(tracks=None, label='')

Bases: object

Interface with which census strategies must comply.

It is intended that all valid census strategies must inherit from BaseStrategy and provide implementations for each of the methods.

Parameters:
• tracks (list{str}, optional(default=None)) – A list of strings defining multiple features of interest in the context of this strategy with which to perform the census. For example, a simple nucleotide counting strategy will accept a list of nucleotides of interest. By default the argument is None, which will cause the TRACKS attribute to be populated with one track: “default”.

• label (str, optional(default="")) – A string used to annotate what the values returned by this strategy’s evaluate method represent, particularly for use on the y-axis in any plots generated. If performing a census for GC Ratio, a suitable label might be “GC Ratio”. A nucleotide counter may use something more generic such as “Nucleotide Count”.

Variables:
• TRACKS (list{str}) – A list of strings representing features of interest in the input data to be censused, as provided by the user on instantiation.
• AXIS_LABEL (str) – A string used to establish the context of the values returned from this strategy’s evaluate function, as provided by the user on instantiation.
census(sequence, track, **kwargs)
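To make the interface concrete, here is a sketch of what a conforming strategy might look like. `BaseStrategySketch` is a local stand-in that mirrors only the constructor arguments and the TRACKS and AXIS_LABEL attributes described above, so the example runs without Goldilocks installed. `DinucleotideCounter` is a hypothetical strategy, not one shipped with the package, and the real base class may require further methods.

```python
class BaseStrategySketch:
    """Local stand-in for the documented interface (not the real class)."""

    def __init__(self, tracks=None, label=""):
        # With no tracks supplied, a single "default" track is used.
        self.TRACKS = list(tracks) if tracks else ["default"]
        self.AXIS_LABEL = label

    def census(self, sequence, track, **kwargs):
        raise NotImplementedError("strategies must implement census()")


class DinucleotideCounter(BaseStrategySketch):
    """Hypothetical strategy: count a dinucleotide such as "CG" per region."""

    def __init__(self, tracks=None):
        super().__init__(tracks=tracks or ["CG"], label="Dinucleotide Count")

    def census(self, sequence, track, **kwargs):
        # str.count counts non-overlapping matches, which is adequate for "CG".
        return sequence.upper().count(track)


strategy = DinucleotideCounter()
count = strategy.census("acgtCGCGaa", "CG")
```

One value is returned per region and track, which is the shape the census loop needs in order to fill in its counters.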
class goldilocks.strategies.GCRatioStrategy(tracks=None)
census(sequence, track, **kwargs)
class goldilocks.strategies.MotifCounterStrategy(motifs, overlap=True)
census(sequence, track, **kwargs)
class goldilocks.strategies.NucleotideCounterStrategy(bases=None)
census(sequence, track, **kwargs)
class goldilocks.strategies.PositionCounterStrategy(tracks=None)
census(positions, track, **kwargs)
class goldilocks.strategies.ReferenceConsensusStrategy(tracks=None, polarity=1, reference=None)
census(sequence, track, **kwargs)