Exporting

Goldilocks provides functions for exporting metadata for all censused regions, or for only the filtered regions returned by a query. The examples below follow on from the basic usage instructions earlier in the documentation.

Census Data

For a given sample, basic metadata can be exported for every region that included sequence data from that sample. The exported columns are as follows:

Key                     Value
id                      A unique id assigned to the region by Goldilocks
track1                  The value for the region as calculated by the strategy used. By default, if no list of tracks is specified when the strategy is created, there is a single track named ‘default’. This is the case for most ‘basic’ strategies.
[track2 ... trackN]     Optional further fields appear for additional tracks; each column header is the name of its track. For example, a k-mer counting strategy would have one column for each k-mer specified to the strategy.
chr                     The chromosome the region appeared on (as found in the input data)
pos_start               The 1-indexed base of the sequence where the region begins (inclusive)
pos_end                 The 1-indexed base of the sequence where the region ends (inclusive)
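
Because pos_start and pos_end are 1-indexed and inclusive, they need a small adjustment before they can be used to slice a 0-indexed Python string. The sketch below is illustrative only: region_to_slice is a hypothetical helper, not part of Goldilocks, and the sequence is an example chosen for the demonstration.

```python
def region_to_slice(pos_start, pos_end):
    """Convert 1-indexed inclusive coordinates to a 0-indexed Python slice.

    Hypothetical helper for illustration; it is not a Goldilocks function.
    """
    # Python slices are 0-indexed and exclude the end position, so only
    # the start needs shifting down by one.
    return slice(pos_start - 1, pos_end)


# Example sequence, used only for illustration.
sequence = "NANANANANA"

window = sequence[region_to_slice(1, 3)]
print(window)  # the first three bases of the sequence
```

A region of length 3 starting at base 1 therefore covers string indices 0, 1 and 2.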

Using the my_sample data:

...
g.export_meta("my_sample", sep="\t")

id      default chr     pos_start       pos_end
0       2       2       1       3
1       1       2       2       4
2       2       2       3       5
3       1       2       4       6
4       2       2       5       7
5       1       2       6       8
6       2       2       7       9
7       1       2       8       10
8       0       X       1       3
9       0       X       2       4
10      0       X       3       5
11      0       X       4       6
12      0       X       5       7
13      0       X       6       8
14      0       X       7       9
15      0       X       8       10
16      0       X       9       11
17      0       X       10      12
18      0       X       11      13
19      0       X       12      14
20      1       X       13      15
21      0       one     1       3
22      0       one     2       4
23      0       one     3       5
24      1       one     4       6
25      1       one     5       7
26      1       one     6       8
27      0       one     7       9
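
The tab-separated output can be read back for further analysis with Python's standard csv module. The following is a minimal sketch, assuming the exported text has already been captured as a string (only three of the rows above are reproduced here). read_regions is a hypothetical helper and not part of Goldilocks, and the single track column is assumed to be named ‘default’.

```python
import csv
import io

# A few rows of the export shown above, held as a tab-separated string.
exported = (
    "id\tdefault\tchr\tpos_start\tpos_end\n"
    "0\t2\t2\t1\t3\n"
    "1\t1\t2\t2\t4\n"
    "8\t0\tX\t1\t3\n"
)


def read_regions(text, sep="\t", track="default"):
    """Parse exported region metadata into a list of dictionaries.

    Hypothetical helper for illustration; it is not a Goldilocks function.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    regions = []
    for row in reader:
        regions.append({
            "id": int(row["id"]),
            # Parsed as float on the assumption that some strategies
            # may report non-integer values.
            track: float(row[track]),
            # Kept as a string: names such as "X" or "one" are not numeric.
            "chr": row["chr"],
            "pos_start": int(row["pos_start"]),
            "pos_end": int(row["pos_end"]),
        })
    return regions


regions = read_regions(exported)
for region in regions:
    print(region["id"], region["chr"], region["pos_start"], region["pos_end"])
```

Keeping chr as a string avoids errors on non-numeric chromosome names such as X.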

FASTA

Any sorting or filtering operation on censused regions returns a new Goldilocks object, which provides a function to output the filtered sequence data in FASTA format.

Following on from the example introduced earlier, the example below shows the subsequences of my_sample in FASTA format, ordered by their appearance in the filtered candidates list, from the highest number of ‘N’ bases to the lowest.

...
candidates = g.query("max", group="my_sample")
candidates.export_fasta("my_sample")

>my_sample|Chr2|Pos1:3
NAN
>my_sample|Chr2|Pos3:5
NAN
>my_sample|Chr2|Pos5:7
NAN
>my_sample|Chr2|Pos7:9
NAN
>my_sample|Chr2|Pos2:4
ANA
>my_sample|Chr2|Pos4:6
ANA
>my_sample|Chr2|Pos6:8
ANA
>my_sample|Chr2|Pos8:10
ANA
>my_sample|ChrX|Pos13:15
CAN
>my_sample|Chrone|Pos4:6
CAN
>my_sample|Chrone|Pos5:7
ANC
>my_sample|Chrone|Pos6:8
NCA
>my_sample|ChrX|Pos1:3
GAT
>my_sample|ChrX|Pos2:4
ATT
>my_sample|ChrX|Pos3:5
TTA
>my_sample|ChrX|Pos4:6
TAC
>my_sample|ChrX|Pos5:7
ACA
>my_sample|ChrX|Pos6:8
CAG
>my_sample|ChrX|Pos7:9
AGA
>my_sample|ChrX|Pos8:10
GAT
>my_sample|ChrX|Pos9:11
ATT
>my_sample|ChrX|Pos10:12
TTA
>my_sample|ChrX|Pos11:13
TAC
>my_sample|ChrX|Pos12:14
ACA
>my_sample|Chrone|Pos1:3
CAT
>my_sample|Chrone|Pos2:4
ATC
>my_sample|Chrone|Pos3:5
TCA
>my_sample|Chrone|Pos7:9
CAT
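
Each FASTA header above encodes the sample name, chromosome and region coordinates, so downstream scripts can recover them with a regular expression. The sketch below assumes the header layout inferred from this output (>sample|Chr<name>|Pos<start>:<end>) and one sequence line per record. parse_header and read_records are hypothetical helpers, not part of Goldilocks.

```python
import re

# Layout inferred from the example output: >sample|Chr<name>|Pos<start>:<end>
HEADER_PATTERN = re.compile(
    r"^>(?P<sample>[^|]+)\|Chr(?P<chrom>[^|]+)\|Pos(?P<start>\d+):(?P<end>\d+)$"
)


def parse_header(line):
    """Split a header line into (sample, chromosome, start, end)."""
    match = HEADER_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognised header: " + line)
    return (
        match.group("sample"),
        match.group("chrom"),
        int(match.group("start")),
        int(match.group("end")),
    )


def read_records(lines):
    """Pair each header with the single sequence line that follows it."""
    records = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = parse_header(line)
        elif header is not None:
            records.append(header + (line,))
            header = None
    return records


# Two records copied from the example output.
sample_output = [
    ">my_sample|Chr2|Pos1:3",
    "NAN",
    ">my_sample|Chrone|Pos4:6",
    "CAN",
]

for record in read_records(sample_output):
    print(record)
```

The chromosome name is everything between "Chr" and the next "|", so a chromosome named "one" yields the header field "Chrone".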