Example usage
The poorly named labtools
python package is a collection of diverse modules that might be of use to Staller Lab members and collaborators.
tools to analyze raw sequencing data
simple tools to shuffle/mutate sequences for library design
(hopefully) tools to run in-house neural net
# add the comment
# @hidden
# anywhere in any cell you want to hide
# works with custom sphinx layout
# Following code is used to style and truncate the pandas dataframe outputs for easier example viewing
def format_index(s):
return f"font-size: 8pt; max-width: 250px; text-overflow: ellipsis; overflow: hidden"
def format(s):
return ["font-size: 10pt;"] *len(s)
def display_df(df, **kwargs):
disp = df.head(10).style.applymap_index(format_index)
disp = disp.applymap_index(format_index, axis = 0)
disp = disp.apply(format)
return disp
# @hidden
Sequencing Analysis Tools
Sort Processing Example
What you need
Also, see documentation for clearer idea of what inputs should look like
data_files : list of str
path to fastq file for each sample in the sort in order of bins
bin_counts : list of int
cells per bin in order of data files
bin_values : list of int
mean or median fluorescence of the bin in order of data files
If you want to look for only perfect matches to certain sequences (see note)
design_file : str
path to a csv with one column headered as “ArrayDNA” which contains your 120 bp AD sequences as DNA
NOTE: If you want to count untransformed plasmid, please add what that tile would look like to your design file. The length does not matter as long as the sequence covers the end of the read. If you used phasing, include additional seqs for each primer possibility. For example, for pMVS219 backbone phased with 4 R primers, you should add the following:
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGATAG”
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGATA”
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGAT”
“GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGA”
NOTE: the read cutoff is 10 reads total summed across all bins This means that if a tile is not found at least 10 times combined across any of the fastq files, it will not be analysed.
If your reads differ from default (see example)
ad_preceder : str
the sequence directly preceding your AD sequence in your reads (anchor sequence)
If you have barcodes, you need the sequence directly preceding them and anteceding them (anchor sequences), otherwise ignore these. See the example with barcodes below.
bc_preceder : str
bc_anteceder : str
What you get out
a dataframe with your AD tiles as indices, normalized scores for tile at each bin, and the tile Activity value
Use the sequence directly preceding the AD/tile as an anchor sequence. Additional characters between AD/tile and preceding anchor seqeunce will not work
My preceding anchor sequence is in blue while my AD/tile sequence is in green in this example read.
The anchor sequence preceding the barcode is purple while the anchor sequence anteceding is red. Barcode length is 11 by default. (Not necessary if you don’t have barcodes)
Example read showing the default values. No need to change anything if your sequence matches the defaults.
TCCCTGCGGGCTCTACTTCATCGGCTAGCGGTTCTT…CTGCTAAATGATAAATAGATGAGGGCCCGTCAACATAGAAGGAGAGAAACATCTAAAAAAGCGATA
Imports
from labtools.adtools import sort
Initialize a sort
my_sort = sort.Sort(["../exampledata/bin1.fastq", "../exampledata/bin2.fastq", "../exampledata/bin3.fastq",
"../exampledata/bin4.fastq"], bin_counts = [100000,100000,100000,100000],
bin_values = [61,141,251,1462], design_file = "../exampledata/unique_seqs.csv")
Get activity values
activities, total_reads, reads_per_bin = my_sort.process()
Opening file bin1.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin2.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin3.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin4.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
# activity + normalized abundance in each bin
display_df(activities)
0 | 1 | 2 | 3 | Activity | |
---|---|---|---|---|---|
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT | 0.251461 | 0.288201 | 0.460339 | 0.000000 | 171.520411 |
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA | 0.097356 | 0.519277 | 0.343084 | 0.040284 | 224.165635 |
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA | 0.352021 | 0.502035 | 0.145944 | 0.000000 | 128.892233 |
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA | 0.378641 | 0.367199 | 0.254160 | 0.000000 | 138.666285 |
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG | 0.330550 | 0.278750 | 0.390700 | 0.000000 | 157.533007 |
GAAATGGCTGATGATAAAGAAGAACAAGAGAAAGATAGAGATAATGAGAATCAAGGTGAAGAAGATCCAACTTGGGCTGATTCTGGTGATCAATTTATTGCTAATTCTCAACAATTGGTT | 0.361290 | 0.420448 | 0.218262 | 0.000000 | 136.105579 |
TCTGAACCACATGTCTTGATTGAAGAATTTATTAGACCAGTTACTGAAGATGTTGGTATTAATTATACTCATCCACAGAATTTGCCAGGTGCTAATAAAGATGGTGTTTCTGTCTTCTTT | 0.315398 | 0.367040 | 0.317562 | 0.000000 | 150.700019 |
CAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAATTCTGGTTCTTTGAAAGATAAACAACCATGT | 0.393467 | 0.254385 | 0.352148 | 0.000000 | 148.258990 |
ATTATTTGGCATTTGTTGGCTAAATCTGGTTTGTCTGGTTTGTCTTCTCATCCATTTATTGATGAATTTATTCCAACTGTTAATCAAGATGATGGTATTTGTTATACTCATCCTAAGAAT | 0.566048 | 0.243975 | 0.189977 | 0.000000 | 116.613702 |
AAGAAGAGGAAGAATAAGAATCAAGGTAAGAAGAAACCAACTTGGTTTAAATCTGGTGCTCAATTTATCTTGAATGCTCAACAATTGGTTAAAGCTGCTTCTGCTTGTAAGAAATTGTTG | 0.253394 | 0.491474 | 0.255133 | 0.000000 | 148.793098 |
# the total reads for each sequence
total_reads
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT 44.0
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA 21.0
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA 49.0
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA 16.0
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG 70.0
...
GACCCAACTGAATGGTTTGATTCTGGTGCTCAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTCAATGTTTGGATGATAATTTGACTAGAGAATTGGAATCTAATGATGGTGCTTTG 23.0
TCTACTGATTCTACTCCAATGTTTGATTATGATAATTTGGAAGATAATTCTAAAGATTGGACTTCTTTGTTTGATAATGATATTCCAGTTACTACTGATGATGTTTCTTTGGCTGATAAA 24.0
TCTACTGGTCAAGTCTTGTTTGATATTGATGACTTTAGATGGTTGTTGGATCCAGATGATGAACAATTGGGTAAAGAAGCTATCTTGTCTGATCAATTTGGTAAACCAACTCCAGAGAAT 12.0
GAAGATCCAACTTCTGATTCTGCTATTCAACAATTGTGGAATCAAGGATTCTTGTTTGTTGAATCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAAT 11.0
GAAATTGATCAAATTTCTGATCCAGATAAATTGCCAGTTAATTTGGAACCATTTAGATTGGATCAATTGGAATTTACTGGTGATGATACTTCTGGTGCTGGTTTGAAATTTCAATGGGAT 13.0
Length: 200, dtype: float64
# the reads of each tile per bin
display_df(reads_per_bin)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT | 11.000000 | 13.000000 | 20.000000 | 0.000000 |
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA | 2.000000 | 11.000000 | 7.000000 | 1.000000 |
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA | 17.000000 | 25.000000 | 7.000000 | 0.000000 |
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA | 6.000000 | 6.000000 | 4.000000 | 0.000000 |
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG | 23.000000 | 20.000000 | 27.000000 | 0.000000 |
GAAATGGCTGATGATAAAGAAGAACAAGAGAAAGATAGAGATAATGAGAATCAAGGTGAAGAAGATCCAACTTGGGCTGATTCTGGTGATCAATTTATTGCTAATTCTCAACAATTGGTT | 5.000000 | 6.000000 | 3.000000 | 0.000000 |
TCTGAACCACATGTCTTGATTGAAGAATTTATTAGACCAGTTACTGAAGATGTTGGTATTAATTATACTCATCCACAGAATTTGCCAGGTGCTAATAAAGATGGTGTTTCTGTCTTCTTT | 5.000000 | 6.000000 | 5.000000 | 0.000000 |
CAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAATTCTGGTTCTTTGAAAGATAAACAACCATGT | 9.000000 | 6.000000 | 8.000000 | 0.000000 |
ATTATTTGGCATTTGTTGGCTAAATCTGGTTTGTCTGGTTTGTCTTCTCATCCATTTATTGATGAATTTATTCCAACTGTTAATCAAGATGATGGTATTTGTTATACTCATCCTAAGAAT | 9.000000 | 4.000000 | 3.000000 | 0.000000 |
AAGAAGAGGAAGAATAAGAATCAAGGTAAGAAGAAACCAACTTGGTTTAAATCTGGTGCTCAATTTATCTTGAATGCTCAACAATTGGTTAAAGCTGCTTCTGCTTGTAAGAAATTGTTG | 5.000000 | 10.000000 | 5.000000 | 0.000000 |
Locate any tiles, not just perfect matches
If you want to find sequences even if they don’t match your designed sequences, simply do not include the design file. A short (less than 120 bp) tile will probably appear with a significant number of reads. This is probably your untransformed background (cells transformed with a plasmid that did not get a tile). Most of the non-perfect matching tiles will be sequencing errors. You might find a way to map these back to their true tile. A few of the non-perfect matching tiles (likely those with a significant number of reads) will be sequencing library PCR amplification errors. An even fewer number of the non-perfect matching tiles could be mutants that arose within the actual cell, or free tiles. These will probably have a significant number of reads. Someone may put in the effort to distinguish these categories in the future.
my_sort = sort.Sort(["../exampledata/bin1.fastq", "../exampledata/bin2.fastq", "../exampledata/bin3.fastq",
"../exampledata/bin4.fastq"], bin_counts = [100000,100000,100000,100000],
bin_values = [61,141,251,1462])
activities_no_design, numreads_total, _ = my_sort.process()
Opening file bin1.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin2.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin3.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin4.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
display_df(activities_no_design)
0 | 1 | 2 | 3 | Activity | |
---|---|---|---|---|---|
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT | 0.249091 | 0.283698 | 0.467211 | 0.000000 | 172.465991 |
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA | 0.096596 | 0.512002 | 0.348777 | 0.042625 | 227.945173 |
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA | 0.351864 | 0.498670 | 0.149466 | 0.000000 | 129.292147 |
GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGA | 0.210811 | 0.528218 | 0.260971 | 0.000000 | 152.841957 |
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA | 0.377151 | 0.363465 | 0.259384 | 0.000000 | 139.360114 |
TCTGAATTGACTTCTACTTTGGGTATTTCTCATAGATTGCCACAATCTTTGACTCCATGTGTTAAGACTGGTTCTTTGCAATCTGGTGGTTTGGTTCAATCTGTTCCATTTGAAGAATTG | 0.327972 | 0.274844 | 0.397184 | 0.000000 | 158.452427 |
GAAATGGCTGATGATAAAGAAGAACAAGAGAAAGATAGAGATAATGAGAATCAAGGTGAAGAAGATCCAACTTGGGCTGATTCTGGTGATCAATTTATTGCTAATTCTCAACAATTGGTT | 0.360305 | 0.416677 | 0.223018 | 0.000000 | 136.707585 |
TCTGAACCACATGTCTTGATTGAAGAATTTATTAGACCAGTTACTGAAGATGTTGGTATTAATTATACTCATCCACAGAATTTGCCAGGTGCTAATAAAGATGGTGTTTCTGTCTTCTTT | 0.313669 | 0.362744 | 0.323586 | 0.000000 | 151.500974 |
CAATTTATCTTGAATGCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAATTCTGGTTCTTTGAAAGATAAACAACCATGT | 0.390706 | 0.251019 | 0.358275 | 0.000000 | 149.153731 |
ATTATTTGGCATTTGTTGGCTAAATCTGGTTTGTCTGGTTTGTCTTCTCATCCATTTATTGATGAATTTATTCCAACTGTTAATCAAGATGATGGTATTTGTTATACTCATCCTAAGAAT | 0.564274 | 0.241688 | 0.194038 | 0.000000 | 117.202291 |
numreads_total
GAAGATCCAACTTGGTTTGATTCTGGTTCTCAATTTATCTTAAATTCTCAACAATTGGTTGAAGCTTTGTCTTTGTGTGATGATGCTGCTGGTTCTCAAGATAGAGAAGAGAATACTAAT 44.0
GAAGCTTTGTCTTTGTGTGATGATTTGTTGGGTGATCAAGATAGAGAAGAGAATGATAATGATGGTGATTTGAAAGATAAACAACCATGTATTGCTGATTATGCTCATTTGGGTCCAGAA 21.0
GATTTGGCTGAAGATGATGAAGTTATGTGTATGGAAGATGAAGTTCAATCTATTCAACCAAATCATGAAAGACCAGATGATGGTCCAGAATTGGAACATGGTTTGGAGAATGGTGCTAAA 49.0
GGTTAATTAAGGCGCGCCACTTCTAAATAAGCGA 24.0
GGTCAAAGGAAGAGAAGGAAGATTACTCCAACTTTGGTTAATGATGAACCAGTTAGATGGCATAAGACTGGTAGAACTAAACCAGTTATGTTGTCTGGTGTTCAAAGAGGTTGTAAGAAA 16.0
...
TCTACTGATTCTACTCCAATGTTTGATTATGATAATTTGGAAGATAATTCTAAAGATTGGACTTCTTTGTTTGATAATGATATTCCAGTTACTACTGATGATGTTTCTTTGGCTGATAAA 24.0
TCTACTGGTCAAGTCTTGTTTGATATTGATGACTTTAGATGGTTGTTGGATCCAGATGATGAACAATTGGGTAAAGAAGCTATCTTGTCTGATCAATTTGGTAAACCAACTCCAGAGAAT 12.0
AATACTCCAACTCCACCATCTTTGGTTGATGGTGTTGCTGGTGATGAAGAAGCATTTGATGAGATGTTTGATCCATTCTTTGAAGAATTGGATTCTATTCCAGAAGCTGCTTTGTGATAA 14.0
GAAGATCCAACTTCTGATTCTGCTATTCAACAATTGTGGAATCAAGGATTCTTGTTTGTTGAATCTTTGTCTTTGTGTGATGATTTGTTGGGTTCTCAAGATAGAGAAGAGAATACTAAT 11.0
GAAATTGATCAAATTTCTGATCCAGATAAATTGCCAGTTAATTTGGAACCATTTAGATTGGATCAATTGGAATTTACTGGTGATGATACTTCTGGTGCTGGTTTGAAATTTCAATGGGAT 13.0
Length: 220, dtype: float64
Get data for reads that include barcodes AND tiles
Support for barcoded only data may or may not be added in the future.
activities_barcoded, total_reads, reads_per_bin = my_sort.process(barcoded = True)
Opening file bin1.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin2.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin3.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin4.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Note that you get back less tiles using this method. The reason for this is because the read must have a locatable barcode AND tile, which is less likely than having one or the other. The primary reason for doing this analysis is assessing per-transformant variation. The assumption is that unique tile-barcode pairs come from unique original transformants.
display_df(activities_barcoded)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/IPython/core/formatters.py:344, in BaseFormatter.__call__(self, obj)
342 method = get_real_method(obj, self.print_method)
343 if method is not None:
--> 344 return method()
345 return None
346 else:
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style.py:385, in Styler._repr_html_(self)
380 """
381 Hooks into Jupyter notebook rich display system, which calls _repr_html_ by
382 default if an object is returned at the end of a cell.
383 """
384 if get_option("styler.render.repr") == "html":
--> 385 return self.to_html()
386 return None
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style.py:1377, in Styler.to_html(self, buf, table_uuid, table_attributes, sparse_index, sparse_columns, bold_headers, caption, max_rows, max_columns, encoding, doctype_html, exclude_styles, **kwargs)
1374 obj.set_caption(caption)
1376 # Build HTML string..
-> 1377 html = obj._render_html(
1378 sparse_index=sparse_index,
1379 sparse_columns=sparse_columns,
1380 max_rows=max_rows,
1381 max_cols=max_columns,
1382 exclude_styles=exclude_styles,
1383 encoding=encoding or get_option("styler.render.encoding"),
1384 doctype_html=doctype_html,
1385 **kwargs,
1386 )
1388 return save_to_buffer(
1389 html, buf=buf, encoding=(encoding if buf is not None else None)
1390 )
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style_render.py:206, in StylerRenderer._render_html(self, sparse_index, sparse_columns, max_rows, max_cols, **kwargs)
194 def _render_html(
195 self,
196 sparse_index: bool,
(...)
200 **kwargs,
201 ) -> str:
202 """
203 Renders the ``Styler`` including all applied styles to HTML.
204 Generates a dict with necessary kwargs passed to jinja2 template.
205 """
--> 206 d = self._render(sparse_index, sparse_columns, max_rows, max_cols, " ")
207 d.update(kwargs)
208 return self.template_html.render(
209 **d,
210 html_table_tpl=self.template_html_table,
211 html_style_tpl=self.template_html_style,
212 )
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style_render.py:163, in StylerRenderer._render(self, sparse_index, sparse_columns, max_rows, max_cols, blank)
149 def _render(
150 self,
151 sparse_index: bool,
(...)
155 blank: str = "",
156 ):
157 """
158 Computes and applies styles and then generates the general render dicts.
159
160 Also extends the `ctx` and `ctx_index` attributes with those of concatenated
161 stylers for use within `_translate_latex`
162 """
--> 163 self._compute()
164 dxs = []
165 ctx_len = len(self.index)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style_render.py:258, in StylerRenderer._compute(self)
256 r = self
257 for func, args, kwargs in self._todo:
--> 258 r = func(self)(*args, **kwargs)
259 return r
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/io/formats/style.py:1870, in Styler._apply_index(self, func, axis, level, method, **kwargs)
1867 obj = self.index if axis == 0 else self.columns
1869 levels_ = refactor_levels(level, obj)
-> 1870 data = DataFrame(obj.to_list()).loc[:, levels_]
1872 if method == "apply":
1873 result = data.apply(func, axis=0, **kwargs)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1067, in _LocationIndexer.__getitem__(self, key)
1065 if self._is_scalar_access(key):
1066 return self.obj._get_value(*key, takeable=self._takeable)
-> 1067 return self._getitem_tuple(key)
1068 else:
1069 # we by definition only have the 0th axis
1070 axis = self.axis or 0
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1256, in _LocIndexer._getitem_tuple(self, tup)
1253 if self._multi_take_opportunity(tup):
1254 return self._multi_take(tup)
-> 1256 return self._getitem_tuple_same_dim(tup)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:924, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
921 if com.is_null_slice(key):
922 continue
--> 924 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
925 # We should never have retval.ndim < self.ndim, as that should
926 # be handled by the _getitem_lowerdim call above.
927 assert retval.ndim == self.ndim
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1301, in _LocIndexer._getitem_axis(self, key, axis)
1298 if hasattr(key, "ndim") and key.ndim > 1:
1299 raise ValueError("Cannot index with multidimensional key")
-> 1301 return self._getitem_iterable(key, axis=axis)
1303 # nested tuple slicing
1304 if is_nested_tuple(key, labels):
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1239, in _LocIndexer._getitem_iterable(self, key, axis)
1236 self._validate_key(key, axis)
1238 # A collection of keys
-> 1239 keyarr, indexer = self._get_listlike_indexer(key, axis)
1240 return self.obj._reindex_with_indexers(
1241 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
1242 )
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexing.py:1432, in _LocIndexer._get_listlike_indexer(self, key, axis)
1429 ax = self.obj._get_axis(axis)
1430 axis_name = self.obj._get_axis_name(axis)
-> 1432 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
1434 return keyarr, indexer
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
6067 else:
6068 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
6072 keyarr = self.take(indexer)
6073 if isinstance(key, Index):
6074 # GH 42790 - Preserve name from an Index
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
6128 if use_interval_msg:
6129 key = list(key)
-> 6130 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
6133 raise KeyError(f"{not_found} not in index")
KeyError: "None of [Int64Index([0], dtype='int64')] are in the [columns]"
<pandas.io.formats.style.Styler at 0x7f29a0d8dee0>
Support for custom anchor sequences
If your tiles or barcodes have a custom anchor sequence (AKA the non-variable portion of the read that is used to locate the variable portion of the read), you can specify that in the kwargs of your Sort(). This passes the arguments to the pull_AD() function used on each read to locate the sequence of interest (AKA AD or tile).
Use the sequence directly preceding the AD/tile as an anchor sequence. Additional characters between AD/tile and preceding anchor seqeunce will not work
My preceding anchor sequence is in blue while my AD/tile sequence is in green in this example read.
The anchor sequence preceding the barcode is purple while the anchor sequence anteceding is red. Barcode length is 11 by default. (Not necessary if you don’t have barcodes)
Example read
TCCCTGCGGGCTCTACTTCATCGGCTAGCGGTTCTT…CTGCTAAATGATAAATAGATGAGGGCCCGTCAACATAGAAGGAGAGAAACATCTAAAAAAGCGATA
Specify alternate values in a dictionary and pass that dictionary to Sort.process()
# these are the default values which would work for my example read above
# no input is required if they work for you
kwargs = {"ad_preceder":"GCTAGC", "bc_preceder":"GGGCCCG", "bc_anteceder":"GGAGAGAA", "bclength":11, "ad_length":120}
activities_barcoded, _, _ = my_sort.process(barcoded = True, **kwargs)
Opening file bin1.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin2.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin3.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
Opening file bin4.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
import pandas as pd
pd.set_option('display.max_colwidth', 1)
Example output for pull_AD
This might be useful for someone who wants to use pull_AD to analyze reads for a purpose outside of calculating activities.
from labtools.adtools.finder import pull_AD
read = "TCCCTGCGGGCTCTACTTCATCGGCTAGCGGTTCTTCTAAATTGAGATGTGATAATAATGCTGCTGCTCATGTTAAATTGGATTCATTTCCAGCTGGTGTTAGATTTGATACATCTGATGAAGAATTGTTGGAACATTTGGCTGCTAAATGATAAATAGATGAGGGCCCGTCAACATAGAAGGAGAGAAACATCTAAAAAAGCGATA"
pull_AD(read, kwargs)
('GGTTCTTCTAAATTGAGATGTGATAATAATGCTGCTGCTCATGTTAAATTGGATTCATTTCCAGCTGGTGTTAGATTTGATACATCTGATGAAGAATTGTTGGAACATTTGGCTGCTAAA',
'TCAACATAGAA')
Counting tiles in a fastq file
This essentially applies the pull_AD function shown above to every read in the fastq file. You can also pass the kwargs dict to it to specify custom anchor sequences.
from labtools.adtools.counter import seq_counter
seq_counter("../exampledata/mini.fastq")
Opening file mini.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 4
GGTTCTTCTAAATTGAGATGTGATAATAATGCTGCTGCTCATGTTAAATTGGATTCATTTCCAGCTGGTGTTAGATTTGATACATCTGATGAAGAATTGTTGGAACATTTGGCTGCTAAA 1
GAAGAATTGTTTTTACATTTGTCTGCTAAGATTGGTAGATCTTCTAGGAAACCACATCCATTCTTGGATGAATTTATTCATACTTTGGTTGAAGAAGATGGTATTTGTAGAACTCATCCA 3
dtype: int64
Use barcoded = True to count tile and barcode pairs
Use the design_file flag to look for specific tiles
counts = seq_counter("../exampledata/bin1.fastq", barcoded = True, design_to_use="../exampledata/unique_seqs.csv")
counts
Opening file bin1.fastq ...
Reading lines... 0 reads completed
Total parsed reads = 2,000
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[20], line 1
----> 1 counts = seq_counter("../exampledata/bin1.fastq", barcoded = True, design_to_use="../exampledata/unique_seqs.csv")
2 counts
File ~/labtools/src/labtools/adtools/counter.py:96, in seq_counter(fastq, design_to_use, barcoded, only_bcs, **kwargs)
94 design = pd.read_csv(design_to_use)
95 if barcoded:
---> 96 counts = counts.where(counts.index.droplevel(1).isin(design.ArrayDNA)).dropna()
97 else:
98 counts = counts.where(counts.index.isin(design.ArrayDNA)).dropna()
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2180, in Index.droplevel(self, level)
2177 if not isinstance(level, (tuple, list)):
2178 level = [level]
-> 2180 levnums = sorted(self._get_level_number(lev) for lev in level)[::-1]
2182 return self._drop_level_numbers(levnums)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2180, in <genexpr>(.0)
2177 if not isinstance(level, (tuple, list)):
2178 level = [level]
-> 2180 levnums = sorted(self._get_level_number(lev) for lev in level)[::-1]
2182 return self._drop_level_numbers(levnums)
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2055, in Index._get_level_number(self, level)
2054 def _get_level_number(self, level) -> int:
-> 2055 self._validate_index_level(level)
2056 return 0
File ~/bioinformatics_exploration/.venv/lib/python3.8/site-packages/pandas/core/indexes/base.py:2046, in Index._validate_index_level(self, level)
2041 raise IndexError(
2042 "Too many levels: Index has only 1 level, "
2043 f"{level} is not a valid level number"
2044 )
2045 elif level > 0:
-> 2046 raise IndexError(
2047 f"Too many levels: Index has only 1 level, not {level + 1}"
2048 )
2049 elif level != self.name:
2050 raise KeyError(
2051 f"Requested level ({level}) does not match index name ({self.name})"
2052 )
IndexError: Too many levels: Index has only 1 level, not 2
Sequence Design Tools
Imports
from labtools import shuffle
Shuffle a sequence
Create shuffles of the alphabet:
shuffles_list, names_list = shuffle.windowed_shuffle("ABCDEFGHIJKLMNOPQRSTUVWYXZ")
shuffles_list
['adcbeFGHIJKLMNOPQRSTUVWYXZ',
'AcbedfGHIJKLMNOPQRSTUVWYXZ',
'ABgcefdHIJKLMNOPQRSTUVWYXZ',
'ABCdhfegIJKLMNOPQRSTUVWYXZ',
'ABCDfigheJKLMNOPQRSTUVWYXZ',
'ABCDEfhjigKLMNOPQRSTUVWYXZ',
'ABCDEFgkhjiLMNOPQRSTUVWYXZ',
'ABCDEFGihljkMNOPQRSTUVWYXZ',
'ABCDEFGHjmkilNOPQRSTUVWYXZ',
'ABCDEFGHIknjlmOPQRSTUVWYXZ',
'ABCDEFGHIJlnomkPQRSTUVWYXZ',
'ABCDEFGHIJKlmnpoQRSTUVWYXZ',
'ABCDEFGHIJKLqnmpoRSTUVWYXZ',
'ABCDEFGHIJKLMopnqrSTUVWYXZ',
'ABCDEFGHIJKLMNqorpsTUVWYXZ',
'ABCDEFGHIJKLMNOtqsrpUVWYXZ',
'ABCDEFGHIJKLMNOPtsruqVWYXZ',
'ABCDEFGHIJKLMNOPQuvtsrWYXZ',
'ABCDEFGHIJKLMNOPQRwuvstYXZ',
'ABCDEFGHIJKLMNOPQRSvtwyuXZ',
'ABCDEFGHIJKLMNOPQRSTxvyuwZ',
'ABCDEFGHIJKLMNOPQRSTUzvywx']