Quick notes on barcode distances
We briefly discussed the distance of two random barcodes.
Conclusion: A barcode of 15 nt is not long enough; a barcode of 20 nt should be enough, and we can very safely combine barcodes that are within 2 nt of each other.
Below are some quick calculations for the expected distances.
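Assume each base occurs with probability p=25%, so two random barcodes match at any given position with probability p and differ with probability 1-p. For barcodes of length [len], the distance [k] (the number of mismatched positions) then follows a binomial distribution.
Therefore, mean distance,
mean = len*(1-p)
standard deviation of distance,
sd = sqrt(len*p*(1-p))
for a particular distance k=i
Probability(dist = i) = len!/(len-i)!/i! * (1-p)^i * p^(len-i)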
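As a rough check of the formula above, here is a minimal Python sketch that evaluates it, taking 1-p as the per-position mismatch probability exactly as in the probability expression. The function names `distance_pmf` and `distance_mean_sd` are made up for illustration.

```python
from math import comb, sqrt

def distance_pmf(length, k, p_match=0.25):
    """P(distance == k) between two random barcodes of the given length."""
    q = 1.0 - p_match  # per-position mismatch probability
    return comb(length, k) * q**k * p_match**(length - k)

def distance_mean_sd(length, p_match=0.25):
    """Mean and standard deviation of the number of mismatched positions."""
    q = 1.0 - p_match
    return length * q, sqrt(length * p_match * q)

for length in (15, 20):
    mean, sd = distance_mean_sd(length)
    # Probability that two random barcodes are at distance 0, 1 or 2
    close = sum(distance_pmf(length, k) for k in range(3))
    print(f"len={length}: mean={mean:.2f} sd={sd:.2f} P(dist<=2)={close:.3e}")
```

Summing `distance_pmf` over k = 0, 1, 2 gives the chance that a single random pair falls within the 2 nt merging radius.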
See below for a summary of these probabilities for barcode lengths from 14 to 22,
Number of pairwise comparisons
Total = N*(N-1)/2, N is the total number of barcodes (let's use N=1 million)
Multiplying Probability by Total gives the expected number of barcode pairs at each distance:
In reality, because we have another barcode for TS, we have ~100,000 barcodes to be distinguished. The counts are as follows (100-fold lower than in the table above, because the number of pairs scales with N^2)
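The expected counts can be recomputed with a short Python sketch like the one below. It assumes uniform base frequencies (p=0.25) and counts pairs at distance 0, 1 or 2; the helper name `expected_close_pairs` is hypothetical.

```python
from math import comb

def expected_close_pairs(n_barcodes, length, max_dist=2, p_match=0.25):
    """Expected number of barcode pairs with distance <= max_dist."""
    q = 1.0 - p_match  # per-position mismatch probability
    p_close = sum(
        comb(length, k) * q**k * p_match**(length - k)
        for k in range(max_dist + 1)
    )
    n_pairs = n_barcodes * (n_barcodes - 1) // 2  # Total = N*(N-1)/2
    return n_pairs * p_close

# One row per barcode length, for N = 100,000 barcodes
for length in range(14, 23):
    print(length, round(expected_close_pairs(100_000, length), 1))
```

Calling it with `n_barcodes=1_000_000` instead reproduces the N=1 million case, which comes out about 100 times larger.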
This last table is the most relevant one to look at.
Therefore, it seems that a 15 nt barcode is not enough, a 20 nt barcode should be enough, and we can safely combine barcodes within 2 nt.
We can confirm the result with a simple simulation, for instance with len=15,
which is in accordance with the calculations above.
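The original simulation is not reproduced here, so the following is only one way it could be done: draw random barcode pairs with equal base frequencies and tabulate their distances. The function name `simulate_distances`, the number of sampled pairs and the seed are all arbitrary choices.

```python
import random
from collections import Counter

def simulate_distances(length, n_pairs, seed=0):
    """Tabulate distances between randomly drawn barcode pairs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pairs):
        a = rng.choices("ACGT", k=length)
        b = rng.choices("ACGT", k=length)
        # Distance = number of positions where the two barcodes differ
        counts[sum(x != y for x, y in zip(a, b))] += 1
    return counts

counts = simulate_distances(15, 50_000)
mean = sum(d * c for d, c in counts.items()) / sum(counts.values())
print("simulated mean distance:", round(mean, 2))
for d in sorted(counts):
    print(d, counts[d] / 50_000)
```

The simulated frequencies can be compared directly with the binomial probabilities for len=15; the very small distances are rare enough that many more pairs would be needed to see them.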
Now suppose the bases are not equally frequent; take 27%/27%/23%/23% as an extreme case. For 15 bp,
For 20 bp,
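For unequal base frequencies, the only change to the binomial calculation is the per-position match probability, which becomes the sum of the squared base frequencies. A minimal sketch, assuming positions are independent and both barcodes share the same composition (the function names are illustrative only):

```python
from math import comb

def match_probability(freqs):
    """Chance that two independently drawn bases are identical."""
    return sum(f * f for f in freqs)

def prob_within(length, max_dist, p_match):
    """P(distance <= max_dist) for two random barcodes."""
    q = 1.0 - p_match  # per-position mismatch probability
    return sum(
        comb(length, k) * q**k * p_match**(length - k)
        for k in range(max_dist + 1)
    )

skewed = match_probability([0.27, 0.27, 0.23, 0.23])
print("match probability:", round(skewed, 4))
for length in (15, 20):
    uniform = prob_within(length, 2, 0.25)
    biased = prob_within(length, 2, skewed)
    print(f"len={length}: uniform={uniform:.3e} skewed={biased:.3e} "
          f"ratio={biased / uniform:.3f}")
```

Which base gets which fraction does not matter here, since only the sum of squares enters the calculation.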
I think the profiles are not that different as long as the nucleotide fractions are reasonable, i.e. at least of the same order of magnitude. The above calculation is quite rough...
Cool. Two concerns I had:
1) The bases aren't exactly 25/25/25/25%, which increases the likelihood of closer barcodes. In a real dataset, this increases p only to 0.252.
2) More importantly, there are kmer sequences within the barcodes that re-occur more often than expected by chance. For any length kmer, the frequencies of each kmer sequence ought to be Poisson Distributed. However, the actual kmer frequencies are over-dispersed (Fano factor of 148 for 4mers; this isn't as bad as it sounds--the mean occurrences of each 4mer is 5,000). This means that certain 4mer sequences are occurring more often than others, which will increase the proximity of barcodes to one another.
The most common sequences seem to be repeats of the same base:
GGTT
TTTC
GGGT
TGGT
GTTT
TGTT
TTGT
Anyways, these are reasons why proximity might still be a little more severe than expected, but not appreciably so.
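The over-dispersion check described in this comment could be sketched roughly as follows. How the Fano factor was actually computed is not stated, so this version assumes overlapping k-mers, a population variance taken over all 4^k possible k-mers (including those never observed), and a hypothetical function name `kmer_fano_factor`.

```python
from collections import Counter
from itertools import product

def kmer_fano_factor(barcodes, k=4):
    """Variance-to-mean ratio of k-mer counts across all 4**k k-mers."""
    counts = Counter()
    for bc in barcodes:
        # Count every overlapping k-mer in each barcode
        for i in range(len(bc) - k + 1):
            counts[bc[i:i + k]] += 1
    # Include k-mers that never occur; k-mers with non-ACGT letters are ignored
    values = [counts.get("".join(kmer), 0) for kmer in product("ACGT", repeat=k)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / mean

# Toy input: every barcode is the same 4-mer, so one k-mer takes all counts
print(kmer_fano_factor(["AAAA"] * 256))
```

For Poisson-like counts the ratio should be close to 1, so values far above 1 indicate that some k-mers recur more often than chance would predict.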
Thank you for the reply. We can assume a case of 27/27/23/23 and see how the distance changes (see the last paragraph); I think the result won't change that much. But the kmers are somewhat problematic~ Hopefully it will work out, given that the sequencing error rate is much lower now.