We describe a new algorithm for searching large data sets, by example, using an inverted trie (from reTRIEval) to analyse information content. A dictionary is built with reference or training data then used to analyse test data. This is more than mere motif matching. The entire structure of the reference data is encapsulated by the dictionary. Motifs or key-phrases are discovered by this search-by-example paradigm.
This algorithm is applied to the problem of plagiarism detection. The candidate thesis/assignment is loaded into the dictionary. Then a linear complexity search is performed on all known test data. The algorithm provides efficient detection of verbatim plagiarism with an objective percentage measure of original content.
Redundancy occurs in English language text, computer files and non-coding DNA. This redundancy is used to reverse engineering software, decrypt secret codes and identify individuals by their unique DNA fingerprint.
For DNA fingerprinting, tandem repeats are easily identified by the recurring relative location of the pattern equal to the pattern length. For example, the string: {TCATCATCATC} matches the reference pattern {CATC} at locations 1, 4 and 7 with the relative locations 3, 3. Here follows a sequence of relative locations of the string {CAT} extracted from a 100,000 nucleotide sequence of non-coding human DNA. Notice the characteristic tandem repeat represented by “3 3 3 3 …” used to identify individual people an diagnose genetic disease.
293 27 9 22 17 20 6 20 19 3 3 3 3 3 3 3 3 3 3 6 3 3 142 6 46 80 80 44 70 39 129 58 144 58 62 69 93 128 89 4 6 40 121 5 24 44 13 21 139 110 33 60 37 11 14 61 73 41 25 12 36 11 325 59 46 3 65 10 15 84 68 54 195 182 33 67 15 50 127 46 51 13 120 115 109 68 9 6 20 72 96 41 79 24 79 67 9 51 24 3 43 8 37 4 10 101 326 11 22 25 23 25 28 44 50 19 55 65 108 33 243 4 8 136 66 45 141 19 51 101 14 18 32 21 56 10 35 37 12 16 8 126 7 361 150 13 76 3 3 62 4 61 39 34 47 3 3 163 70
We can deduce information structure without comparing large sequences. The relative location of repeats is sufficient to match large genetic structures. In the same way verbatim plagiarism can be efficiently detected by comparing the relative location of spaces and common letters such as ‘e’ and ‘t’. This technique is tolerant to the insertion and deletion of text after copying the original because most to the relative locations of patterns remain undisturbed.
Maximum entropy for random text increases linearly with the size of the string. We illustrate the high level of redundancy in English text by analysing the information content of Aesop’s Fables (http://www.gutenberg.org/). The total number of characters is 171580. The 3 most frequently occurring characters are: ‘t’ 12082, space 34390 and ‘e’ 16727. For strings of length 7, the maximum entropy of 29.76 is twice the observed entropy. The following table shows the observed and maximum entropy of all strings of length 1 to 7 in Aesop’s Fables.
length Observed Maximum
1 4.25 4.25
2 7.65 8.50
3 10.26 12.75
4 12.20 17.01
5 13.66 21.26
6 14.75 25.51
7 15.55 29.76