Inspiration

Sometimes content has no clear author. If you have no leads on who wrote a piece of circulating content, you have no choice but to work with what you have: the content itself. Linguistic fingerprinting comes from the idea that every human has a unique writing style. By combining various analysis techniques, we can derive attributes from bodies of text that potentially let you connect two separate texts to the same author. The same technology also has the potential to detect LLM usage.

What it does

It takes text as input, asks the user for any delimiters that may surround the text, and then calculates various properties of the submitted data. The attributes include n-gram factorial and combination-style sequencing to detect patterns of word usage, word-usage frequency, unique spelling-mistake tracking, words per sentence, characters per word, and classification of the text against categories of interest. These are combined into a series of SHA-256 checksums with a special prefix and a specific sorted order, which can be stored in the collaboration-enabled web interface or written to a file via the command-line interface for further comparative analysis against other linguistic fingerprints.
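
To make that concrete, here is a minimal sketch of the checksum idea. It covers only two of the attributes and uses made-up prefixes and helper names; the real tool computes more metrics and has its own prefix scheme and sort order.

```python
# Minimal sketch of the fingerprint idea (hypothetical names, not the real code):
# each attribute is reduced to a canonical string, hashed with SHA-256, tagged
# with a prefix identifying the attribute, and emitted in sorted order.
import hashlib
import re


def attribute_metrics(text: str) -> dict[str, str]:
    """Reduce a couple of the attributes described above to canonical strings."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    words_per_sentence = len(words) / max(len(sentences), 1)
    chars_per_word = sum(len(w) for w in words) / max(len(words), 1)
    return {
        "WPS": f"{words_per_sentence:.2f}",
        "CPW": f"{chars_per_word:.2f}",
        # the real tool adds n-gram factorial patterns, spelling mistakes, categories, ...
    }


def fingerprint(text: str) -> list[str]:
    """Return sorted, prefixed SHA-256 checksums, one per attribute."""
    checksums = []
    for prefix, value in attribute_metrics(text).items():
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        checksums.append(f"{prefix}:{digest}")
    return sorted(checksums)


if __name__ == "__main__":
    print("\n".join(fingerprint("The quick brown fox. It jumps over the lazy dog!")))
```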

How we built it

Written mostly in Python. Uses Flask for the web interface, plus a few other library dependencies. The linguistic algorithms are programmed in pure Python, based on existing concepts such as n-gram analysis, with extra steps added for higher word coverage (such as n-gram factorials and word-placement shifts that enumerate additional word-pattern combinations). For LLM-usage detection, we used a quantized GGUF-format Llama Guard 2 model (released by Meta for moderation purposes) running locally through llama.cpp inference.
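
The local inference currently works as a plain shell call into the compiled engine (see the roadmap below). A rough sketch of that approach, assuming a llama.cpp build whose CLI accepts -m/-p/-n and a hypothetical model path:

```python
# Rough sketch of the shell-out approach; binary name and flags vary by
# llama.cpp version, and the model path here is hypothetical.
import subprocess

LLAMA_CPP_BIN = "./llama.cpp/main"                    # assumption: older-style binary name
MODEL_PATH = "models/llama-guard-2.Q4_K_M.gguf"       # hypothetical quantized model file


def moderate(text: str, max_tokens: int = 64) -> str:
    """Run the moderation model locally and return its raw output."""
    result = subprocess.run(
        [LLAMA_CPP_BIN, "-m", MODEL_PATH, "-p", text, "-n", str(max_tokens)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```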

Challenges we ran into

Making the analysis faster and multi-threaded, working within the time constraints, and adding local LLM inference.
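
One direction for the multi-threading work, sketched with concurrent.futures rather than the project's actual code, is to run the independent analysis passes concurrently:

```python
# A sketch of parallelizing independent analysis passes; not the project's code.
from concurrent.futures import ThreadPoolExecutor


def run_passes(text: str, passes) -> dict[str, object]:
    """Run each analysis pass, given as (name, callable) pairs, concurrently over the same text."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, text) for name, fn in passes}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical usage with pass functions like word-frequency or n-gram counting:
# results = run_passes(text, [("freq", word_frequencies), ("ngrams", ngram_factorial)])
```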

Accomplishments that we're proud of

  • Proud that this is a unique form of combined linguistic and comparative analysis.
  • Proud that it can be scaled: low-tier hardware yields less accuracy, while hardware capable of running more combinations of the n-gram factorial should yield better similarity results (in theory; this still needs to be validated).
  • Proud that the LLM inference is local. Not many apps do that, but quantization allows models to run on cheaper hardware, at the cost of a few extra dependencies. This keeps document content private to the server.
  • Proud of the n-gram feature. N-gram analysis is a known linguistic technique, but the factorial algorithm this script builds on top of it is completely my own. Essentially, if you pick a 5-gram factorial, it runs 5-gram, 4-gram, and so on down to 2-gram passes; for each N-gram pass, the ordering of those N words is recorded by frequency and the window is shifted through N starting offsets, covering every chunk of N words that appears more than once in the body of text (see the sketch after this list).
  • Proud that the simple design allows for collaboration across multiple users.
  • Lastly, and what we're most proud of: being able to deliver all these features in such a short time, with integrity.
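
A minimal sketch of that shifted-window n-gram factorial idea (our reading of it, not the exact script; the real bookkeeping differs):

```python
# Sketch of the n-gram "factorial" pass: for a chosen N, run N-gram down to
# 2-gram passes, shift the window through each starting offset, and keep the
# ordered word chunks seen more than once.
from collections import Counter


def ngram_factorial(words: list[str], n: int = 5) -> dict[int, Counter]:
    """Return, per gram size k (n down to 2), counts of repeated ordered k-word chunks."""
    results: dict[int, Counter] = {}
    for k in range(n, 1, -1):
        counts: Counter = Counter()
        for offset in range(k):                        # the "shifted N combinations"
            for i in range(offset, len(words) - k + 1, k):
                counts[tuple(words[i:i + k])] += 1
        # keep only the chunks that appear more than once in the body of text
        results[k] = Counter({gram: c for gram, c in counts.items() if c > 1})
    return results


# Hypothetical usage: ngram_factorial("some body of text to analyze".split(), n=3)
```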

What we learned

  • That spelling mistakes and the category of interest the content associates with are, on average, the two most common types of matches encountered when running linguistic fingerprint analysis. The spelling matches are mostly caused by the absence of any mistakes: every mistake-free document yields the same checksum, because an empty plaintext goes into the formula (illustrated in the sketch after this list).
  • That classical techniques and ordinary compute operations can still help with classification; you don't necessarily need an MLP-style model to perform decent analysis and classification.
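
A tiny illustration of the empty-plaintext collision mentioned in the first bullet; the SPELL prefix is made up for the example:

```python
# Why "no spelling mistakes" matches so often: if two documents each produce an
# empty mistake list, the attribute hashes the same empty plaintext and the
# checksums collide by construction. (Prefix shown is illustrative only.)
import hashlib

doc_a_mistakes: list[str] = []   # clean document
doc_b_mistakes: list[str] = []   # another clean document

checksum_a = "SPELL:" + hashlib.sha256("".join(doc_a_mistakes).encode()).hexdigest()
checksum_b = "SPELL:" + hashlib.sha256("".join(doc_b_mistakes).encode()).hexdigest()

assert checksum_a == checksum_b  # identical: both hashed an empty plaintext
```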

What's next for Linguana

  • Look into whether, with enough generated data, we can fingerprint LLM models using this application
  • Look into more linguistic techniques to add to the framework; it's designed to be easily expandable. More techniques mean more opportunities to detect potential similarities between two fingerprints.
  • More speed optimizations, so the limits on how much data is enumerated and processed can be raised for better results
  • Add a database based on SQLite, perhaps with toggles for how data is stored. Currently everything lives in memory, so if the server process dies unexpectedly, you lose your data. I can still see some people wanting a bare version of this tool without many moving parts, so a toggle for data-storage options is nice to have, similar to the "save fingerprint output to a file" feature.
  • Make it so that the categories can also be edited, either in a config file or on the web interface
  • Make the main script use the llama.cpp Python bindings rather than a direct shell call to the compiled inference engine (a rough sketch follows this list)
  • Add OCR integration to pick apart non-text data like images and PDFs
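
For the bindings item, a rough sketch of what the switch could look like, assuming the llama-cpp-python package and a hypothetical model path:

```python
# What the bindings switch might look like (llama-cpp-python package assumed);
# this would replace the subprocess call shown earlier.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-guard-2.Q4_K_M.gguf")  # hypothetical path


def moderate(text: str, max_tokens: int = 64) -> str:
    """Run the moderation prompt in-process instead of shelling out."""
    output = llm(text, max_tokens=max_tokens)
    return output["choices"][0]["text"].strip()
```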

Built With

  • python
  • flask
  • llama.cpp