Redis + Ruby + Your Code = Code Zauker!

When Google Code was shut down, I have the lucky to read this article on how it worked. The article give you a

Code Zauker image ispired by Don Zauker Daitarn3

bunch of Google-Go code. Because Go-language is too small and unusual for my brain, I took the idea and reimplemented it using Ruby and Redis.

And because my nickname is Daitangio, I decided to call the project like the Daitarn3 antagoinst: the Code Zauker![1]

The beta gem is available here

Why Code Zauker

Google killed GoogleCode but also Google Desktop. In my daily job I need to scan a lot of java sources for reference, and eclipse sometimes is very slow. Windows Search is even slower. Sometimes I need small inputs, sometimes I need simple lookup.

Unix "locate" is very handy to find out files by name, so why was not possible to get a similar tool for my java/ruby/python code?

Even best, I work with sensible (closed source) code, for banking and financial institution. It is illegal to push it on the Internet, so I need a local-only way of searching inside them.

Setup: Redis and Trigrams

Russ Cox article on Google code showed me a very easy way of implementing code searching.  The idea is to split a file in "trigrams", small 3-letter words and to store a flat association trigram -> {files containing it}

for building an index.

In the meantime I was studing redis, which is a very fast in memory database, so I thinked it would fit as a perfect backend.

Redis is good because when you must search for a word (like CodeZauker) you must do a set intersection between all the trigrams of the word and Redis can do it in a very easy and fast way.

So the first beta of CodeZauker is able to indexing and then searching using trigrams.  The Redis index is quite small (keep reading for the details...).

For instance, to search "CodeZauker" you need to search all files which have the following trigrams (AND condition):

["Cod",  "ode", "deZ", "eZa", "Zau", "auk", "uke", "ker"]

Why trigrams

Code is often scattered in a lot of small files (Java is a classic example of this trend, but also ruby/Python/Perl have a similar behavior). When you search something like a function name, you are better to find out a set of candidate files first, or your hard disk heads will burn :) English words are often between 3 and 4 letters long: we can require a search term must be at least 3-character long.

We can split every file content  in a set of "trigrams". Trigrams are better then bi-grams and 4-grams because of the following considerations.

Suppose the text is composed of a subset of the  keyboard characters: we can estimate about 50 characters, for sake of simplicity (C language started using  funny chars  like ~ & *  % / [ ] {  }!  ... :-)

A bi-gram is a space of 50^2 =2500 terms. They are not so much, and the index will be too squeezed.  Every C or Java file will end having the same bigrams.

A 4-gram is about 50^4=6,250,000 terms, and it is a good approach but: (1) you cannot search terms like "if(" or "end" (2) the index will be very huge and difficult to keep in memory. Remember we must manage thousand of  file, not just something small like the Apple product catalog :)

Worst., when a x-gram is long you need more time to parse a document, because you will end up with a vast set space.

A trigram-set has a space of about 125,000 terms, so it is a good compromise.

A Trigram index will squeeze the information kept in a file, like a MD5 computation does. Because we are building a reverse association like

trigram ->  { set of files containing it}

even if a trigram is repeated often, we have only one instance of the file in the set. This will improve memory efficency.

Last, if the trigram index outlined above is kept all in memory, you will have a blazing fast search engine.

Redis

We chose Redis because it is fast on set operations, and give us a very nice interface. It also offer transactions if we need them.

First implementation

The first implementation of Code Zauker uses some clever tricks and has some limitations:
  1. The 3-space trigram is ignored "   ", because it will lead too much collisions
  2. Trigrams are collected in memory in a small ruby set, and then sent to redis for avoiding overloading redis with a lot of repeated call for the same trigram.
  3. The indexer relays on Unix command xargs for concurrent spawn. This solution is very cheap and effective, although not portable under Windows.
I will release soon improvements to code Zauker: in the meantime, give me your feedback!

How to use it

The ruby gems contains two commands:
  1. czindexer The indexer will index the files or directory given as parameters
  2. czsearch Will search on the index.

Zauker dixit: stats

Hibernate 3.2 GA sources are composed of about 1090 files. The total trigrams are less then 30.000, a strong cohesion space.

 

 

[1] In Italy it was called Don Zauker, while in japan it sounds "Zauser". I retain the "k" because it sounds better!

Comment by The real reason Git rocks | Gioorgi.com on 2012-02-11 00:15:53

[...] Darcs. Then I fall in love with Mercurial.  I stuble on git because of my personal projects like Code Zauker and Org-Mode Parser [...]

Comment by Erlang vs Ruby: Er Zauker | Gioorgi.com on 2012-10-10 14:39:09

[...] Erlang vs Ruby: Er Zauker July 5, 2012By Giovanni GiorgiNegli ultimi mesi ho sviluppato un motore di ricerca per il codice, chiamato code zauker. [...]