Created: 2014-05-18 19:46
Updated: 2014-05-18 20:24


The program reads a text file from stdin and produces a concordance of all the words in the text. Test inputs are provided in the directory test_docs.


  • Ruby (developed and tested on Ruby 2.1.1)


To run the program, use the following command:

ruby test_client.rb


The algorithm can be broken down into the following parts.

  • Extract sentences from the given text, using a period (dot) followed by a single white space and a capital letter as the sentence delimiter.
  • Extract words from each sentence by stripping surrounding punctuation such as commas, exclamation marks, quotation marks and asterisks. Punctuation inside a word is preserved, e.g. index-of, it's.
  • The algorithm also recognizes some commonly used Latin abbreviations such as e.g., etc., a.m. and p.m. If a word matches a known abbreviation, its leading and trailing punctuation is not removed. The abbreviations used in the program are taken from this Wikipedia page.
  • Finally, each word is added to a result hash that stores its count and line of occurrence.
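
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in test_client.rb: the names ABBREVIATIONS, split_sentences, clean_word and concordance are hypothetical, the abbreviation list is a small made-up subset, and the sketch assumes that words are lowercased and that "line of occurrence" means the number of the sentence in which a word appears.

```ruby
# Hypothetical subset of abbreviations; the real program uses a longer list.
ABBREVIATIONS = %w[e.g. i.e. etc. a.m. p.m.].freeze

# Split where a period is followed by one space and a capital letter.
# The lookbehind/lookahead keep the period and the capital letter in place.
def split_sentences(text)
  text.strip.split(/(?<=\.) (?=[A-Z])/)
end

# Strip leading/trailing punctuation unless the token is a known abbreviation.
# Punctuation inside the word (index-of, it's) is left untouched.
def clean_word(token)
  return token if ABBREVIATIONS.include?(token.downcase)
  token.gsub(/\A[^[:alnum:]]+|[^[:alnum:]]+\z/, '')
end

# Build the result hash: word => { count, sentence numbers (1-based) }.
def concordance(text)
  result = Hash.new { |hash, word| hash[word] = { count: 0, sentences: [] } }
  split_sentences(text).each_with_index do |sentence, index|
    sentence.split.each do |token|
      word = clean_word(token).downcase # lowercasing is an assumption
      next if word.empty?
      result[word][:count] += 1
      result[word][:sentences] << index + 1
    end
  end
  result
end

sample = "The cat sat. The dog sat at 5 p.m. today, really!"
p concordance(sample)["sat"] # count 2, seen in sentences 1 and 2
```

To apply the sketch to text piped in on stdin, concordance($stdin.read) could be called instead of using the sample string. Note that an abbreviation followed directly by a capitalized word would be mistaken for a sentence boundary in this simplified version.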


The program makes certain assumptions about the structure of sentences and usage of words.

  • The program assumes that a sentence always begins with a capital letter and ends with a period.
  • The program ignores abbreviations that span multiple words separated by spaces, e.g. pro tem.
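
A short sketch of what these assumptions mean in practice. The constant SENTENCE_BOUNDARY and the helper sentences_in are hypothetical and only mirror the delimiter described in this document (period, one space, capital letter); the program's actual pattern may differ.

```ruby
# Assumed delimiter: a period, one space, then a capital letter.
SENTENCE_BOUNDARY = /(?<=\.) (?=[A-Z])/

def sentences_in(text)
  text.split(SENTENCE_BOUNDARY)
end

# A question mark is not treated as a sentence end, so this stays one sentence.
p sentences_in("Is it late? It is 9 p.m. already.").length
# => 1

# "pro tem." is not recognized as one unit: it yields two separate tokens.
p sentences_in("He served as chairman pro tem. Nobody objected.").first.split.last(2)
# => ["pro", "tem."]
```

Text that ends sentences with other punctuation, or that relies on multi-word abbreviations, will therefore be grouped and counted differently from what a reader might expect.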
