Category Archives: analyzer

What is the logic behind the "order of applying filters" in Apache Lucene

I have implemented my own analyzer in Apache Lucene for a specific purpose. Several filters are applied before a term is indexed. I assumed the order in which the filters are applied would not matter, but it seems that it does. For example:

analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        // Tokenizer: emits character n-grams of length 3 to 10
        Tokenizer source = new NGramTokenizer(factory, 3, 10);
        // Filter chain: each filter wraps the previous stream
        TokenStream filter = new NewlineFilter(source);
        filter = new LowerCaseFilter(filter);
        filter = new UsefulGrams(getVersion(), filter, usefulGramSet);
        filter = new EmptySpaceFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
};

My tokenizer generates the grams, then newlines are stripped and everything is lowercased. After that, I want only the grams I consider "useful" to end up in the index, so the next filter eliminates the useless grams. Finally, the last filter removes grams that consist entirely of spaces.

With my data set, this order generates 316 indexed terms. But if I change the order of the filters:

analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;
        // Tokenizer: emits character n-grams of length 3 to 10
        Tokenizer source = new NGramTokenizer(factory, 3, 10);
        // Same filters as before, but UsefulGrams now runs first and LowerCaseFilter last
        TokenStream filter = new UsefulGrams(getVersion(), source, usefulGramSet);
        filter = new NewlineFilter(filter);
        filter = new EmptySpaceFilter(filter);
        filter = new LowerCaseFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
};

This produces 350 indexed terms. Note that the first filter has to wrap the source Tokenizer (source), while each subsequent filter wraps the previous TokenStream (filter). If I pass source to every filter, I get warnings related to "addsuppression".

My question is: in what order should these filters be applied? I want all of them applied (everything lowercased, only the terms I choose, no empty grams, no newlines). I did not expect the order to change the result, but apparently it does.
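To illustrate why the order can matter, here is a minimal plain-Java sketch that does not use Lucene at all. It assumes that UsefulGrams does a case-sensitive lookup in usefulGramSet, which the question does not confirm. The class FilterOrderDemo and its methods keepUseful and lowerCase are hypothetical stand-ins, not Lucene API. The point is only that a filter that rewrites tokens and a filter that drops tokens based on their text do not generally commute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterOrderDemo {

    // Hypothetical stand-in for UsefulGrams: keeps only tokens found in the set.
    // Assumption: the lookup is case-sensitive.
    public static List<String> keepUseful(List<String> tokens, Set<String> useful) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (useful.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Stand-in for a lowercasing filter: rewrites every token, never drops one.
    public static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> grams = List.of("Foo", "foo", "BAR", "baz");
        Set<String> useful = Set.of("foo", "bar");

        // Order A: lowercase first, so "Foo" and "BAR" match the set afterwards.
        List<String> lowerFirst = keepUseful(lowerCase(grams), useful);
        // Order B: filter first, so "Foo" and "BAR" are dropped before lowercasing.
        List<String> keepFirst = lowerCase(keepUseful(grams, useful));

        System.out.println("lowercase, then keep: " + lowerFirst); // [foo, foo, bar]
        System.out.println("keep, then lowercase: " + keepFirst);  // [foo]
    }
}
```

Which direction the count moves depends on what the set contains: if it held "Foo" instead of "foo", lowercasing first would drop tokens that filtering first keeps. The same reasoning applies to any filter that changes the token text (such as stripping newlines) placed before or after a filter that compares the text against a set.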

How to analyze httpd (apache webserver) logs in CentOS 7

I have configured a static IP on my server, which runs CentOS 7 with httpd as the web server. Now I need to analyze the following:

  1. Total number of queries per day (or in total)
  2. Total requests sent and served
  3. Unique visitor details
  4. List of queries received (status, etc.)
  5. Hosts/OS/browsers that send the queries
  6. Errors, etc.
  7. Results should be exportable to CSV or a similar format so that I can import them into Excel.

Can anyone suggest a log analyzer that fulfills the above requirements?
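As a rough sketch of the CSV requirement, the Java snippet below converts one access-log line into a CSV row. It assumes the log uses Apache's "combined" format (host, ident, user, time, request, status, bytes, referer, user agent), which is likely but not guaranteed for a stock httpd install; check the LogFormat and CustomLog directives in your configuration. The class name AccessLogToCsv and its methods are hypothetical, and the pattern does not handle escaped quotes inside the request or user-agent fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogToCsv {

    // Assumed layout (combined format):
    // host ident user [time] "request" status bytes "referer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Wrap a value in double quotes and double any embedded quotes.
    static String csvField(String s) {
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    // Returns a CSV row, or null if the line does not match the assumed format.
    public static String toCsvRow(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            fields.add(csvField(m.group(i)));
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // Made-up sample line for illustration only.
        String sample = "192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "
            + "\"GET /index.html HTTP/1.1\" 200 2326 \"-\" \"Mozilla/5.0\"";
        System.out.println("host,time,request,status,bytes,referer,user_agent");
        System.out.println(toCsvRow(sample));
    }
}
```

Once the rows are in a spreadsheet, per-day totals, unique hosts, and error counts (items 1 to 6) can be derived by grouping on the time, host, and status columns.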