logging - Displaying access log analysis
I am doing some work analyzing the access logs from a Catalyst web application. The data comes from the load balancers in front of the web farm and amounts to approximately 35 GB per day. It is stored in an HDFS file system and I use MapReduce (which is great) to crunch the numbers.
The purpose of the analysis is to try to establish a usage profile (which actions are used most, what the average response time for each action is, whether the response was served from a backend or from cache) for capacity planning, for optimization, and to set thresholds for monitoring systems. Traditional tools like Analog will give me the most requested URL or most used browser, but none of that is useful to me. I do not need to know that /controller/foo?id=1984 is the most popular URL; I need to know the hit rate and response time for all hits to /controller/foo, so I can see whether there is room for optimization or caching and try to estimate what might happen if hits for this action suddenly doubled.
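To illustrate the grouping I mean, here is a minimal sketch of collapsing a request URL to its action. It assumes the action is simply the first two path segments (controller plus action), which may not match every route in a real Catalyst application; `action_for` and its `depth` parameter are hypothetical names.

```python
from urllib.parse import urlsplit

def action_for(url: str, depth: int = 2) -> str:
    """Collapse a request URL to its action.

    Drops the query string and keeps only the first `depth` path
    segments. Assumption: controller + action are the first two segments.
    """
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    return "/" + "/".join(segments[:depth])

print(action_for("/controller/foo?id=1984"))    # /controller/foo
print(action_for("/controller/foo/1984?x=1"))   # /controller/foo
```

Doing this in the map step means every variant of a URL is counted against the same key.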
I can easily break the data down via MapReduce into requests per action per period. The problem is displaying it in a digestible form and picking out important trends or anomalies. My output is of the form:
('2009-12-08T08:30', '/ctrl_a/action_a') (2440, 895)
('2009-12-08T08:30', '/ctrl_a/action_b') (23 9, 15, 49)
('2009-12-08T08:30', '/ctrl_b/action_a') (2167, 0)
('2009-12-08T08:30', '/ctrl_b/action_b') (1713, 1184)
('2009-12-08T08:31', '/ctrl_a/action_a') (2317, 790)
('2009-12-08T08:31', '/ctrl_a/action_b') (2254, 1497)
('2009-12-08T08:31', '/ctrl_b/action_a') (2112, 0)
('2009-12-08T08:31', '/ctrl_b/action_b') (1644, 1098)
That is, the keys are time periods and the values are tuples of (action, hits, cache hits) per time period. (I do not have to stick with this; it is just what I have so far.)
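As a rough sketch of how output shaped like this could be turned into one time series per action, the snippet below pivots a few rows and derives a cache hit ratio. The row layout `((period, action), (hits, cache_hits))` and the name `pivot` are assumptions for illustration, not a fixed format.

```python
from collections import defaultdict

# A few rows shaped like the MapReduce output:
# ((period, action), (hits, cache_hits)). The layout is an assumption.
rows = [
    (("2009-12-08T08:30", "/ctrl_a/action_a"), (2440, 895)),
    (("2009-12-08T08:30", "/ctrl_b/action_a"), (2167, 0)),
    (("2009-12-08T08:31", "/ctrl_a/action_a"), (2317, 790)),
    (("2009-12-08T08:31", "/ctrl_b/action_a"), (2112, 0)),
]

def pivot(rows):
    """Return {action: [(period, hits, cache_hit_ratio), ...]} sorted by period."""
    series = defaultdict(list)
    for (period, action), (hits, cache_hits) in rows:
        ratio = cache_hits / hits if hits else 0.0
        series[action].append((period, hits, ratio))
    for points in series.values():
        points.sort()  # ISO-style timestamps sort chronologically as strings
    return dict(series)

series = pivot(rows)
for action, points in series.items():
    for period, hits, ratio in points:
        print(f"{action} {period} hits={hits} cache_hit_ratio={ratio:.2f}")
```

Having one series per action makes it possible to compare each action against its own history instead of against the other 250.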
There are about 250 actions. They could be combined into a smaller number of groups, but plotting the number of requests (or response time, etc.) for each action over time on the same graph probably will not work. First, it would be far too noisy; second, the absolute numbers do not matter much: a 100 requests/minute increase for a frequently used, lightweight, cacheable response is much less important than a 100 requests/minute increase for a rarely used but expensive (perhaps it hits the DB), uncacheable response. On the same graph, we would not be able to see the change in requests for the little-used action.
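To make the point about absolute numbers concrete, here is a small sketch that compares two consecutive periods by the requests that actually reach the backend (hits minus cache hits) and by relative change. The function name `backend_change` and all figures are hypothetical.

```python
def backend_change(prev, curr):
    """Compare one action across two consecutive periods.

    prev and curr are (hits, cache_hits) tuples. Returns the absolute
    change in uncached requests and the relative change (None when the
    previous period had no uncached requests).
    """
    prev_backend = prev[0] - prev[1]
    curr_backend = curr[0] - curr[1]
    delta = curr_backend - prev_backend
    rel = delta / prev_backend if prev_backend else None
    return delta, rel

# Hypothetical numbers: both actions gain 100 requests/minute.
busy_cached = backend_change((2400, 2300), (2500, 2395))  # mostly served from cache
rare_uncached = backend_change((20, 0), (120, 0))         # every request reaches the backend

print(busy_cached)    # (5, 0.05)
print(rare_uncached)  # (100, 5.0)
```

The same 100 requests/minute increase adds 5 backend requests in one case and 100 in the other, which is the difference a shared graph of raw hit counts would hide.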
A static report is not very good either: a huge table of numbers is difficult to digest. If I aggregate by the hour, I may miss important minute-to-minute changes.
Any suggestions? How are you handling this problem? I think one way would be to highlight significant changes in the request rate or response time per action. A rolling average and standard deviation might show this, but could I do something better?
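For reference, this is roughly what I mean by the rolling average and standard deviation approach: a minimal sketch that flags points far from the mean of the preceding window. The window size, the threshold, and the name `flag_anomalies` are arbitrary assumptions, and the sample counts are made up.

```python
from collections import deque
from statistics import mean, pstdev

def flag_anomalies(values, window=30, threshold=3.0):
    """Yield (index, value, z) for points more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    history = deque(maxlen=window)
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0:  # a perfectly flat window gives no usable scale
                z = (value - mu) / sigma
                if abs(z) > threshold:
                    yield i, value, z
        history.append(value)

# Hypothetical per-minute hit counts for one action, with a spike at index 8.
hits = [100, 102, 98, 101, 99, 100, 103, 97, 160, 101]
flagged = list(flag_anomalies(hits, window=5, threshold=3.0))
print(flagged)
```

One weakness of this sketch is that a flagged spike still enters the window and inflates the standard deviation for the following points, so later anomalies may be masked.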
What other metrics or statistics could I generate?