Abstract
System logs are a valuable source of information for analyzing and diagnosing system behavior.
The size of computing systems and the number of their components are constantly increasing.
The volume of generated system logs grows in proportion to this increase.
Storing system logs for large computing systems requires a large amount of storage capacity.
Sensitive data within system logs raises serious concerns about sharing and publishing them.
Using anonymization methods to cleanse the sensitive data before publication reduces the usability of the anonymized system logs for further analysis.
Beyond a certain level of anonymization, the cleansed system logs lose their semantics and remain useful only for some statistical analysis.
In this work, we address this trade-off between anonymization and the usefulness of anonymized system logs.
The goal is to guarantee full anonymization of system logs while using the minimum required storage and keeping the cleansed system logs usable for general statistical analysis.
To achieve this goal, we (1) replace all variables in each log entry with defined constant values, (2) map each log entry to a hash-key via a collision-resistant hash function, (3) calculate the frequency of each hash-key, and (4) optimize the hash-keys based on their frequency of appearance.
Additionally, non-informative hash-keys are eliminated based on their frequency.
Preliminary results from analyzing Taurus system logs with the proposed method show up to a 95% reduction in required storage capacity, while the precision of statistical analysis remains unchanged and full anonymity is guaranteed.
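
The four steps of the method can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: the masking rules in MASKS, the use of SHA-256 as the collision-resistant hash function, the reading of "optimize" as assigning the smallest integer codes to the most frequent hash-keys, and the max_share threshold used to decide which hash-keys count as non-informative.

```python
import hashlib
import re
from collections import Counter

# Hypothetical masking rules: the abstract does not say which variables
# are replaced or which constants are used.
MASKS = [
    (re.compile(r"user=\S+"), "user=<USER>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def mask_variables(entry):
    """Step 1: replace variable parts of a log entry with constants."""
    for pattern, constant in MASKS:
        entry = pattern.sub(constant, entry)
    return entry


def hash_entry(entry):
    """Step 2: map the masked entry to a hash-key (SHA-256 is an assumption)."""
    masked = mask_variables(entry)
    return hashlib.sha256(masked.encode("utf-8")).hexdigest()


def encode_logs(entries, max_share=1.0):
    """Steps 3 and 4: count hash-keys, then give frequent keys small codes.

    Hash-keys whose share of all entries exceeds max_share are treated as
    non-informative and dropped; this criterion is an assumption.
    """
    hashes = [hash_entry(e) for e in entries]
    freq = Counter(hashes)
    total = len(hashes)
    # most_common() orders keys from most to least frequent.
    kept = [h for h, c in freq.most_common() if c / total <= max_share]
    codes = {h: i for i, h in enumerate(kept)}
    encoded = [codes[h] for h in hashes if h in codes]
    return encoded, codes, freq


if __name__ == "__main__":
    sample = [
        "Accepted login for user=alice from 10.0.0.1 port 22",
        "Accepted login for user=bob from 10.0.0.2 port 2222",
        "Disk error at 0x1f",
    ]
    encoded, codes, freq = encode_logs(sample)
    print(encoded)
```

In this sketch the two login entries collapse to the same hash-key once their variables are masked, so only the short integer codes and a single code table need to be stored. Statistical analysis can then operate on the code frequencies without access to the original text.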