OWASP File Hash Repository

From OWASP
Revision as of 14:34, 17 April 2012 by Sapao (Talk | contribs)

Jump to: navigation, search
[edit]

What is FHR?

Simply put, FHR is a repository of hashes of files. But the idea is to go beyond just keeping a list of hashes: I want the repository to indicate when the file in question is (part of) a malware or when a file is recognized as benign. Thus, anyone could see the hash of a file to see if it corresponds to a malware file or an already known good file.

Aren't there already other sources for this information?

Yes, and one of the ideas of the project is to aggregate and leverage information from already existing sources. For example, NIST has the NSRL, which provides hashes of known benign files. The problem is that NIST provides this information in a text file whose download is over 1GB in size. Other known sources are Team Cymru's MHR, SANS Institute's hash database and Virus Total. In addition to aggregating the information, one of the main goals for FHR is to allow free access to its database.

Isn't free access to a database that contains malware dangerous?

Yes, it's dangerous, but the project repository will not contain malware. The repository will only have the hashes of malware, which poses no danger.

Detecting malware using only hashes is not good strategy.

Certainly, and the project is not intended to replace the current anti-virus scanners. However, the creation of hashes is more efficient and easier than creating generic virus detection algorithms and it is a strategy which is being used as a complement to traditional antivirus products. Several commercial products include uses of cloud computing as part of their strategies. Unfortunately, the producers of these technologies do not allow queries to their hash databases. With FHR, the goal is to create a freely available database to be used by everyone.

Will the FHR be integrated into antivirus systems?

We intend to develop clients to the FHR database that can scan workstations and query FHR's database to try to identify malware. These clients will be created as a proof of concept and will be open source. It would be great if some antivirus vendors start supporting FHR, but only time will tell.

Technically, how does the FHR work?

As expected, the core of the system is its database of hashes. Today this database runs on MySQL. Around this database, we can develop several query interfaces. Some ideas of protocols for querying the FHR database are:

  • DNS
  • web
  • web services
  • JSON

The current codebase includes a DNS-based query interface.

What data are available in the database?

We currently have the a little more than 20 million files in the database. These come mainly from the NSRL and we included several PE files from Windows Vista and other common software. For each registered file, we have the following information:

  • SHA-1
  • MD5
  • source
  • date when the system saw the hash / file for the first time (not available for the files from NIST)
  • status (GOOD, MALWARE, UNKNOWN, SUSPICIOUS)
  • size
  • certainty (a percentage that indicates the degree of certainty about the status of the file).

Testing the system

To query the FHR, add the suffix .hash.sapao.net to the MD5 or SHA-1 hash (in hex format) of the file. For example:

 dig 84C0C5914FF0B825141BA2C6A9E3D6F4.hash.sapao.net

or

 dig TXT 84C0C5914FF0B825141BA2C6A9E3D6F4.hash.sapao.net

will query the database for the hash 84C0C5914FF0B825141BA2C6A9E3D6F4.

Database schema

The FHR database contains a single table, called File, described below:

mysql> show columns from File in FHR;
+-----------+------------+------+-----+---------+----------------+
| Field     | Type       | Null | Key | Default | Extra          |
+-----------+------------+------+-----+---------+----------------+
| idFile    | int(11)    | NO   | PRI | NULL    | auto_increment |
| SHA1      | char(40)   | YES  | MUL | NULL    |                |
| MD5       | char(32)   | YES  | MUL | NULL    |                |
| size      | mediumtext | YES  |     | NULL    |                |
| source    | char(10)   | YES  |     | NULL    |                |
| date      | date       | YES  |     | NULL    |                |
| status    | char(10)   | NO   |     | NULL    |                |
| certainty | float      | YES  |     | NULL    |                |
+-----------+------------+------+-----+---------+----------------+
8 rows in set (0.00 sec)

Server Implementation Details

The DNS server is implemented in Java and is based on the EagleDNS server, which uses the dnsjava library.

EagleDNS extensions

The EagleDNS server is easily extended. It is based on the concept of Zone Providers, which provide specific implementations for the backend storage of zone data. The server provides two basic providers, one for loading data from simple zone files and another for loading data from a database.

At first glance, it could seem that the database zone provider would be a perfect fit for FHR, but upon a closer examination, we quickly find out that it is not the case. The main reason is that EagleDNS uses the dnsjava Zone class to represent zone data. This implementation requires all zone data to be held in memory, which would be impossible for FHR since it will contain millions of entries, each corresponding to a DNS record.

So we had to extend EagleDNS by implementing its ZoneProvider interface. And we also needed to extend the dnsjava Zone class functionality. This created a problem since the Zone class was not implemented to be extended. This required us to change the dnsjava source code and recompile this library before being able to implement all FHRZoneProvider class.

The diffs for the dnsjava classes are available at OWASP_File_Hash_Repository/dnsjava_diffs

The FHRZoneProvider rationale

TODO.

PROJECT INFO
What does this OWASP project offer you?
RELEASE(S) INFO
What releases are available for this project?
what is this project?
Name: OWASP File Hash Repository (home page)
Purpose: The goal of this project is to build a repository of hashes of executable and source files. This repository can then be queried by clients to determine the status os of files based on their hashes. Some statuses are GOOD, MALWARE, SOURCE CHECKED, etc. This repository can consolidate several available sources (NIST, MHR, VirusTotal, etc) and provide better query capabilities.
License: Apache 2.0 License
who is working on this project?
Project Leader(s):
  • Alexandre Pupo @
how can you learn more?
Project Pamphlet: Not Yet Created
Project Presentation:
Mailing list: Mailing List Archives
Project Roadmap: View
Key Contacts
  • Contact Alexandre Pupo @ to contribute to this project
  • Contact Alexandre Pupo @ to review or sponsor this project
  • Contact the GPC to report a problem or concern about this project or to update information.
current release
Not Yet Published
last reviewed release
Not Yet Reviewed


other releases