Cloud Based Improved File Handling and Duplication Removal Using MD5

Data de-duplication technology usually identifies redundant data quickly and correctly by using a file checksum technique. A checksum can determine whether redundant data is present. However, false positives can occur, because two different files may produce the same checksum. To avoid false positives, a new chunk must be compared with the chunks of data that have already been stored. To reduce the time needed to exclude false positives, current research extracts a checksum from the file data. In the proposed system, the record for each target file stores multiple attributes such as user id, filename, size, extension, checksum and date-time. Whenever a user uploads a file, the system first calculates its checksum and cross-verifies it against the checksum data stored in the database. If the file already exists, the system updates the existing entry; otherwise it makes a new entry in the database. The database will be stored in the Azure cloud, and the application will connect to the cloud server via the internet. Data de-duplication plays an important role in reducing storage consumption, keeping storage affordable to manage in today’s explosive data growth. The main goals of this project are to maximally reduce the amount of duplicates in one type of NoSQL database, namely the key-value store; to maximally increase the process performance such that the backup window is only marginally affected; and to design with horizontal scaling in mind such that the system can run competitively on a cloud platform. Because the project files and the database file will be stored in the Azure cloud, the project will be accessed in a web browser through an Azure link.