toreguru.blogg.se - Video file duplicate finder

I have a project with a catalog of currently some ten thousand video files, and the number is going to increase dramatically. With every video file I have associated semantic and descriptive information, and I want to merge duplicates to achieve better results for everyone. Now I need some sort of procedure where I index metadata in a database, and whenever a new video enters the catalog the same data is calculated and matched against the database. The problem is that the videos aren't exact duplicates. They can have different quality, may be cropped or watermarked, or have a sequel/prequel. Or they are cut off at the beginning and/or end.

Unfortunately, the better the comparison, the more CPU and memory intensive it gets, so I plan on implementing several layers of comparison that begin with a very tolerant but fast comparison (maybe video length with a tolerance of 10%) and end with the final comparison that decides whether it's really a duplicate (that would be a community vote). Since I have a community to verify the results, it suffices to deliver "good guesses" with a low miss ratio. So now my question is: what layers can you guys think of, or do you have a better approach? I don't care about the effort to create the metadata; I have enough worker machines to do that. So if it helps, I can convert the video 100 times as well.

For the first and last frame, I would resample the picture to a thumbnail size and get the average RGB values, then serialize pixel by pixel, with each pixel represented by 0 or 1 depending on whether its color is smaller or greater than the average. So I get a binary string which I can store in MySQL, do a bitwise XOR (supported by MySQL internally) and count the remaining unequal bits (also supported internally). Strictly speaking, that count is the Hamming distance of the binary strings, not the Levenshtein distance.

Development of the bitrate over time with the same VBR codec: I would transcode the video into a VBR video file with the exact same settings. Then I would look at the bitrate at certain points in time (percentage of the video completed, or absolute seconds; in the latter case we would only analyze a portion of the video). If the bitrate is greater than the average it's 1, else it's 0. We make a binary string, store it in the db and calculate the Levenshtein distance later.

Audio analysis (bitrate and decibel variation over time, just as with the bitrate of the video).

Keyframes: we would use the same source files we used for the bitrate calculations, because keyframes depend heavily on the codec and settings. Image comparison just like the first and last frame, but at keyframe positions?

Maybe let's take one or more areas/pixels inside the image and see how they develop over time.

Present the suggestions to the user for final approval.

I think I can't be the first one having this problem, but I have not had any luck finding solutions. Or am I going completely the wrong way?

This is a huge problem, so I've chosen to write a rather lengthy reply to try to decompose the problem into parts that may be easier to solve.

It is important that the comparisons be performed using the compute and time resources available: I doubt a solution that takes months to run will be very useful in a dynamic video database. And the size of the database likely makes the use of cloud computing resources infeasible.

So we really care about the local cost of each comparison in several different domains: 1) data storage, 2) compute resources, and 3) time.

One key cost to consider is that of extracting the data needed from each video for whatever comparison metrics are to be used. Once the extracted data is available, the cost of performing a comparison must be considered. Finally, the comparisons needed to match all videos to each other must be performed.

The cost of the first two steps is O(1) with respect to the number of videos: extracting the data from one video, or comparing one pair, costs the same however large the database is. The cost of the last step must be worse than O(1), potentially much worse. So our primary goal should be minimizing the cost of the last step, even if it means adding many early, simple steps.

The optimal algorithms for this process will depend greatly on the characteristics of the database, that is, the degree to which single and multiple matches exist. If 100% of the videos match some other video, then we will want to minimize the cost of a successful match. However, the more likely case is that matches will be rare, so we will want to minimize the cost of an unsuccessful match. That is to say, if there is a quick and dirty way to say "these two videos can't be matches", then we should use it first, before we even start to confirm a match.

To characterize the database, first do some sampling and hand-matching to estimate the degree of matching within the database.
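The layering discussed in this thread (a cheap length check first, then binary fingerprints compared bit by bit) could look roughly like the Python sketch below. It is only an illustration under assumptions: the names `VideoMeta`, `threshold_bits`, `hamming_distance`, `length_compatible` and `candidate_duplicate` are made up for this example, and the 10% tolerance and the bit-difference limit are placeholders to be tuned on real data. Note that XOR followed by a count of the set bits gives the Hamming distance, which ignores shifts; a Levenshtein distance would need a different routine.

```python
from dataclasses import dataclass
from typing import Sequence


def threshold_bits(samples: Sequence[float]) -> int:
    """Pack samples into an integer: one bit per sample, 1 if above the mean.

    The samples could be thumbnail pixel brightness values or bitrates
    measured at fixed points in time. Both fingerprints being compared
    must be built from the same number of samples.
    """
    mean = sum(samples) / len(samples)
    bits = 0
    for value in samples:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count differing bits: XOR, then count the 1 bits that remain."""
    return bin(a ^ b).count("1")


def length_compatible(len_a: float, len_b: float, tolerance: float = 0.10) -> bool:
    """Cheap first layer: reject pairs whose durations differ too much."""
    return abs(len_a - len_b) <= tolerance * max(len_a, len_b)


@dataclass
class VideoMeta:
    """Hypothetical per-video record as it might be stored in the database."""
    duration_s: float
    fingerprint: int


def candidate_duplicate(a: VideoMeta, b: VideoMeta, max_differing_bits: int = 1) -> bool:
    """Run the cheap rejection test first, the bit comparison only if it passes."""
    if not length_compatible(a.duration_s, b.duration_s):
        return False
    return hamming_distance(a.fingerprint, b.fingerprint) <= max_differing_bits


original = VideoMeta(100.0, threshold_bits([10, 200, 30, 220]))
reencoded = VideoMeta(108.0, threshold_bits([12, 190, 35, 40]))
unrelated = VideoMeta(130.0, threshold_bits([10, 200, 30, 220]))

print(candidate_duplicate(original, reencoded))
print(candidate_duplicate(original, unrelated))
```

Pairs that survive these layers would still only be suggestions; the final decision stays with the community vote described in the question.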