The research laboratory, which originated at Stanford University and was founded by executive Michael Ovitz and Dr. Walter De Brouwer, tackles the problem of intellectual property crime.
SoundPatrol is a technology company originating from Stanford University that has developed a forensic artificial intelligence model (a system designed to analyze digital or multimedia evidence for investigative and legal purposes) for detecting fingerprints in audio and video.
These traces allow us to analyze musical patterns beyond exact matches, detecting derivatives, covers or remixes that use elements protected by copyright without authorization.
They work through a type of identification called neural fingerprinting, a significant advance over traditional audio fingerprinting techniques, which rely mainly on comparing exact audio fragments. Neural embeddings, by contrast, capture semantic relationships, which makes it possible to identify AI versions, remixes, and generative derivatives.
SoundPatrol is a pioneer in cutting-edge artificial intelligence-based music technology. The project originated with a group of leading academics in artificial intelligence, machine learning and cybersecurity, including Dr. Walter De Brouwer, co-founder and CEO of SoundPatrol, who founded the company together with executive Michael Ovitz.
In this context, the solution is presented as an advanced technological response dedicated to protecting sound and image against fraud and piracy. In a digital environment where the unauthorized reproduction of content is increasingly sophisticated, the platform operates continuously, 24 hours a day, 7 days a week, through an automated surveillance system capable of detecting unauthorized activities such as copyright infringement.
Also, the tool integrates multiple advanced technologies to protect and optimize musical and audiovisual content. Its core is a neural fingerprint scanner, based on artificial intelligence and large clusters of GPUs (graphics processing units), capable of tracking and comparing digital content with great precision.
Furthermore, given the expansion of artificially generated music, it incorporates a detection tool for such music based on neural networks that analyze acoustic, spectral and performance style features. The catalog is protected through neural fingerprints and watermarks, which allow unauthorized uses to be detected in real time.
Walter De Brouwer, co-founder and CEO of SoundPatrol, spoke with Billboard Colombia about his artificial intelligence laboratory.
When did you start this lab and how long did it take you to develop the final version of the audio and video identification system?
Well, we started this process two years ago, and we actually went through several phases. We first did neural identification, but the robustness was not very good, so we needed more data. We added more data, but the system still couldn’t detect certain things, like insertions, certain manipulations or pitch changes. So we decided to work with distillation and teach the model a little. We didn’t impose an explicit rule on it, but instead established one expert for melody, another for rhythm, another for harmony, and then a voice engine.
Then we began to expand the metadata, to be able to analyze it better. We take all the possible metadata for a song, remove duplicates, and then connect it to one of our partners, Music Match, which is the database of all the lyrics. And suddenly we had much better results. It was as if the “student” had begun to understand. We no longer needed to intervene so much.
Then we repeated the process, because at that time we were still using probabilities, like “the probability of copyright infringement in this tune is…”, and we applied a softmax (a mathematical operation that turns scores into probabilities), giving something like 0.8 out of 1. We also had to consult the client, because we didn’t know everything: clients are different. Major record labels are different, and not all experts think the same. For example, music publishers say “no, it’s the lyrics, it’s the composition,” while record labels say “no, it’s the melody.”
So we kept adjusting. We have a team at Sony and another at UMG. Every week we release new versions, and every month we get together to try something new. They tell us: “we want more of this”, “not so much of this”. We don’t need anything that has to do with Content ID. And so, as a whole, the system grew.
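As a rough illustration of the softmax step mentioned above, the sketch below turns two raw scores into probabilities that sum to 1. The scores and the two outcome labels are hypothetical values chosen for illustration; nothing here reflects SoundPatrol's actual model or thresholds.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for two outcomes: "infringing" and "not infringing".
# 1.386 is approximately ln(4), so the first outcome is four times as likely.
logits = [1.386, 0.0]
probs = softmax(logits)

print([round(p, 2) for p in probs])  # first value is roughly 0.8
```

A score like this is a prediction rather than a verdict, which is why it still required tuning against what each client considers an infringement.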
In simple terms, how is the neural footprint embedded in the audio or video and how long does this process take?
Basically, we start with the major label reference database, generate the embeddings, and store them. We then feed them into the fingerprint system along with all the experts. Now the system no longer gives predictions; it simply says: “bam, found it.” It also helped a lot that we switched to scanning every 6 seconds, instead of every 20 or 10.
Sometimes it gets complicated when you take a song, put it through a codec, and convert it again; that is very difficult to detect. In those cases we use 2-second fragments. Of course, that costs a lot in processing.
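The workflow described here (store reference embeddings, scan the audio in short windows, look for the closest match) can be sketched as follows. This is a minimal sketch under stated assumptions: the windows are assumed not to overlap, the embeddings are tiny made-up vectors, and cosine similarity with a 0.9 threshold is a stand-in for whatever comparison the real system uses. All function names are hypothetical.

```python
import math

def window_starts(duration_s, window_s):
    """Start times (in seconds) of back-to-back windows that fit in the track."""
    starts = []
    t = 0
    while t + window_s <= duration_s:
        starts.append(t)
        t += window_s
    return starts

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, reference, threshold=0.9):
    """Return (track_id, score) for the closest reference, or None below threshold."""
    best_id, best_score = None, -1.0
    for track_id, embedding in reference.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_id, best_score = track_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None

# Toy reference database: track id -> embedding (made-up 3-dimensional vectors).
reference = {"track_a": [1.0, 0.0, 0.0], "track_b": [0.0, 1.0, 0.0]}

match = best_match([0.9, 0.1, 0.0], reference)     # close to track_a
no_match = best_match([1.0, 1.0, 0.0], reference)  # halfway between both tracks

coarse = window_starts(60, 6)  # one-minute clip, 6-second windows
fine = window_starts(60, 2)    # same clip, 2-second windows
print(match[0], no_match, len(coarse), len(fine))
```

In this toy setup the 2-second scan produces three times as many windows as the 6-second scan for the same clip, which gives a sense of why the finer setting is more expensive to process.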
It’s quite a revolution compared to the copy control methodology of the early 2000s, right?
Yeah, well, this is about both copyright and attribution. Sometimes it is clear that something is a copyright infringement, and then it is a copyright matter. But other times it’s different, like when I put a photo of myself into a diffusion model and the result looks like me. In that case it’s attribution, not copyright infringement, because it’s not really me.
How does SoundPatrol’s neural fingerprint distinguish between legitimate transformations (e.g. remixes or extended versions) and unauthorized use, especially when tracks are heavily modified?
Yes, it’s difficult. Well, the simplest thing is rap. That’s very easy. On the other hand, EDM (Electronic Dance Music) is difficult, because there are so many stems (audio files that break down a song into its individual components). It’s as if the ring buffer overflows.
For example, Bob Dylan has five stems, but the Chainsmokers have 142 stems, because they work on digital audio workstations.
How does the platform detect AI-generated music versus a human interpretation? And what is the rate of false positives and false negatives in the real world?
Yes, for each generator we make adversarial models. We distill these adversarial models and give them to the neural fingerprint. And so you can immediately say, for example, this is audio or, you know, this is something else.
For example, yesterday I found something interesting. The machine didn’t even hear the music. It said it was a copyright infringement. It saw it in the metadata. The metadata was fictitious.
In the event of a dispute, for example: an artist claims that SoundPatrol incorrectly marked their music, what mechanisms are in place to defend or analyze the decision?
We only detect… you know, we don’t intervene in that process. That’s handled by the record companies. They don’t even share that with us.
What other clients besides Universal Music Group and Sony are interested in your lab?
Well, there are more who are interested, partly because we use the same neural fingerprinting technology in video and in esports, you know, and because there is a lot of piracy there. So we went in through sound. For example, in F1 you can remove the commentary and put in commentary in Chinese, but we still do the detection because we listen to the engines.
But this technology could have a broader scope; in esports broadcasts or the transmission of the World Cup, for instance, it could be very interesting.
And it certainly will be… the artists are losing money, but this technology can also be used as a weapon for criminal fraud, for sabotage of scientific research, electoral fraud or simply citizen misinformation, because we have deep fakes and voice cloning.
I guess my prediction is that, by the end of the year, we will see regulation. And next year we will reach that productivity plateau: now there are rules, and then we can move forward. Because now we are in limbo. And pirates thrive in limbo.
