Steal this human-machine annotation startup idea

Recently I’ve been quite busy with various business ideas. As we know, time, focus and energy are scarce resources, and I will never be able to implement all of them successfully. Therefore, I’m sharing one of them with you that might have the potential and value to turn into a viable business.

Internet search is pretty much about searching text. Most of the algorithms we have created are successful mainly for content that can be processed in the linguistic domain. Therefore, many applications that search images or sound, for example, rely on textual interpretations of the media resource in question: what text is present on the same page, what text is used in links to the resource, and what text is extracted from the images or sound by humans or computers (such as the geolocation of where a picture was taken, tags, etc.). This is the easy approach.

Now, I know there are a few projects that try to outsource the task of describing images and audio in textual form to their users. The BBC, for example, has a project where they put their radio programs online and let people annotate the audio with descriptions of its various segments, to enable much more accurate navigation and search among the programs. A similar navigational experience can be seen on the TED conference site for videos: in the video player, many of the talks are split into seekable chunks with textual cues on what a certain part of the speech might include. This tremendously improves the viewing experience.


There are similar projects for images, where people use tags or other ways to describe what they see. Flickr, in a sense, guides people to do this for themselves in a less structured way, to gain insight into the relationships among the vast amounts of images that they or other people might have in a similar context.

If you think of the Flickr or BBC example, much of the analysis is left to the viewer who perceives the content. I believe there is a cognitive threshold for participating in such annotation activity, unless you get some machine guidance that steers your attention to manageable chunks. I also imagine that having computers do such annotation completely automatically is a really hard computational problem and still out of our reach. Therefore, we have to join machines and people in a symbiosis that can solve the problem much more quickly, and with a lower threshold, than either can alone.

Let’s take an audio podcast where, let’s say, 3 people discuss a certain topic. It’s 90 minutes long. You listen through it. It’s incredibly hard to find the exact spot where someone said something interesting that you would like to refer to afterwards. You might remember who said it and in what context, but you have no idea at what point in time it actually happened. As a result, seeking to the exact spot is hard and linking to it is even harder (as far as I know, there are only a few services where you can link to an exact spot in music or video streams).

Now, imagine a computer program that can identify, from the audio levels, patterns and phase, where certain boundaries might reside. The result would be to split the stream into blocks. You would end up with blocks of different people speaking, but the computer doesn’t have much of a chance of identifying who was actually speaking in which parts. Such algorithms already exist, see for example here or here. You would combine this data with audience responses on who is speaking at certain points in time.
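
To make the boundary idea concrete, here is a minimal sketch of level-based splitting. It assumes the audio has already been reduced to one loudness value per frame, and it only looks at levels (not patterns or phase). The function name, the threshold and the minimum gap are all made up for illustration; this is not one of the algorithms referred to above.

```python
from typing import List, Tuple


def split_into_blocks(
    levels: List[float],
    threshold: float = 0.1,  # assumed loudness cutoff, would need tuning
    min_gap: int = 3,        # assumed number of quiet frames that ends a block
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of loud blocks.

    A block ends once at least `min_gap` consecutive frames fall below
    `threshold`. Shorter pauses stay inside the block.
    """
    blocks: List[Tuple[int, int]] = []
    start = None  # first frame of the block being built, if any
    quiet = 0     # consecutive quiet frames seen inside that block
    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                # Close the block at the first frame of the quiet run.
                blocks.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Stream ended mid-block: drop any trailing quiet frames.
        blocks.append((start, len(levels) - quiet))
    return blocks


levels = [0.0, 0.5, 0.6, 0.0, 0.5, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0]
print(split_into_blocks(levels))
```

With the sample levels, the short one-frame pause stays inside the first block, while the three-frame pause separates it from the second. Each block could then be shown to listeners with the question "who is speaking here?".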

Then, you would use speech-to-text analysis to get the transcript. This would of course include a lot of errors and be too inaccurate to crystallize what is going on in each part. Therefore, you would algorithmically extract the most salient themes and topics for each part as keywords that might describe the content. Afterwards, you would ask people to use those keywords to recall what the main topics were. I suppose this would give one enough cognitive support to recall what was going on in each part.
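
One simple way to pick keywords per part is a TF-IDF style score: words that are frequent in one block but rare in the others rank highest. The sketch below is only an illustration of that idea. It assumes the speech-to-text step has already produced one transcript string per block, and the tiny stopword list and the function name are placeholders of my own.

```python
import math
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "it", "that", "this", "we", "you", "i"}


def salient_keywords(transcripts: List[str], top_n: int = 3) -> List[List[str]]:
    """Return up to `top_n` keywords for each block transcript."""
    tokenized = [
        [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
        for text in transcripts
    ]
    # Number of blocks each word appears in.
    doc_freq: Counter = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    n_blocks = len(tokenized)
    result = []
    for words in tokenized:
        counts = Counter(words)
        # Term frequency times a smoothed inverse document frequency.
        scores = {
            w: c * (math.log((1 + n_blocks) / (1 + doc_freq[w])) + 1.0)
            for w, c in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_n])
    return result


blocks = [
    "The budget, the budget and taxes",
    "The weather: rain, rain and more rain",
]
print(salient_keywords(blocks, top_n=1))
```

Since the score depends only on word counts, a misrecognized word here and there should not change the top keywords much, which is the point of using keywords instead of the raw transcript.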

In conclusion, you would use a computer to set the boundaries for podcasts and extract the keywords (all tedious work for humans), and use humans to finally correctly describe the content (all tedious work for computers). This could be done for video and images too, with alternative strategies.
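
The human half could be as simple as a majority vote. As a rough sketch, assume each listener answers "who is speaking?" for a block, and a block only gets a label when enough people answered and more than half of them agree. The function name, the data shape and the vote limits are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional


def label_blocks(
    votes: Dict[int, List[str]],
    min_votes: int = 2,  # assumed minimum number of answers per block
) -> Dict[int, Optional[str]]:
    """Map each block id to the agreed speaker name, or None if undecided."""
    labels: Dict[int, Optional[str]] = {}
    for block_id, names in votes.items():
        if len(names) < min_votes:
            labels[block_id] = None  # too few answers to trust
            continue
        top_name, top_count = Counter(names).most_common(1)[0]
        # Require a strict majority of all answers for this block.
        labels[block_id] = top_name if top_count * 2 > len(names) else None
    return labels


votes = {
    0: ["Anna", "Anna", "Ben"],  # clear majority
    1: ["Ben"],                  # only one answer
    2: ["Anna", "Ben"],          # tie
}
print(label_blocks(votes))
```

Blocks that stay undecided would simply be shown to more listeners.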

So, what’s the business idea? You take this approach and turn it into a service that other service providers can tap into to annotate their content with increased precision and a lower threshold for contribution. It could be network-based and include APIs, so that other service providers can use it to make the most of their content in cooperation with their users. I would guess this approach would make it so simple for the viewer or listener to annotate the content that it would double both the participation in and the quality of results of such activity. The service would provide open standards for describing content in a way that is usable in other contexts (let’s say, your music player). It would behave like CDDB does for music track listings based on checksums, but this time for their exact contents.
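
The lookup side of such a service could look roughly like this: annotations are stored under a checksum of the media, so any player that has the same file can fetch them. This is an in-memory sketch only. The class and field names are hypothetical, and the plain SHA-256 of the file bytes is a stand-in of my own choosing.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    start_s: float  # where the described part begins, in seconds
    end_s: float    # where it ends, in seconds
    text: str       # the human-written description


class AnnotationStore:
    """Hypothetical in-memory store keyed by a checksum of the media bytes."""

    def __init__(self) -> None:
        self._by_checksum: Dict[str, List[Annotation]] = defaultdict(list)

    @staticmethod
    def checksum(media: bytes) -> str:
        # Assumption: exact-byte hash. Re-encoded copies would not match.
        return hashlib.sha256(media).hexdigest()

    def add(self, media: bytes, annotation: Annotation) -> str:
        key = self.checksum(media)
        self._by_checksum[key].append(annotation)
        return key

    def lookup(self, media: bytes) -> List[Annotation]:
        found = self._by_checksum.get(self.checksum(media), [])
        return sorted(found, key=lambda a: a.start_s)


store = AnnotationStore()
podcast = b"fake podcast bytes"
store.add(podcast, Annotation(1200.0, 1320.0, "Discussion about open standards"))
store.add(podcast, Annotation(60.0, 180.0, "Introductions"))
for a in store.lookup(podcast):
    print(a.start_s, a.end_s, a.text)
```

An exact checksum breaks as soon as the same show is re-encoded or trimmed, so a real service would probably need some kind of audio fingerprint that survives such changes.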

Eventually, you would make video and audio increasingly searchable and useful. It combines computers with humans to make the job easier for both.

Steal this startup idea, or let me know if it already exists. I believe it would be really useful.


  • Tarmo Toikkanen

    I’d like to point out that any algorithmic solution needs to be cheaper than $0.75/minute, because that’s the rate at which you can get a human to transcribe a podcast with basically no errors (eg.

    Also, there are services that search podcasts and vodcasts, do speech-to-text, and then let you search for spoken content in the media. Examples are PodZinger, BlinkX and Pluggd, and I’m sure there are many others.

    Also, there are experimental search engines that allow you to search for images based on doodles or sketches, so image search doesn’t have to be only textual.

  • Eric Madsen

    Something not unlike what you have described is included in the new version of Adobe Premiere as part of their CS4 suite. It is a very powerful tool that transcribes the dialogue in video clips into text, which can then be searched, allowing editors/directors/producers to quickly find clips. What’s next for software, indexing meaning?