Something that perhaps isn't much of a consideration with YouTube proper is the ability to transcode/down-convert on the fly. I reason that this is critically important to YTTV.
Transcoding into multiple bitrates is definitely important, but you only do it once and then replicate the streams out to your servers at the edge that fulfill client requests.
All of the streaming services, including YouTube, use MPEG-DASH or HTTP Live Streaming (HLS). They get a master feed via either satellite or network ingestion, then process that feed and break it down into 2-10 second segments of each of the following components (sketched as data types after the list):
1) Encryption Keys
2) Audio Tracks
3) Video tracks
4) Subtitles
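To make that concrete, here's a rough sketch of those components as data types. The names are mine for illustration, not YouTube's actual schema:

Code:
// Hypothetical model of a live channel after the segmenter has run.
interface MediaSegment {
  url: string;          // where a CDN edge serves this segment from
  startTimeMs: number;  // position on the live timeline
  durationMs: number;   // typically 2000-10000 ms (5005 ms in the manifest below)
}

interface LiveChannel {
  encryptionKeyUrls: string[];      // DRM license/key info (PlayReady, Widevine)
  audioTracks: MediaSegment[][];    // one segment list per audio track
  videoTracks: MediaSegment[][];    // one segment list per bitrate/resolution rung
  subtitleTracks: MediaSegment[][]; // closed caption segment lists
}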
When you select a channel to start streaming, the player app (JavaScript in a browser, or a streaming-box app) makes a request to the API front-end and gets a manifest file back. The manifest file contains links to APIs that generate M3U8 playlists, which in turn have links to fetch all of the smaller segments. The manifest file looks like this:
Paste2.org - Viewing Paste 8G84cPzX
Even if you aren't a programmer, you can probably skim through that XML and start to see what's going on.
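Before walking through it stanza by stanza, here's roughly what that request chain looks like from the player's side. This is a sketch with invented endpoint paths; the manifest -> playlist -> segment flow is the part that matches reality:

Code:
// Hypothetical client startup flow: manifest -> M3U8 playlist -> segments.
async function startStream(channelId: string): Promise<ArrayBuffer[]> {
  // 1) Ask the API front-end for the manifest.
  const manifest = await (await fetch(`/api/manifest?channel=${channelId}`)).text();

  // 2) The manifest links to APIs that generate M3U8 playlists; follow one.
  const playlistUrl = (manifest.match(/https:\/\/[^"<\s]+m3u8[^"<\s]*/) ?? [""])[0];
  const playlist = await (await fetch(playlistUrl)).text();

  // 3) The playlist lists the ~5 s segment URLs; fetch them into a buffer
  //    as fast as possible (real players hand these to an MSE SourceBuffer).
  const segmentUrls = playlist.split("\n").filter((l) => l && !l.startsWith("#"));
  const buffer: ArrayBuffer[] = [];
  for (const url of segmentUrls) {
    buffer.push(await (await fetch(url)).arrayBuffer());
  }
  return buffer;
}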
SegmentTimeline - This defines the start time in UTC (segmentIngestTime) and indexes the time segments that the client is able to grab. In this example most of these are 5-second segments: d="5005" is a duration of 5005 ms.
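If the timescale is 1000 ticks per second (which the 5005 ms reading implies), turning those timeline entries into per-segment start times is simple arithmetic. A sketch of the idea, not YouTube's actual client code:

Code:
// Expand DASH SegmentTimeline <S t=".." d=".." r=".."> entries into start times,
// assuming a timescale of 1000 (so 1 tick = 1 ms).
interface SEntry { t?: number; d: number; r?: number } // r = number of extra repeats

function segmentStartsMs(entries: SEntry[], timescale = 1000): number[] {
  const starts: number[] = [];
  let cursor = 0;
  for (const s of entries) {
    if (s.t !== undefined) cursor = s.t; // explicit start resets the cursor
    for (let i = 0; i <= (s.r ?? 0); i++) {
      starts.push((cursor * 1000) / timescale);
      cursor += s.d;
    }
  }
  return starts;
}

// Three 5005-tick segments: starts at 0 ms, 5005 ms, 10010 ms.
console.log(segmentStartsMs([{ t: 0, d: 5005, r: 2 }]));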
AdaptationSet - Each of these defines one group of interchangeable media streams.
The first stanza has Primary audio (Primary.5) with two DRM definitions (PlayReady and Widevine).
The second stanza is for Secondary audio.
The third stanza starts defining the video feeds available:
Code:
<Representation id="142" codecs="avc1.4d4015" width="426" height="240" startWithSAP="1" maxPlayoutRate="1" bandwidth="258000" frameRate="30">
<Representation id="143" codecs="avc1.4d401e" width="640" height="360" startWithSAP="1" maxPlayoutRate="1" bandwidth="646000" frameRate="30">
<Representation id="144" codecs="avc1.4d401f" width="854" height="480" startWithSAP="1" maxPlayoutRate="1" bandwidth="1171000" frameRate="30">
<Representation id="161" codecs="avc1.42c00b" width="256" height="144" startWithSAP="1" maxPlayoutRate="1" bandwidth="124000" frameRate="30">
<Representation id="145" codecs="avc1.4d401f" width="1280" height="720" startWithSAP="1" maxPlayoutRate="1" bandwidth="2326000" frameRate="30">
<Representation id="146" codecs="avc1.640028" width="1920" height="1080" startWithSAP="1" maxPlayoutRate="1" bandwidth="4347250" frameRate="30">
<Representation id="384" codecs="avc1.4d4020" width="1280" height="720" startWithSAP="1" maxPlayoutRate="1" bandwidth="3481000" frameRate="60">
<Representation id="385" codecs="avc1.64002a" width="1920" height="1080" startWithSAP="1" maxPlayoutRate="1" bandwidth="5791000" frameRate="60">
The fourth stanza has the closed caption definitions.
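Those codecs= strings are RFC 6381 avc1.PPCCLL codes: PP is the H.264 profile in hex (42 = Baseline, 4d = Main, 64 = High), CC is constraint flags, and LL is the level times ten (hex 1f = 31 = level 3.1). A quick decoder to show what the list above is offering:

Code:
// Decode an "avc1.PPCCLL" codec string into an H.264 profile and level.
const PROFILES: Record<number, string> = { 0x42: "Baseline", 0x4d: "Main", 0x64: "High" };

function decodeAvc1(codec: string): string {
  const hex = codec.split(".")[1];                   // e.g. "640028"
  const profile = PROFILES[parseInt(hex.slice(0, 2), 16)] ?? "unknown";
  const level = parseInt(hex.slice(4, 6), 16) / 10;  // 0x28 = 40 -> level 4.0
  return `${profile} profile, level ${level}`;
}

console.log(decodeAvc1("avc1.4d401f")); // Main, level 3.1 (the 480p/720p30 rungs)
console.log(decodeAvc1("avc1.640028")); // High, level 4   (1080p30)
console.log(decodeAvc1("avc1.64002a")); // High, level 4.2 (1080p60)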
When your client starts streaming, it grabs from the top playlist that matches whatever bitrate, resolution, or codec limitations have been placed on the client, and tries to fill its buffer with several ~5 s content segments as quickly as possible (video, audio, and captions are muxed together locally and played back "gapless"). It monitors the download rate of those ~5 second clips, and if it determines it isn't achieving enough throughput to keep the buffer full, it down-selects to an M3U8 playlist whose target bitrate (the bandwidth= attribute) matches its calculated available bandwidth.
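Stripped of the smoothing and hysteresis a real player adds, that down-select decision is conceptually just this:

Code:
// Pick the highest-bandwidth representation the measured throughput can sustain.
interface Representation { id: string; bandwidth: number } // bits per second

const throughputBps = (bytes: number, downloadMs: number): number =>
  (bytes * 8) / (downloadMs / 1000);

function selectRepresentation(reps: Representation[], measuredBps: number,
                              safetyFactor = 0.8): Representation {
  // Leave headroom so the buffer fills faster than playback drains it.
  const usable = measuredBps * safetyFactor;
  const sorted = [...reps].sort((a, b) => b.bandwidth - a.bandwidth);
  return sorted.find((r) => r.bandwidth <= usable) ?? sorted[sorted.length - 1];
}

// A 1.5 MB segment that took 4 s to download -> ~3 Mbps measured; with 20%
// headroom that lands on the 2.3 Mbps 720p30 rung (id 145) from the manifest.
const reps = [
  { id: "142", bandwidth: 258000 }, { id: "144", bandwidth: 1171000 },
  { id: "145", bandwidth: 2326000 }, { id: "146", bandwidth: 4347250 },
];
console.log(selectRepresentation(reps, throughputBps(1_500_000, 4000)).id); // "145"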
Have you determined what the typical delay (if measurable) is with "live" programming?
It's 30-40 seconds behind a side-by-side DirecTV live feed (I think this is variable by how good of an initial "burst" download you get). I don't think it's possible for that to ever improve with this architecture. On the supply side you have encoding delay to produce segment files and apply local DRM, and those files have to be distributed out to your CDN servers at the edge so that clients can request them. On the client side your client still needs to fetch at least 10-15 seconds of video to have enough data to calculate effective bandwidth and have enough content in the buffer to maintain continuous playback.
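Back-of-the-envelope, those pieces stack up to about what I measured. These are my rough estimates, not published figures:

Code:
// Hypothetical latency budget for a live stream cut into 5 s segments.
const latencyBudgetSeconds = {
  segmentAccumulation: 5, // a segment can't exist until 5 s of video has happened
  encodeAndPackage: 5,    // transcode every rung, apply DRM, write segment files
  cdnDistribution: 5,     // replicate the files out to the edge servers
  clientBuffering: 15,    // fetch ~10-15 s of segments before playback starts
};

const total = Object.values(latencyBudgetSeconds).reduce((a, b) => a + b, 0);
console.log(`~${total} s behind the source feed`); // ~30 s, in line with the 30-40 s observed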
I've said it before - when you break down the mechanics of how this all works, it's sort of amazing that live video over the Internet is even a thing.
Edit: Side note, on the manifest file you'll notice that all of the video content is being served from the googlevideo.com domain.