@title[Efficient Video Processing]
R S Nikhil Krishna @rsnk96
Lokesh Kumar T @tlokeshkumar
Follow the presentation @ https://goo.gl/KM2GoR
@transition[none] @snap[north]
@snapend
- 2009: Youtube announces 1080p, 30 fps support
- 2015: Youtube announces 8k (4320p), 60 fps support
- 2016: AMD announces 16k (8640p), 240 fps VR support
AMD announces 16k support: https://www.tweaktown.com/news/53163/amd-radeon-pro-graphics-card-powers-16k-display-15-360-8640/index.html
+++ @transition[fade]
\begin{align} \small{\text{Demand}} & = \tiny{\frac{(7680 \times 4320) \times 120 \ \text{[2018]} }{(1280 \times 720) \times 60 \ \text{[2009]} }} & = & 72\text{x} \\ \small{\text{Supply}} \tiny{\textrm{(HDD ~100 USD)}} & = \tiny{\frac{3 \ \textrm{TB} \quad \text{[2018]} }{500 \ \textrm{GB} \quad \text{[2009]} } } & = & 6\text{x} \\ \small{\text{Supply}} \tiny{\textrm{(CPU ~300 USD)}} & = \tiny{\frac{3.7\ \textrm{GHz} \times 16\ \textrm{Threads}\ \text{[2018]} }{2.9 \ \textrm{GHz} \times 8\ \textrm{Threads} \ \text{[2009]} }} & = & 2.55\text{x} \\ \end{align}
Note:
- This is not the full story, codecs compress this
- Programmer's dilemma
@title[Computer Vision]
+++?image=https://d.ibtimes.co.uk/en/full/1469739/vision-age-ultron.jpg @ul
- @fa[desktop white] @color[white](Vision ?)
- @fa[times fa-3x white] @ulend
Note:
- First thought when video processing comes to your head: FFmpeg
+++ @title[FFmpeg]
+++
- The Cross-Platform Multimedia Swiss Army Knife
- Used by VLC, Youtube (controversial), etc.
- Efficiently uses your hardware
![normal-cv-htop](https://preview.ibb.co/nCYV2z/normal_cv_htop.png)![ffmpeg-htop](https://image.ibb.co/kb0zvK/ffmpeg_htop.png)
- Supports most modern day codecs (a one-line sketch below)
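As a minimal illustration (the file names and CRF value are assumptions, not the talk's pipeline), transcoding to a modern codec such as H.265/HEVC is a single FFmpeg invocation, callable straight from Python:

import subprocess as sp

# Re-encode the video stream with libx265 (HEVC); CRF 28 trades a little quality for a much smaller file
sp.run(['ffmpeg', '-i', 'input.mp4', '-c:v', 'libx265', '-crf', '28', 'output_h265.mp4'])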
Note:
- Youtube controversy: FFmpeg's LGPL/GPL licensing doesn't force users to publicly announce that they use it, and Youtube never has. FFmpeg contributors introduced a (non-fatal) bug in its processing pipeline, and the same bug showed up in Youtube's transcoded videos, strongly suggesting Youtube uses FFmpeg internally
+++
- Adding a filter to FFmpeg is non-trivial
- Lots of callback functions to define in C: init(), uninit(), query_formats(), config_props(), filter_frame()
- @size[1](What if we want to do more than just adding a filter?)
Source: https://github.com/FFmpeg/FFmpeg/blob/master/doc/writing_filters.txt
Note:
- Any industry that deals in delivering videos to clients needs more than basic filters.
- FFmpeg is accessible only through the terminal ==> pipe data from numpy to FFmpeg from within Python
+++
Note:
- Best Part: No need to write it in C
- Next Slide: Codecs intro
+++
https://github.com/Netflix/vmaf
+++?image=https://preview.ibb.co/dhLF3e/18_10_04_1020_477.png
Note:
- with its default backend
+++?image=https://preview.ibb.co/k7JHkK/h264.png
+++ @transition[none]
+++
+++
Setting up the pipe for a 420x360 input in the format BGR,BGR,BGR
import subprocess as sp

command = ['ffmpeg',
           '-f', 'rawvideo', '-vcodec', 'rawvideo',
           '-s', '420x360',      # size of one frame
           '-pix_fmt', 'bgr24',
           '-r', '24',           # frames per second
           '-i', '-',            # the input comes from a pipe
           '-an',                # tells FFmpeg not to expect any audio
           '-vcodec', 'libx265',
           'my_output_videofile.mp4']
pipe = sp.Popen(command, stdin=sp.PIPE, stderr=sp.PIPE)
+++
Now, to write frames,
pipe = sp.Popen(command, stdin=sp.PIPE, stderr=sp.PIPE)
while cap.isOpened():                      # cap is a cv2.VideoCapture opened earlier
    ret, frame = cap.read()
    if not ret:
        break
    # Do some stuff to frame
    pipe.stdin.write(frame.tostring())     # push the processed frame's raw BGR bytes into FFmpeg
Read Zulko's (MoviePy) blog for more.
Source: http://zulko.github.io/blog/2013/09/27/read-and-write-video-frames-in-python-using-ffmpeg/
Note:
- sp is subprocess lib
- What these 3 - 4 lines do: Let you access SotA codecs from within Python
+++
@title[Parallelism in Video Processing]
+++
+++?image=https://media.giphy.com/media/pzAU8uEKFcRAsTjdKQ/giphy.gif @ul
- @fa[times fa-4x black] @ulend
Note:
- FFmpeg speeds up processing by parallelizing the work within each frame, rather than across frames. Parallelizing across frames is the more fundamental approach: each frame can be processed independently.
- Video compression is generally inter-frame, so we store some key frames and note how successive frames differ from the previous ones.
+++?image=https://media.giphy.com/media/2lh9grOpTCXnaP3zgw/giphy.gif
Note:
- take 10k frames as base
+++
import cv2

def process_video():
    cap = cv2.VideoCapture("input_file.mp4")
    out = cv2.VideoWriter("output_file.avi", ...)
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # ... DO SOME STUFF TO frame ... #
        out.write(frame)

process_video()
+++
def process_video(group_number):
    cap = cv2.VideoCapture("input_file.mp4")
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump * group_number)
    proc_frames = 0
    out = cv2.VideoWriter("output_{}.avi".format(group_number), ...)
    while proc_frames < frame_jump:
        ret, frame = cap.read()
        # ... DO SOME STUFF TO frame ... #
        proc_frames += 1
        out.write(frame)
    return None
import multiprocessing as mp
num_processes = mp.cpu_count()
num_frames = int(cv2.VideoCapture("input_file.mp4").get(cv2.CAP_PROP_FRAME_COUNT))
frame_jump = num_frames // num_processes

p = mp.Pool(num_processes)
p.map(process_video, range(num_processes))

with open("temp_files.txt", "w") as f:
    for t in ["output_{}.avi".format(i) for i in range(num_processes)]:
        f.write("file {} \n".format(t))
ffmpeg_command = "ffmpeg -f concat -safe 0 -i temp_files.txt -vcodec copy merged_output.avi"  # output name is illustrative
sp.Popen(ffmpeg_command, shell=True).wait()
@[12-13](Number of parallel sub-processes) @[14-15](Number of frames for each sub-process to process) @[1-10](We modify the original function a bit) @[1,3-6](The important changes) @[17-18](And now we parallelize the processing of the video) @[20-24](Merge the videos)
+++
The `# ... DO SOME STUFF ... #`:
Note:
- Even for a small 640p video (pic), we can see significant speed-ups
- Speed-up of ~2.3x on a 2-core machine with just 10 extra lines of code
- Also note that with this method we've stayed entirely within Python, unlike when piping into FFmpeg
+++
+++
---?image=https://ksassets.timeincuk.net/wp/uploads/sites/54/2017/12/RvxjvHGLJgCCi7PkDEmMWP-970-80-920x517.jpg @transition[none] @title[GPU] @snap[north-west headline] @css[heading](Parallelism in Video Processing) @snapend
@snap[south-east] @color[white](GPU Parallelism) @snapend
@snapend
+++
Whatever we've seen up till now is inter-frame parallelization
+++?image=https://image.ibb.co/dZgSYe/18_10_04_546_519.png&size=auto 70% @transition[none] @snap[north] Intra-Frame Parallelization @snapend
+++
Hidden Problems of parallelizing your video pipeline
@ul
- Reassembly of the video fragments in the correct order
- Assumption: Cores are powerful enough to process entire frames by themselves @ulend
+++
+++
@cuda.jit
def image_conv(ip_img, kernel, op_img, channels, w, h):
    r, c = cuda.grid(2)  # Finding the global position of the thread

    kernel_c = cuda.const.array_like(conv_window)

    for b in range(BATCH):
        # Do some stuff
@[1](Decorator to turn the Python function into a CUDA kernel) @[2](Function header) @[3](Find the global position of the thread) @[5](Push the convolution kernel to constant memory for faster read access) @[7](Iterate over the batch of images) @[8](Perform per-frame processing ops)
+++
@ul
- CPU Mem Access Speeds: RAM < Cache < Register
- GPU Mem Access Speeds: Global < Constant < Shared @ulend
+++
@ul
- On chip memory: Much faster access rates than global memory
- Latency: ~ 1/100x (uncached global memory latency)
- Load the part of the input image used by a block into shared memory
- Read from shared memory for further computation in the kernel (see the sketch below) @ulend
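+++
A minimal sketch of this idea, not the speakers' convolution kernel (the kernel name, TILE size, and launch configuration are illustrative assumptions): each thread block stages its patch of the frame in on-chip shared memory once, so later reads within the block hit shared memory instead of global memory.

from numba import cuda, float32
import numpy as np

TILE = 16  # each block handles a TILE x TILE patch of the frame

@cuda.jit
def brighten_tiled(ip_img, op_img, offset):
    r, c = cuda.grid(2)                            # global pixel position of this thread
    tr, tc = cuda.threadIdx.x, cuda.threadIdx.y    # position inside the block
    # On-chip staging area shared by all threads of this block
    tile = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    if r < ip_img.shape[0] and c < ip_img.shape[1]:
        tile[tr, tc] = ip_img[r, c]                # one global-memory read per pixel
    cuda.syncthreads()                             # wait until the whole tile is staged
    if r < ip_img.shape[0] and c < ip_img.shape[1]:
        op_img[r, c] = tile[tr, tc] + offset       # this read comes from shared memory

frame = np.random.rand(360, 420).astype(np.float32)   # stand-in for one grayscale frame
d_in = cuda.to_device(frame)
d_out = cuda.device_array_like(d_in)
blocks = ((frame.shape[0] + TILE - 1) // TILE, (frame.shape[1] + TILE - 1) // TILE)
brighten_tiled[blocks, (TILE, TILE)](d_in, d_out, 0.1)
result = d_out.copy_to_host()

A simple brightness offset only reads each staged pixel once; a real convolution reuses each staged pixel many times, which is where the shared-memory staging actually pays off.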
+++
Source: http://www.khshim.com/archives/681
+++
Source: https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
+++
@ul
- Implicit data transfers are blocking or synchronous transfers
- Kernel Launches are asynchronous
- Idea: Overlap kernel execution with data transfers each running in different streams. @ulend
+++
for i in range(n_streams):
    streams.append(cuda.stream())

ip_img = cuda.device_array((n_streams, BATCH, rows, cols, 3))
op_img = cuda.device_array((n_streams, BATCH, rows, cols, 3))

for i, stream in enumerate(streams):
    ip_img[i].copy_to_device(img_batch, stream=stream)
    kernel[blockspergrid, threadsperblock, stream](ip_img[i], op_img[i], ...)
    op_img_h = op_img[i].copy_to_host(stream=stream)
@[1-2](Creating the streams required for processing) @[4-5](One way of allotting your input and output data arrays on the GPU) @[7](Iterate over the streams) @[8-10](Do the required computation on that stream)
+++ @title[Results - GPU]
+++
@title[Putting it all together]
+++
+++
@ul
- FFmpeg - latest codecs, best compression methods with the best quality
- Multiprocessing the ndarrays - lets us do it really fast
- How to avoid having to write to hard disk? (one idea sketched below)
- Figuring out how to get the best of both worlds --> Try it out yourselves! You can check your results here @ulend
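+++
One way to attack the hard-disk question, following the same Zulko blog post referenced earlier (the file name and frame size here are assumptions): read decoded frames straight out of FFmpeg's stdout into numpy, so FFmpeg handles the codecs while the frames only ever live in memory.

import numpy as np
import subprocess as sp

W, H = 420, 360                              # frame size, assumed known in advance
command = ['ffmpeg',
           '-i', 'input_file.mp4',
           '-f', 'image2pipe',
           '-pix_fmt', 'bgr24',
           '-vcodec', 'rawvideo', '-']       # decoded frames go to stdout
pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=10**8)

while True:
    raw = pipe.stdout.read(W * H * 3)        # one frame's worth of bytes
    if len(raw) < W * H * 3:
        break                                # end of stream
    frame = np.frombuffer(raw, dtype=np.uint8).reshape((H, W, 3))
    # ... DO SOME STUFF TO frame ... #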
+++ @title[Results]
+++?image=http://quotesideas.com/wp-content/uploads/2015/10/mMaddc0.jpg
+++
@ul
- Pipe into memory if I/O is the bottleneck and you have fast memory access rates (one way sketched below)
- No problem in piping into I/O directly if you have an NVMe SSD or any other fast I/O device @ulend
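+++
A minimal sketch of the "pipe into memory" idea, assuming a Linux machine: /dev/shm is a RAM-backed tmpfs, so pointing the per-process chunk files from the multiprocessing slides there keeps the intermediate writes off the hard disk without changing the rest of the pipeline.

import os

TMP_DIR = "/dev/shm/video_chunks"            # RAM-backed tmpfs on most Linux systems
os.makedirs(TMP_DIR, exist_ok=True)

def chunk_path(group_number):
    # drop-in replacement for "output_{}.avi" inside process_video()
    return os.path.join(TMP_DIR, "output_{}.avi".format(group_number))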