In your scenario, I believe the underlying issue is that we capture only one frame per clip. We plan to capture more frames per clip. Once we can do so, we should be able to determine human activity within the clip more accurately, whether or not a car is present.
We have vehicle detection already...