These tests essentially duplicate the logic that generates the pytest
commands and then compare the duplicate against the original. That is
brittle, and since we already run all the variants here regularly, I
think we have other ways of knowing if we broke a command.
I don't think these tests are providing sufficient value to merit their
added complexity.