Nice question.
I was able to reproduce the basic issue easily with the script below.
CREATE TABLE #Orders
  (
     EmpID  INT,
     Filler CHAR(1000)
  )

CREATE CLUSTERED INDEX ix
  ON #Orders(EmpID)

INSERT INTO #Orders
            (EmpID)
SELECT TOP 1000000 (ROW_NUMBER() OVER (ORDER BY @@SPID) - 1) / 50
FROM   master..spt_values v1,
       master..spt_values v2

SET STATISTICS IO ON

--Scan count 1, logical reads 27, CPU time = 0 ms, elapsed time = 4 ms.
SELECT TOP 1 EmpID,
             COUNT(*)
FROM   #Orders
GROUP  BY EmpID
OPTION (MAXDOP 1)

--Scan count 1, logical reads 143,283, CPU time = 281 ms, elapsed time = 287 ms.
SELECT COUNT(*)
FROM   #Orders
OPTION (MAXDOP 1)

--Scan count 7, logical reads 83,996 - DOP 6. CPU time = 375 ms, elapsed time = 115 ms.
SELECT TOP 1 EmpID,
             COUNT(*)
FROM   #Orders
GROUP  BY EmpID
ORDER  BY EmpID
OPTION (querytraceon 8649)
The above example creates 20,000 distinct EmpID values with 50 rows per group. The full table scan took 143,283 logical reads; the parallel TOP 1 plan took a variable number of reads per execution, with around 85,000 being typical; the serial TOP 1 plan took just 27 logical reads.
I could easily get the parallel TOP 1 plan to read the whole table by reducing the number of groups, but I wanted to demonstrate that this isn't always the case.
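As a sketch of that degenerate case (this variation wasn't part of my original tests, and the exact read counts will vary by machine and DOP): refilling the table with only 20 groups of 50,000 rows each means the first packet reaching the gather streams can require a large fraction of the table to be read, so the parallel TOP 1 plan ends up scanning close to all 1,000,000 rows.

```sql
-- Hypothetical variation: same table, but only 20 groups (50,000 rows each).
TRUNCATE TABLE #Orders

INSERT INTO #Orders
            (EmpID)
SELECT TOP 1000000 (ROW_NUMBER() OVER (ORDER BY @@SPID) - 1) / 50000
FROM   master..spt_values v1,
       master..spt_values v2

-- With so few groups, expect the parallel plan's logical reads to
-- approach the full-scan figure rather than a small fraction of it.
SELECT TOP 1 EmpID,
             COUNT(*)
FROM   #Orders
GROUP  BY EmpID
ORDER  BY EmpID
OPTION (querytraceon 8649)
```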
The TOP iterator can stop requesting rows as soon as it has received a single one (as this is TOP 1). Because there is an index on EmpID in the desired sort order, the serial plan simply processes the index in key order, and no more rows are requested from the index scan after the Stream Aggregate emits its first row. That is, the serial plan only needs to read the 50 rows for the first EmpID, plus 1 more to know the group has changed, and then it stops.
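One way to make that early termination explicit is a logically equivalent rewrite (my own illustration, not a query from the tests above) that seeks the lowest EmpID first and then aggregates only that group — this assumes EmpID has no NULLs, which holds for the data loaded here:

```sql
-- Equivalent to the TOP 1 ... ORDER BY EmpID query: find the first
-- key, then count only that group instead of aggregating everything.
SELECT TOP 1 EmpID,
             COUNT(*)
FROM   #Orders
WHERE  EmpID = (SELECT MIN(EmpID)
                FROM   #Orders)
GROUP  BY EmpID
```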
When I forced the parallel plan, the actual execution plan showed that 519,730 rows (i.e. more than 10,000 groups) had been read from the index scan. The parallel page supplier processed the index in order and distributed the rows among 6 threads, each of which calculated partial aggregates. The parent operator is an order-preserving Repartition Streams exchange that uses hash partitioning to distribute rows across a different 6 threads on its consumer side, guaranteeing that all rows from the same group end up on the same thread. The consumer is a second Stream Aggregate operator that calculates the final aggregate totals per group. 73 rows were emitted from that into the (again order-preserving) Gather Streams operator, which in turn output a single row to the TOP.
Rows are pushed across parallel exchanges in packets rather than individually, so some lumpiness is expected. The TOP 1 iterator calls GetRow, but the Gather Streams operator must wait for its first packet to arrive from each thread (as it is order preserving here) before it can output the first row. Even so, the subtree below appears to have done a lot of unneeded work in outputting the top 73 groups from which the top 1 was finally chosen. There is a known bug where some threads can fail to stop at all, but I wasn't hitting it here, so I just have to attribute this to the "stopping distance" mentioned in my first link.
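If you want to watch how far each thread gets while the parallel query is still running (rather than inspecting row counts afterwards), one option on SQL Server 2014 and later is the sys.dm_exec_query_profiles DMV — a sketch, assuming the target session has query profiling enabled (e.g. it is running with SET STATISTICS XML ON):

```sql
-- Run from a second session while the parallel TOP 1 query executes.
-- Shows per-thread, per-operator row counts as the plan progresses.
SELECT session_id,
       node_id,
       thread_id,
       physical_operator_name,
       row_count
FROM   sys.dm_exec_query_profiles
ORDER  BY thread_id,
          node_id
```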