295. Find Median from Data Stream#

Difficulty: Hard

Link to Problem: To see the Find Median from Data Stream problem on LeetCode, click here!


The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.

  • void addNum(int num) adds the integer num from the data stream to the data structure.

  • double findMedian() returns the median of all elements so far. Answers within \(10^{-5}\) of the actual answer will be accepted.

Constraints:

  • \(-10^{5}\) <= num <= \(10^{5}\)

  • There will be at least one element in the data structure before calling findMedian.

  • At most \(5 * 10^4\) calls will be made to addNum and findMedian.

Follow-up:

  1. If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

  2. If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Probelm Explanation:#

The problem, “Find Median from Data Stream,” deals with efficiently computing the median of a sequence of numbers as they are incrementally provided. The median is the middle value in an ordered list of numbers. When the list has an even number of elements, the median is the average of the two middle values.

Here’s a breakdown of the problem:

  1. Initialization: You are asked to implement a MedianFinder class that supports the following operations:

    • MedianFinder(): Initializes the MedianFinder object.

    • addNum(int num): Adds an integer num from the data stream to the data structure.

    • findMedian(): Returns the median of all elements added to the data structure.

  2. Median Calculation: The findMedian operation should efficiently compute the median, and the result should have a precision of at least \(10^{-5}\) (i.e., answers within this range will be considered correct).

  3. Examples:

    • If you add numbers [1, 2], the median should be 1.5 because (1 + 2) / 2 = 1.5.

    • If you add numbers [1, 2, 3], the median should be 2 because it’s the middle value.

  4. Constraints:

    • The integers provided in the data stream are in the range from \(-10^5\) to \(10^5\).

    • There will be at least one element in the data structure before calling findMedian.

    • At most, \(5 * 10^4\) calls will be made to addNum and findMedian.

The problem can be solved by using two priority queues (heaps): one max-heap to store the smaller half of the numbers and one min-heap to store the larger half of the numbers. The max-heap ensures that the largest number in the smaller half is at the top, and the min-heap ensures that the smallest number in the larger half is at the top. These heaps are balanced to efficiently find the median when requested.

Solution:#

Here’s a Python function to implement this algorithm:

import heapq

class MedianFinder:

    def __init__(self):
        # Initialize two heaps: a max-heap (small_half) for the smaller half of numbers,
        # and a min-heap (large_half) for the larger half of numbers.
        self.small_half = []  # Max-heap for the smaller half of the numbers
        self.large_half = []  # Min-heap for the larger half of the numbers

    def addNum(self, num: int) -> None:
        # Add the number to the appropriate heap based on its value
        if not self.small_half or num <= -self.small_half[0]:
            # If the number is less than or equal to the current maximum in the smaller half,
            # add it to the smaller half (max-heap)
            heapq.heappush(self.small_half, -num)
        else:
            # Otherwise, add it to the larger half (min-heap)
            heapq.heappush(self.large_half, num)

        # Balance the heaps if necessary to ensure the size difference is at most 1
        if len(self.small_half) > len(self.large_half) + 1:
            # If the size of the smaller half is more than one greater than the size of the larger half,
            # move the maximum from the smaller half to the larger half to balance them.
            heapq.heappush(self.large_half, -heapq.heappop(self.small_half))
        elif len(self.large_half) > len(self.small_half):
            # If the size of the larger half is greater than the size of the smaller half,
            # move the minimum from the larger half to the smaller half to balance them.
            heapq.heappush(self.small_half, -heapq.heappop(self.large_half))

    def findMedian(self) -> float:
        if len(self.small_half) == len(self.large_half):
            # If both halves have the same size, there's an even number of elements,
            # so the median is the average of the top elements in both heaps.
            return (-self.small_half[0] + self.large_half[0]) / 2.0
        else:
            # If the smaller half has more elements, it contains the median (odd number of elements).
            return -self.small_half[0]

Explanation:#

The code defines a MedianFinder class for efficiently finding the median of a stream of numbers. It uses two heaps: a max-heap (small_half) to store the smaller half of the numbers and a min-heap (large_half) to store the larger half. Here’s how the code works:

  1. The __init__ method initializes the two heaps.

  2. The addNum method adds a number to the appropriate heap, ensuring that the heaps remain balanced. It does this by comparing the number to the current maximum in the smaller half and adding it to the appropriate heap. Then, it checks the sizes of the two heaps and rebalances them if necessary.

  3. The findMedian method calculates and returns the median. If both halves have the same size (even number of elements), it computes the average of the tops of both heaps. If the smaller half has more elements (odd number of elements), it returns the top of the max-heap, which is the median.

In summary, this code efficiently maintains two heaps to divide the elements into smaller and larger halves, allowing for quick median retrieval. It offers a time complexity of O(log N) for adding elements and O(1) for finding the median, where N is the total number of elements processed. The space complexity is O(1) as the space used by the two heaps is independent of the input size.

Test cases:#

Here’s how you can use this solution:

# Example 1: 

# Example usage:
medianFinder = MedianFinder()
medianFinder.addNum(1)
medianFinder.addNum(2)
print(medianFinder.findMedian())  # Output: 1.5
medianFinder.addNum(3)
print(medianFinder.findMedian())  # Output: 2.0
1.5
2

Time and Space Complexity Analysis#

Let’s analyze the time and space complexity of the MedianFinder class:

Time Complexity:

  1. MedianFinder() Initialization: This operation has a time complexity of O(1). It simply initializes the two heaps.

  2. addNum(int num): The time complexity for adding a number is O(log N), where N is the total number of elements processed so far. This is because, in the worst case, you might need to perform logarithmic operations on the heaps to maintain their balance.

  3. findMedian(): Finding the median has a time complexity of O(1). It simply involves accessing the tops of the two heaps, which takes constant time.

Space Complexity:

  1. MedianFinder() Initialization: The space complexity for initialization is O(1). It involves creating two empty heaps, which do not depend on the size of the data stream.

  2. addNum(int num): The space complexity for adding a number is also O(1). It involves adding the number to one of the two heaps, which do not significantly affect the overall space complexity.

  3. findMedian(): The space complexity for finding the median is O(1). It doesn’t require any additional data structures that grow with the input size.

In summary, the time complexity of the addNum operation is O(log N), and the time complexity of the findMedian operation is O(1). The space complexity is O(1) as well, as the space used by the two heaps is independent of the size of the data stream. This solution efficiently maintains and retrieves the median, making it suitable for the given problem constraints.

Challenging Exercises:#

  1. Generalized kth Element: Extend the MedianFinder class to support finding the kth element in the data stream efficiently. This means finding the kth smallest or kth largest element in the stream.

  2. Sliding Window Median: Given an array and a sliding window size, find the median within the window as it slides over the array. This is an extension of the problem to a moving window scenario.