Scrubbing (poor) data.

I have sensors - a great many - which report numbers daily. There's a long pipeline and many processes between these sensors and the ~daily reports I get, and sometimes weird spikes and dips appear in the numbers, often fixed the very next day.

At times this has to be scrubbed and/or dealt with to prevent downstream work from making (poor-)data-driven decisions rather than (good-)data-driven decisions.

Here's a simplified example.

In [1]:
import numpy as np
import pprint

ignore_dates = [20160225, 20160226]

readings = [
    [20160221, 10],
    [20160225, 20],
    [20160226, 10]
]

def report(vals, desc="readings"):
    fmt = "{:>30}: {}"
    print(fmt.format(desc, "\n{}".format(pprint.pformat(vals))))
    print(fmt.format("np.array[:,1]", np.array(vals)[:, 1]))
    print(fmt.format("np.median of readings", np.median(np.array(vals)[:, 1])))

    # day-over-day (first) differences; the first reading has no predecessor, so its delta is 0
    deltas = [val[1] - vals[max(idx - 1, 0)][1] for idx, val in enumerate(vals)]
    print(fmt.format("deltas", deltas))

    # only the upward moves -- a spike shows up here as an outsized value
    pos_deltas = [d for d in deltas if d > 0]
    print(fmt.format("pos_deltas", pos_deltas))

    print(fmt.format("np.median of pos_deltas", np.median(pos_deltas) if pos_deltas else None))
    print(fmt.format("np.percentile(pos_deltas, 75)", np.percentile(pos_deltas, 75) if pos_deltas else None))

report(readings)
readings = [v for v in readings if v[0] not in ignore_dates]
print()
report(readings, "readings ignoring {}".format(ignore_dates))
print("\nA more complete example")

readings = [
    [20160221, 10],
    [20160225, 20],
    [20160226, 10],
    [20160301, 12],
    [20160305, 13],
    [20160310, 12],
    [20160320, 13],
    [20160328, 15],
    [20160402, 17]
]
report(readings)
readings = [v for v in readings if v[0] not in ignore_dates]
print()
report(readings, "readings ignoring {}".format(ignore_dates))
                      readings: 
[[20160221, 10], [20160225, 20], [20160226, 10]]
                 np.array[:,1]: [10 20 10]
         np.median of readings: 10.0
                        deltas: [0, 10, -10]
                    pos_deltas: [10]
       np.median of pos_deltas: 10.0
 np.percentile(pos_deltas, 75): 10.0

readings ignoring [20160225, 20160226]: 
[[20160221, 10]]
                 np.array[:,1]: [10]
         np.median of readings: 10.0
                        deltas: [0]
                    pos_deltas: []
       np.median of pos_deltas: None
 np.percentile(pos_deltas, 75): None

A more complete example
                      readings: 
[[20160221, 10],
 [20160225, 20],
 [20160226, 10],
 [20160301, 12],
 [20160305, 13],
 [20160310, 12],
 [20160320, 13],
 [20160328, 15],
 [20160402, 17]]
                 np.array[:,1]: [10 20 10 12 13 12 13 15 17]
         np.median of readings: 13.0
                        deltas: [0, 10, -10, 2, 1, -1, 1, 2, 2]
                    pos_deltas: [10, 2, 1, 1, 2, 2]
       np.median of pos_deltas: 2.0
 np.percentile(pos_deltas, 75): 2.0

readings ignoring [20160225, 20160226]: 
[[20160221, 10],
 [20160301, 12],
 [20160305, 13],
 [20160310, 12],
 [20160320, 13],
 [20160328, 15],
 [20160402, 17]]
                 np.array[:,1]: [10 12 13 12 13 15 17]
         np.median of readings: 13.0
                        deltas: [0, 2, 1, -1, 1, 2, 2]
                    pos_deltas: [2, 1, 1, 2, 2]
       np.median of pos_deltas: 2.0
 np.percentile(pos_deltas, 75): 2.0

When skipping past a spike day, it's helpful to remember that there's a very large positive first (time) difference on that day and an equally large negative one on the next day: hence the simple approach of skipping both the day of the (up) spike and the (rectifying) day after.
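
To make that concrete, here is a minimal sketch (not part of the notebook above) of deriving the ignore list from the deltas themselves instead of hard-coding it: flag any day whose positive delta dwarfs the typical one, then also flag the day after it. The helper name find_spike_dates and the 3x-median-of-positive-deltas threshold are illustrative assumptions, not a vetted rule.

import numpy as np

def find_spike_dates(vals, factor=3):
    # values only (drop the dates), then first (time) differences between consecutive days
    values = [v[1] for v in vals]
    deltas = [b - a for a, b in zip(values, values[1:])]
    pos_deltas = [d for d in deltas if d > 0]
    if not pos_deltas:
        return []
    # assumption: anything bigger than factor * the median positive delta counts as a spike
    threshold = factor * np.median(pos_deltas)
    flagged = []
    for idx, d in enumerate(deltas, start=1):  # deltas[idx - 1] belongs to vals[idx]
        if d > threshold:
            flagged.append(vals[idx][0])             # the (up) spike day
            if idx + 1 < len(vals):
                flagged.append(vals[idx + 1][0])     # the (rectifying) day after
    return flagged

readings = [
    [20160221, 10],
    [20160225, 20],
    [20160226, 10],
    [20160301, 12],
    [20160305, 13],
]
auto_ignore = find_spike_dates(readings)
print(auto_ignore)  # [20160225, 20160226]
readings = [v for v in readings if v[0] not in auto_ignore]

A real version would probably also confirm that the following day carries the matching negative delta before dropping it, but the shape of the idea is the same as the manual ignore_dates list above.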