Implement interrupted time series method #41

davidpomerenke · 2024-03-22T11:05:08Z

Part of #38

Interrupted time series. It does not deal with hidden confounding and thus tends to be overly optimistic about the impact of the events. The concept itself is very simple.

available in https://github.com/pymc-labs/CausalPy

also simple to implement ourselves using any time-series prediction model, e.g. ARIMA

as a machine learning model, we can use random forests or boosted trees (from https://scikit-learn.org/stable/)
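To make the concept concrete, here is a minimal sketch of the idea: fit a forecaster on the data before the event, forecast the counterfactual, and take the difference to the observed values as the impact. The function names and the linear-trend forecaster are assumptions for illustration only; the forecaster is a simple stand-in that could be swapped for ARIMA or a tree-based model.

```python
import numpy as np


def linear_trend_forecast(history, horizon):
    """Stand-in forecaster: extrapolate a least-squares line fitted to `history`."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, deg=1)
    future_t = np.arange(len(history), len(history) + horizon)
    return intercept + slope * future_t


def interrupted_time_series(series, event_index, horizon, forecast=linear_trend_forecast):
    """Return the per-step impact: observed values minus the counterfactual forecast."""
    series = np.asarray(series, dtype=float)
    pre = series[:event_index]                       # data before the event only
    actual = series[event_index:event_index + horizon]
    counterfactual = forecast(pre, len(actual))      # what we would expect without the event
    return actual - counterfactual


# Synthetic example: a linear trend with a level shift of +5 from index 20 onwards.
t = np.arange(30)
series = 10 + 0.5 * t
series[20:] += 5
impact = interrupted_time_series(series, event_index=20, horizon=5)
```

Because nothing here accounts for other things that changed at the same time as the event, any such change ends up in `impact`, which is the hidden-confounding caveat mentioned above.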

davidpomerenke · 2024-03-26T16:39:57Z

I've had a look at https://causalpy.readthedocs.io/en/latest/notebooks/its_pymc.html

It seems that their models are extremely simple and do not have any autoregressive components, but only intercept, numeric variables (which can model a very basic trend), and categorical variables (which can model e.g. monthly variations).

We can still use this by adding lags to the input data (using pandas), thus making the model autoregressive.

That way it should be possible to use time-series regression and time-series random/boosted forests.
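A minimal sketch of the lag idea, assuming a daily pandas Series as input. The helper `add_lags`, the synthetic data, and the choice of 7 lags are hypothetical and only for illustration; the model is a plain scikit-learn random forest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def add_lags(y: pd.Series, n_lags: int) -> pd.DataFrame:
    """Build a frame with the target and its previous `n_lags` values as features."""
    frame = pd.DataFrame({"y": y})
    for k in range(1, n_lags + 1):
        frame[f"lag_{k}"] = y.shift(k)  # value observed k steps earlier
    return frame.dropna()               # the first n_lags rows have incomplete lags


# Synthetic daily series with a weekly pattern (made-up data).
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
values = 20 + 5 * np.sin(np.arange(120) * 2 * np.pi / 7) + rng.normal(0, 1, 120)
y = pd.Series(values, index=dates)

frame = add_lags(y, n_lags=7)
X, target = frame.drop(columns="y"), frame["y"]

# Chronological split: never shuffle time-series data.
train_X, test_X = X.iloc[:-14], X.iloc[-14:]
train_y, test_y = target.iloc[:-14], target.iloc[-14:]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(train_X, train_y)
one_step_pred = model.predict(test_X)
```

Note that these are one-step-ahead predictions that use observed lags. For a counterfactual spanning several days after an event, the lags would have to be filled with the model's own earlier predictions instead of observations, otherwise information from after the event leaks into the forecast.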

ARIMA will be harder to implement this way. I'm not sure about how ARIMA works exactly, but maybe rather than using some ARIMA model, we can get ARIMA features and then use simple regression on these features.

Thoughts on this @kleinlennart ?

davidpomerenke · 2024-04-04T15:46:08Z

Made some mistakes when using lags. Better to use a proper time-series library. Best one I know is https://github.com/Nixtla/statsforecast (+ mlforecast, neuralforecast).

davidpomerenke · 2024-04-12T15:48:57Z

Works in principle using ARIMA. Unexpectedly, for my test data the impact begins on the day before the protest.

The test data is:

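from datetime import date

# get_acled_events and get_mediacloud_counts are this project's own helpers.
# Protest events in Germany, restricted to those involving Last Generation.
events = get_acled_events(
    countries=["Germany"], start_date=date(2023, 6, 1), end_date=date(2024, 2, 29)
)
events = events[
    events["organizations"].apply(lambda x: "Last Generation (Germany)" in x)
]
# Article counts for the search term "Letzte Generation".
article_counts = get_mediacloud_counts(
    '"Letzte Generation"', date(2023, 1, 1), date(2024, 3, 31)
)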

This gives us the following impacts, where the protest date is at x=4 (!!):

[Plot: x-axis = date, y-axis = number of articles about Last Generation]

But the impact starts at x=3. I have verified that this is not a bug.

Here are the respective time series where the predictions start at the protest date / one day earlier / two days earlier / three days earlier / four days earlier. Note that the impact estimate is quite wrong for the first plot, because the model can already predict from the data before the protest date that the number of articles is going up.

[Plots, one each for protest_date = 0, 1, 2, 3, 4]

The above are all with ARIMA. With the (non-optimized!) random forest code the result is similar but not as continuous.

davidpomerenke · 2024-04-14T10:45:49Z

Interrupted time series is implemented and available via API.

Still missing:

tests for interrupted time_series
tests for impact.py
tests for trend.py
~~tests for new endpoints in api.py~~

Issues to think about:

individual impact estimates are sometimes negative -> don't make individual estimates
we can now get batches of individual impact estimates via the /impact endpoint, but only for a single outcome metric. For our visualizations we want batches of such estimates for diverse sets of protests (that have diverse outcome metrics, i.e. their respective names for now, or their respective topics). We might need to change the API design for this, see [ORG] API endpoint design #36

davidpomerenke · 2024-04-14T21:46:54Z

Todo:

when there are multiple protests by the same group on the same day, divide the impact estimate among them
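One way this could look with pandas, as a rough sketch. The column names (`date`, `organization`, `impact`) and the numbers are hypothetical, and the sketch assumes that each row initially carries the full estimate for that group and day.

```python
import pandas as pd

# Made-up events: Group A has two protests on 2023-07-01.
events = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-07-01", "2023-07-01", "2023-07-01", "2023-07-02"]),
        "organization": ["Group A", "Group A", "Group B", "Group A"],
        "impact": [30.0, 30.0, 12.0, 8.0],  # full estimate per group and day
    }
)

# Number of protests by the same group on the same day, aligned to each row.
n_same_day = events.groupby(["date", "organization"])["impact"].transform("count")

# Split the estimate evenly among those protests.
events["impact_share"] = events["impact"] / n_same_day
```

An even split is the simplest choice; weighting by protest size would be an alternative if such data is available.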

davidpomerenke · 2024-04-26T08:54:10Z

Is mostly finished but all data sources are currently down, so the tests don't run.

davidpomerenke self-assigned this Mar 22, 2024

kleinlennart closed this as completed May 28, 2024
