Implement interrupted time series method #41

Closed
davidpomerenke opened this issue Mar 22, 2024 · 6 comments

@davidpomerenke
Collaborator

davidpomerenke commented Mar 22, 2024

Part of #38

Interrupted time series. It does not deal with hidden confounding and thus tends to be overly optimistic about the impact of the events. It is a very simple concept.

@davidpomerenke davidpomerenke self-assigned this Mar 22, 2024
@davidpomerenke
Collaborator Author

davidpomerenke commented Mar 26, 2024

I've had a look at https://causalpy.readthedocs.io/en/latest/notebooks/its_pymc.html

It seems that their models are extremely simple and have no autoregressive components, only an intercept, numeric variables (which can model a very basic trend), and categorical variables (which can model e.g. monthly variations).

We can still use this by adding lags to the input data (using pandas), thus making the model autoregressive.

That way it should be possible to use time-series regression and time-series random/boosted forests.
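A minimal sketch of the lag idea, assuming a daily series in a pandas DataFrame. The column name `count` and the helper `add_lags` are made up for illustration and are not part of the project:

```python
import pandas as pd


def add_lags(df: pd.DataFrame, column: str, lags: list[int]) -> pd.DataFrame:
    """Return a copy of df with one extra column per lag of `column`."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down, so row t holds the value from t - lag
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    # The first max(lags) rows have no complete history, so drop them
    return out.dropna().reset_index(drop=True)


# Hypothetical toy data: six days of article counts
df = pd.DataFrame({"count": [3, 5, 4, 8, 9, 7]})
lagged = add_lags(df, "count", lags=[1, 2])
```

Any regression model can then be fit with the lag columns as inputs and `count` as the target. The lag columns must only contain values from before the row's own date, otherwise the target leaks into the features.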

ARIMA will be harder to implement this way. I'm not sure about how ARIMA works exactly, but maybe rather than using some ARIMA model, we can get ARIMA features and then use simple regression on these features.

Thoughts on this @kleinlennart ?

@davidpomerenke
Collaborator Author

Made some mistakes when using lags. Better to use a proper time-series library. Best one I know is https://github.com/Nixtla/statsforecast (+ mlforecast, neuralforecast).

@davidpomerenke
Collaborator Author

Works in principle using ARIMA. Unexpectedly, for my test data the impact begins on the day before the protest.

The test data is:

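from datetime import date

# get_acled_events and get_mediacloud_counts are project helpers (imports not shown).
# Protest events in Germany, restricted to Last Generation
events = get_acled_events(
    countries=["Germany"], start_date=date(2023, 6, 1), end_date=date(2024, 2, 29)
)
events = events[
    events["organizations"].apply(lambda x: "Last Generation (Germany)" in x)
]
# Counts of articles mentioning "Letzte Generation"
article_counts = get_mediacloud_counts(
    '"Letzte Generation"', date(2023, 1, 1), date(2024, 3, 31)
)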

This gives us the following impacts, where the protest date is at x=4 (!!):

(x-axis = date, y-axis = number of articles about Last Generation)

Image

But the impact starts at x=3. I have verified that this is not a bug.

Here are the respective time series where the predictions start at the protest date / one day earlier / two days earlier / three days earlier / four days earlier. Note that the impact estimate is quite wrong for the first plot: the model has already seen the number of articles going up in the data before the protest date, so it predicts part of the increase and attributes less of it to the protest.

protest_date = 0:

Image

protest_date = 1:

Image

protest_date = 2:

Image

protest_date = 3:

Image

protest_date = 4:

Image

The above are all with ARIMA. With the (non-optimized!) random forest code it is similar but not as continuous:

Image
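The effect of the prediction start date can be reproduced on synthetic data. This is only a sketch under stated assumptions: a plain AR(1) model fitted with numpy least squares stands in for ARIMA, the series is made up, and the names `fit_ar1`, `forecast_ar1` and `estimate_impact` are hypothetical, not the project's code:

```python
import numpy as np


def fit_ar1(y: np.ndarray) -> tuple[float, float]:
    """Least-squares fit of y[t] = a + b * y[t-1]."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (a, b), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return float(a), float(b)


def forecast_ar1(last: float, a: float, b: float, horizon: int) -> np.ndarray:
    """Iterate the fitted recursion forward from the last observed value."""
    preds = []
    for _ in range(horizon):
        last = a + b * last
        preds.append(last)
    return np.array(preds)


def estimate_impact(y: np.ndarray, cutoff: int, horizon: int) -> np.ndarray:
    """Actual minus counterfactual for y[cutoff : cutoff + horizon].

    Only observations before `cutoff` are used for fitting.
    """
    pre = y[:cutoff]
    a, b = fit_ar1(pre)
    counterfactual = forecast_ar1(pre[-1], a, b, horizon)
    return y[cutoff : cutoff + horizon] - counterfactual


# Synthetic baseline that follows an exact AR(1) recursion
base = np.empty(24)
base[0] = 5.0
for t in range(1, 24):
    base[t] = 2.0 + 0.8 * base[t - 1]

# Extra articles caused by the event; coverage already rises the day before
bump = np.zeros(24)
bump[19:] = [4.0, 20.0, 18.0, 12.0, 6.0]
y = base + bump
event = 20

# Predictions start at the event date: the pre-period contains the early rise
impact_from_event = estimate_impact(y, cutoff=event, horizon=4)
# Predictions start one day earlier: the pre-period is clean
impact_from_day_before = estimate_impact(y, cutoff=event - 1, horizon=5)
```

In this toy setup, starting the predictions one day earlier recovers the injected bump, while starting at the event date gives a smaller estimate because the counterfactual is fitted on, and continued from, the already elevated last observation.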

@davidpomerenke
Collaborator Author

davidpomerenke commented Apr 14, 2024

Interrupted time series is implemented and available via API.

Still missing:

  • tests for interrupted time_series
  • tests for impact.py
  • tests for trend.py
  • tests for new endpoints in api.py

Issues to think about:

  • individual impact estimates are sometimes negative -> don't make individual estimates
  • we can now get batches of individual impact estimates via the /impact endpoint, but only for a single outcome metric per batch. For our visualizations we want batches of estimates for diverse sets of protests with different outcome metrics (for now the names of the respective groups, or their respective topics). We might need to change the API design for this, see [ORG] API endpoint design #36

@davidpomerenke
Collaborator Author

Todo:

  • when there are multiple protests by the same group on the same day, divide the impact estimate among them
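One way this could look, as a sketch only: it assumes each event row already carries the impact estimate for its day, and it splits that estimate equally. The column names and the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical events table: two protests on the same day share one estimate
events = pd.DataFrame(
    {
        "date": ["2023-09-01", "2023-09-01", "2023-09-02"],
        "organization": ["Last Generation (Germany)"] * 3,
        "impact": [30.0, 30.0, 12.0],
    }
)

# Number of protests by the same group on the same day
n_same_day = events.groupby(["organization", "date"])["impact"].transform("size")

# Equal split of the day-level estimate (an assumption, other weights are possible)
events["impact_share"] = events["impact"] / n_same_day
```

An equal split is the simplest choice; weighting by protest size would be an alternative if that information is available.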

@davidpomerenke
Collaborator Author

Is mostly finished but all data sources are currently down, so the tests don't run.
