comments | difficulty | edit_url | tags | |
---|---|---|---|---|
true |
Medium |
|
Table: Posts
+-------------+---------+ | Column Name | Type | +-------------+---------+ | post_id | int | | user_id | int | | post_date | date | +-------------+---------+ post_id is the primary key (column with unique values) for this table. Each row of this table contains post_id, user_id, and post_date.
Write a solution to find users who demonstrate bursty behavior in their posting patterns during February 2024
. Bursty behavior is defined as any period of 7 consecutive days where a user's posting frequency is at least twice to their average weekly posting frequency for February 2024
.
Note: Only include the dates from February 1
to February 28
in your analysis, which means you should count February as having exactly 4
weeks.
Return the result table orderd by user_id
in ascending order.
The result format is in the following example.
Example:
Input:
Posts table:
+---------+---------+------------+ | post_id | user_id | post_date | +---------+---------+------------+ | 1 | 1 | 2024-02-27 | | 2 | 5 | 2024-02-06 | | 3 | 3 | 2024-02-25 | | 4 | 3 | 2024-02-14 | | 5 | 3 | 2024-02-06 | | 6 | 2 | 2024-02-25 | +---------+---------+------------+
Output:
+---------+----------------+------------------+ | user_id | max_7day_posts | avg_weekly_posts | +---------+----------------+------------------+ | 1 | 1 | 0.2500 | | 2 | 1 | 0.2500 | | 5 | 1 | 0.2500 | +---------+----------------+------------------+
Explanation:
- User 1: Made only 1 post in February, resulting in an average of 0.25 posts per week and a max of 1 post in any 7-day period.
- User 2: Also made just 1 post, with the same average and max 7-day posting frequency as User 1.
- User 5: Like Users 1 and 2, User 5 made only 1 post throughout February, leading to the same average and max 7-day posting metrics.
- User 3: Although User 3 made more posts than the others (3 posts), they did not reach twice the average weekly posts in their consecutive 7-day window, so they are not listed in the output.
Note: Output table is ordered by user_id in ascending order.
We can use self-join to connect the Posts
table with itself. The connection condition is p1.user_id = p2.user_id
and p2.post_date
is between p1.post_date
and 6 days after p1.post_date
. Then we group the connection results by p1.user_id
and p1.post_id
to count the number of posts for each user within 7 days of each day. We save this result in table P
.
Next, we count the average number of posts per week for each user in February 2024 and save it in table T
. Note that we need to find records where post_date
is between 2024-02-01
and 2024-02-28
, group the records by user_id
, then count the number of posts for each user, and finally divide by 4
to get the average number of posts per week. We save this result in table T
.
Finally, we connect tables P
and T
with the condition P.user_id = T.user_id
, then group by user_id
to count the maximum number of posts within 7 days for each user. We then filter out records that meet the condition max_7day_posts >= avg_weekly_posts * 2
to get the result. Note that we need to sort in ascending order by user_id
.
# Write your MySQL query statement below
WITH
P AS (
SELECT p1.user_id AS user_id, COUNT(1) AS cnt
FROM
Posts AS p1
JOIN Posts AS p2
ON p1.user_id = p2.user_id
AND p2.post_date BETWEEN p1.post_date AND DATE_ADD(p1.post_date, INTERVAL 6 DAY)
GROUP BY p1.user_id, p1.post_id
),
T AS (
SELECT user_id, COUNT(1) / 4 AS avg_weekly_posts
FROM Posts
WHERE post_date BETWEEN '2024-02-01' AND '2024-02-28'
GROUP BY 1
)
SELECT user_id, MAX(cnt) AS max_7day_posts, avg_weekly_posts
FROM
P
JOIN T USING (user_id)
GROUP BY 1
HAVING max_7day_posts >= avg_weekly_posts * 2
ORDER BY 1;
import pandas as pd
def find_bursty_behavior(posts: pd.DataFrame) -> pd.DataFrame:
# Subquery P
p1 = pd.merge(posts, posts, on="user_id", suffixes=("_1", "_2"))
p1 = p1[
p1["post_date_2"].between(
p1["post_date_1"], p1["post_date_1"] + pd.Timedelta(days=6)
)
]
p1 = p1.groupby(["user_id", "post_id_1"]).size().reset_index(name="cnt")
# Subquery T
t = posts[
(posts["post_date"] >= "2024-02-01") & (posts["post_date"] <= "2024-02-28")
]
t = t.groupby("user_id").size().div(4).reset_index(name="avg_weekly_posts")
# Joining P and T
merged_df = pd.merge(p1, t, on="user_id", how="inner")
# Filtering
filtered_df = merged_df[merged_df["cnt"] >= merged_df["avg_weekly_posts"] * 2]
# Aggregating
result_df = (
filtered_df.groupby("user_id")
.agg({"cnt": "max", "avg_weekly_posts": "first"})
.reset_index()
)
result_df.columns = ["user_id", "max_7day_posts", "avg_weekly_posts"]
# Sorting
result_df.sort_values(by="user_id", inplace=True)
return result_df