
Question about implementation details #15

Open · sweet132 opened this issue Nov 14, 2023 · 17 comments

@sweet132
Hello, I acknowledge that this is a good piece of work. However, in the code you set batch_size=256, while the paper states 128 (maybe the version of the paper I downloaded is wrong? I downloaded it from arXiv). I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
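One plausible reason batch size moves retrieval accuracy this much: with an in-batch contrastive objective, the batch size sets the number of negatives per positive. A generic symmetric InfoNCE sketch, not necessarily the repo's exact loss:

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over in-batch negatives: each of the B texts is
    contrasted against all B videos and vice versa, so a larger batch
    means more (and harder) negatives per positive."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                     # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)  # positives on diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```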

@shams2023

> In the code you set batch_size=256, while the paper states 128 … with batch_size=128 it is only about 47%.

Hello, what do you need to prepare to run this project? I have been trying for several days but still cannot run it successfully. How can I get it running?

@Tiiivoo commented Nov 15, 2023

Hello, may I ask which version of PyTorch you are using? Have you run into any issues with batch_first=True?
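For reference, batch_first=True was added to nn.MultiheadAttention in PyTorch 1.9; on older versions the argument does not exist, and mixing the two layouts silently swaps batch and sequence. A minimal check:

```python
import torch
import torch.nn as nn

# batch_first=True (PyTorch >= 1.9) expects (batch, seq, dim); the legacy
# default is (seq, batch, dim). Passing the wrong layout does not error,
# it just transposes batch and sequence, which can look like a model bug.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(4, 12, 512)        # (batch=4, seq=12, dim=512)
out, _ = attn(x, x, x)
print(out.shape)                   # torch.Size([4, 12, 512])
```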

@sweet132 (Author)

I just downloaded the code and data. It looks like 8 GPUs with a batch size of 256 are essential for reproducing the results. @shams2023

@sweet132 (Author)

The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo
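(For anyone setting up the environment: the official PyTorch 1.11.0 wheels were built against CUDA 11.3/11.5, and the cu113 build runs under a CUDA 11.6 driver, e.g. `pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113`; if that wheel tag does not resolve, check the official previous-versions page.)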

@shams2023

> The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6.

Thank you for your answer!
The author mentions in the paper that the interaction module is a co-attention transformer. Which part of the code implements it?

@shallowdream66

@sweet132 Have you noticed how much GPU memory is used when batch_size=128? Even after reducing batch_size_val, I still get the error "CUDA out of memory when evaluating. Testing model at the end!".
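One generic workaround for eval-time OOM (not the repo's actual eval code) is to build the text-video similarity matrix in chunks under torch.no_grad(), moving each block to the CPU:

```python
import torch

@torch.no_grad()
def chunked_sim_matrix(text_embs, video_embs, chunk=128):
    """Compute the (N_text, N_video) similarity matrix block by block so
    the full matrix never sits on the GPU all at once."""
    blocks = []
    for i in range(0, text_embs.size(0), chunk):
        sims = text_embs[i:i + chunk] @ video_embs.T   # (chunk, N_video)
        blocks.append(sims.cpu())
    return torch.cat(blocks, dim=0)
```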

@sweet132 (Author)

The modeling section is in modeling.py, where you can find what you are looking for. @shams2023
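For readers after the idea rather than the exact code: a co-attention transformer block typically runs cross-attention in both directions, each modality attending over the other. A textbook sketch, not necessarily the exact module in modeling.py:

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Generic co-attention sketch: video tokens attend over text tokens
    and vice versa, with residual connections and layer norm."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens):
        # video queries attend over text keys/values, and vice versa
        v, _ = self.v2t(video_tokens, text_tokens, text_tokens)
        t, _ = self.t2v(text_tokens, video_tokens, video_tokens)
        return self.norm_v(video_tokens + v), self.norm_t(text_tokens + t)
```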

@sweet132 (Author)

With 8 GPUs at batch_size=256, each GPU uses around 20 GB of memory; you can use that setting as a reference. I am not sure why it takes up so much memory, since CLIP4Clip needs only around 11 GB. @shallowdream66
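One way to see where the extra memory goes is to log the per-rank peak around a single training step:

```python
import torch

# Measure peak GPU memory for one training step (per process/rank).
torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward/optimizer step here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gb:.1f} GB")
```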

@shallowdream66

> With 8 GPUs at batch_size=256, each GPU uses around 20 GB of memory …

I am also very confused. Compared to CLIP4Clip, it takes up much more memory and training time.

@Tiiivoo commented Nov 29, 2023

> The code is based on CLIP4Clip; the torch version is 1.11.0 and CUDA is 11.6. @Tiiivoo

Hello, regarding the 'msrvtt_train_with_vitb32_max1_title_titles.json' file: I do not understand where the 'titles' data comes from. The MSR-VTT dataset does not seem to include this part. If the 'titles' were obtained through web crawling, why are there 30 of them?
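Pending an authoritative answer, a quick way to see what the file actually contains is to load it and print one entry (its structure is not documented here, so treat the access pattern below as an assumption):

```python
import json

# Inspect the auxiliary-captions file; key names are unknown, so we only
# print one raw entry rather than assume a schema.
with open("msrvtt_train_with_vitb32_max1_title_titles.json") as f:
    data = json.load(f)

sample = next(iter(data.items())) if isinstance(data, dict) else data[0]
print(type(data), len(data))
print(sample)   # e.g., a video id mapped to its generated titles
```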

@whwu95 (Owner) commented Nov 29, 2023

> I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

I'm glad to hear that you've successfully reproduced our results. Regarding the batch size, we apologize for any confusion; it may indeed be an oversight in the paper. Please treat our code as the practical reference.

@sweet132 (Author)

Thank you for your reply. Although I achieved results similar to the paper on MSRVTT, I got poor results on MSVD (46.1), where I trained directly on the raw data, while for the VATEX (62.0) dataset I used the extracted frames you uploaded. I am not sure why that is. @whwu95

@sweet132 (Author)

Hello, I suggest you refer to the paper; the titles are generated by a model (GPT-2 or CLIP). @Tiiivoo
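To illustrate the CLIP side of that answer (the `vitb32` in the filename suggests ViT-B/32 scoring): candidate titles can be ranked against a video frame by CLIP similarity and the top one kept (`max1_title`). A hedged sketch with Hugging Face's CLIP, not the repo's actual pipeline; the frame path and candidate strings are hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score candidate titles against one video frame with CLIP ViT-B/32 and
# keep the best match. Illustrative only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame_0.jpg")                     # hypothetical frame
candidates = ["a man cooking pasta", "a soccer match highlight"]
inputs = processor(text=candidates, images=frame,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image.squeeze(0)  # (num_titles,)
best_title = candidates[sims.argmax().item()]
```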

@shams2023

> The modeling section is in modeling.py, where you can find what you are looking for.

How can I run this project on a single 3090?
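Pending an official answer, the usual single-GPU approximation is gradient accumulation. Caveat: with an in-batch contrastive loss the negatives per forward pass still equal the micro-batch size, so this does not fully reproduce an 8-GPU batch_size=256 run. A self-contained sketch with a toy stand-in model:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device)        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
micro_bs, accum_steps = 32, 256 // 32         # 8 micro-batches per update

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(micro_bs, 512, device=device)
    loss = model(x).pow(2).mean() / accum_steps   # dummy loss, scaled
    loss.backward()                               # gradients accumulate
optimizer.step()                                  # one "batch 256" update
```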

@fazliimam

+1

@shams2023

> I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.

I want to ask you for some help!
In train_video.py, the first 5 epochs are used to train the video-query branch, so why is the caption value computed in the model's forward pass?

As shown in the following figure:
[screenshot: the caption forward pass in the training loop]

Aren't the first 5 epochs supposed to train only the text encoder (i.e., the query encoder)? If the caption is included at this point, doesn't that mean the caption encoder is also being trained? I am confused about this part and hope you can help.
Thanks again for taking your time!
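One detail worth checking before concluding that the caption encoder trains: computing a branch in forward() only updates its weights if gradients actually flow into them; a detach() or requires_grad=False keeps the branch frozen even though its output is used. A toy, self-contained illustration (`caption_encoder` here is a hypothetical name, not the repo's):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.query_encoder = nn.Linear(8, 8)
        self.caption_encoder = nn.Linear(8, 8)
    def forward(self, x):
        q = self.query_encoder(x)
        c = self.caption_encoder(x).detach()   # detached: no grads flow back
        return (q * c).sum()

model = Toy()
model(torch.randn(2, 8)).backward()
print(model.query_encoder.weight.grad is None)    # False: this branch trains
print(model.caption_encoder.weight.grad is None)  # True: this branch is untouched
```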

@shams2023

> I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper …

Could you share your configuration for me to refer to? I do not know how to define the storage locations of the variables here, as shown in the following figure (it could also be co_train_msrvtt.sh):

[screenshot: path variables in the training script]
