Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: OLE CF and VBA modules implemented #274

Merged
merged 4 commits into from
Jan 14, 2025
Merged

Conversation

davidmagnotti
Copy link
Contributor

OLE CF (Object Linking and Embedding Compound File) format is a file format used for legacy Microsoft Office files, such as documents, workbooks, presentations, and others. It's also used with Visual Basic for Applications (VBA) which is known more commonly as Office macros.

I've implemented two modules for parsing OLE CF files and VBA. I've also expanded the dump command to allow dumping of file metadata such as stream metadata (OLE CF) and macros (VBA).

An example of how you could use this to identify the use of auto-execute macro method names like "Document_New":

import "vba"

rule detect_document_new
{
    condition:
        for any module in vba.module_code : (
            module matches /document_new/i
        )
}

- Added support for parsing OLE CF and VBA (macro-enabled Office) files.
Copy link
Member

@plusvic plusvic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work. This goes in the right direction, but I think it requires a bit of rewrite to make it more similar to other existing modules that use the nom crate for parsing binary files. The nom crate makes parsing complex data strucuture easier, and removes boilerplate code like the read_u32 and read_u16 functions.

lib/src/modules/vba/mod.rs Outdated Show resolved Hide resolved
lib/src/modules/olecf/parser.rs Outdated Show resolved Hide resolved
lib/src/modules/olecf/mod.rs Show resolved Hide resolved
repeated string module_names = 2;

// Type of each module (standard, class, form)
repeated string module_types = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using constant strings use enums.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually consistent with existing modules like Lnk. We're not using an enum because it's just integer typing for metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked for this pattern in the lnk module and didn't find it. In most cases, fields that have a value from a pre-defined set of possible values, are expressed as enums instead of strings. Using strings have the downside that you must remember which are the possible values. In this case the possible values are: "Standard", "Class" and "Unknown". If you forget that they start with a capital letter you can end writing a rule like module_type == "standard" which won't match.

In the other hand, enums allow using the constants defined by the module itself (as in module_type == vbs.ModuleType.Standard), which reduce errors.

repeated string module_types = 3;

// The actual VBA code for each module
repeated string module_code = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually repeated? If so, the name is a bit misleading considering that module_names and module_types are in plural form and this is singular.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking more carefully at the code, it looks like module_code (now module_codes) is actually an array where each item is a line of code. IMHO the module's code should be all together in a single string with newlines. Having the code separated in lines make harder to work with it. For instance, if you want to search for some pattern in the code you will need a loop that looks for the pattern one line at a time. I don't see any benefit in having the code split in lines.

@davidmagnotti
Copy link
Contributor Author

Thank you. While I can address most of the comments in another revision soon, I've been experimenting with using nom and realistically I won't have bandwidth to refactor the code from the current read_u16/read_u32 approach used in each module to use nom any time soon. Any chance we may be able to merge and address in a refactor later?

@plusvic
Copy link
Member

plusvic commented Dec 30, 2024

I prefer to maintain a separate development branch until the nom rewrite is done and not merging the code until it gets a bit more stable. I think I can the nom part in a few weeks.

- Adjusted modules to use `nom`.
- Added links to specifications.
- Fixed issue with copying entire input data in the VBA module.
@davidmagnotti
Copy link
Contributor Author

I prefer to maintain a separate development branch until the nom rewrite is done and not merging the code until it gets a bit more stable. I think I can the nom part in a few weeks.

I (unexpectedly) found some more time to put into this :-) does the re-work match your expectations?

@plusvic plusvic merged commit c0f5926 into VirusTotal:main Jan 14, 2025
3 checks passed
@plusvic
Copy link
Member

plusvic commented Jan 14, 2025

Oops, I made a mistake and merged this PR inadvertenly and then restored the main branch to its previous state. Now I can't re-open this PR, can you crate a new PR with the canges?

@davidmagnotti
Copy link
Contributor Author

Oops, yes can do.

@davidmagnotti
Copy link
Contributor Author

Done: #285

@plusvic
Copy link
Member

plusvic commented Jan 14, 2025

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants