[Hackathon 2024][Indium][Python] Avoid basic REGEX usages #119

Open
max-perrin opened this issue May 30, 2024 · 0 comments
Rule title

Avoid basic REGEX usages.

Language and platform

PoC made in Python, but can be applied the same way to PHP and Java too.

Rule description

Using regex methods for basic string manipulation is not time efficient.
Prefer string methods such as startswith and endswith, or the in operator, which are faster.

Noncompliant Code Example

import re

string = 'abcdef'
if re.search(r'^abc', string):
    print('string starts with abc')

Compliant Solution

string = 'abcdef'
if string.startswith('abc'):
    print('string starts with abc')
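
The same idea extends to suffix and substring checks. The snippet below is a minimal sketch, not part of the original PoC, showing the three simplest correspondences. It also illustrates one caveat worth keeping in mind when refactoring: in Python, `$` also matches just before a trailing newline, whereas `endswith` does not. The variable names are illustrative only.

```python
import re

text = 'abcdef'

# Prefix: re.search(r'^abc', text)  ->  text.startswith('abc')
assert bool(re.search(r'^abc', text)) == text.startswith('abc')

# Suffix: re.search(r'def$', text)  ->  text.endswith('def')
assert bool(re.search(r'def$', text)) == text.endswith('def')

# Substring: re.search(r'cd', text)  ->  'cd' in text
assert bool(re.search(r'cd', text)) == ('cd' in text)

# Caveat: '$' also matches just before a trailing newline; endswith does not.
line = 'abcdef\n'
regex_suffix = bool(re.search(r'def$', line))  # True
plain_suffix = line.endswith('def')            # False
```

When lines are read from a file, stripping the trailing newline first (for example with `line.rstrip('\n')`) keeps the two suffix checks equivalent.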

Rule short description

Avoid using REGEX for basic string manipulation.

Rule justification

We measured the execution time using the time module in Python. The resource used was a 1.1 million word list found online. To obtain representative results, tests were performed several times.

Noncompliant Code

import re

prefix = 'te'
regex = re.compile(fr'^{prefix}')
with open('1.1million word list.txt', 'r', encoding='utf-8') as file:
    count = 0
    for word in file:
        if regex.search(word) is not None:
            count += 1

Compliant Code

prefix = 'te'
with open('1.1million word list.txt', 'r', encoding='utf-8') as file:
    count = 0
    for word in file:
        if word.startswith(prefix):
            count += 1

We searched the 1.1 million word list for strings starting with 'te', using either regex.search or string.startswith. Each method was run 5 times, giving the results below:

using regex search

Iteration   Time (ms)
1           560.47
2           670.64
3           675.09
4           831.12
5           517.98
Average     651.06

using string startswith

Iteration   Time (ms)
1           318.57
2           259.16
3           287.76
4           253.95
5           245.64
Average     273.02

Conclusion: for this test session, using string methods was on average 2.4x faster than using regex (651.06 ms vs 273.02 ms). More tests should be done to study the energy consumption, but it is expected to be roughly proportional to the execution time.
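
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a timing harness. It is not the script used for the figures above: the helper names (`count_with_regex`, `count_with_startswith`, `time_ms`) and the small in-memory word list are assumptions made for illustration, and `time.perf_counter` is used as one reasonable choice from the time module.

```python
import re
import time


def count_with_regex(words, prefix):
    # re.escape keeps the prefix literal even if it contains metacharacters
    regex = re.compile('^' + re.escape(prefix))
    return sum(1 for word in words if regex.search(word) is not None)


def count_with_startswith(words, prefix):
    return sum(1 for word in words if word.startswith(prefix))


def time_ms(func, *args, repeats=5):
    """Return (last result, list of elapsed times in milliseconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        timings.append((time.perf_counter() - start) * 1000)
    return result, timings


# Hypothetical stand-in for the 1.1 million word list used in the issue
words = ['test', 'tea', 'apple', 'term', 'banana'] * 20000

regex_count, regex_times = time_ms(count_with_regex, words, 'te')
plain_count, plain_times = time_ms(count_with_startswith, words, 'te')

print(f'regex:      {sum(regex_times) / len(regex_times):.2f} ms on average')
print(f'startswith: {sum(plain_times) / len(plain_times):.2f} ms on average')
```

Absolute timings will vary with the machine and the data, so only the ratio between the two averages is meaningful to compare.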

Severity / Remediation Cost

Estimate the severity and remediation cost of your issue.

  • Severity: Minor - the impact of the bad practice is small unless the volume of searched strings is very large (uncommon, but it can happen).
  • Remediation cost: Easy - energy-efficient alternative functions exist in Python, Java, PHP and most other languages. The code requires a small, uncomplicated refactor.

Implementation principle

  • Search for regexes containing no special characters (plain literals)
  • Search for regexes whose only special characters are the ^ and/or $ anchors, in calls to re.search, re.match and re.compile
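
The two checks above could be sketched as follows. This is only an illustration of the detection logic on a pattern string, not the actual rule implementation: the function name `suggest_replacement` and the `METACHARACTERS` set are hypothetical, the sketch assumes re.search semantics (re.match is implicitly anchored at the start), and it ignores flags such as re.MULTILINE or re.IGNORECASE, which would change the meaning of the pattern.

```python
# Characters with a special meaning in Python regex syntax
METACHARACTERS = set('.^$*+?{}[]\\|()')


def suggest_replacement(pattern):
    """Return a suggested string operation for a trivial pattern, else None."""
    starts = pattern.startswith('^')
    ends = pattern.endswith('$')
    literal = pattern[1:] if starts else pattern
    literal = literal[:-1] if ends else literal
    if not literal or any(ch in METACHARACTERS for ch in literal):
        return None  # not a plain literal: leave the regex alone
    if starts and ends:
        return f'== {literal!r}'
    if starts:
        return f'.startswith({literal!r})'
    if ends:
        return f'.endswith({literal!r})'
    return f'{literal!r} in ...'


print(suggest_replacement('^abc'))   # .startswith('abc')
print(suggest_replacement('abc$'))   # .endswith('abc')
print(suggest_replacement('abc'))    # 'abc' in ...
print(suggest_replacement('^a.c'))   # None
```

A real analyzer would apply this to the string literal passed to re.search, re.match or re.compile in the syntax tree, and would skip any call that passes flags.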
@max-perrin max-perrin added Hackathon 2024 New issues tagged during the hackathon 2024 spotter labels May 30, 2024