Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Grover's Algorithm using Lightning-Qubit's C++ API #980

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

jzaia18
Copy link

@jzaia18 jzaia18 commented Nov 5, 2024

Context:
Implements Grover's algorithm as a standalone C++ file in completion of a hiring test assignment. Does not affect any code within pennylane-lightning itself.

Description of the Change:
Adds directory jzaia_files and file jzaia_files/main.cpp which fully implement Grover's algorithm. This file directly interfaces with pennylane's lightning-qubit C++ API and implements Grover's algorithm for any 1-state selection oracle.

Benefits:
N/A

Possible Drawbacks:
N/A

Related GitHub Issues:
In completion of the assignment given by #963

@jzaia18
Copy link
Author

jzaia18 commented Nov 5, 2024

I can't add reviewers manually, but @tomlqc this is the first step. Working on benchmarking against a Python+Pennylane implementation now.

@jzaia18
Copy link
Author

jzaia18 commented Nov 6, 2024

I wasn't able to get my version to build with AVX instructions, nor was I able to successfully disable whatever type of acceleration was happening in Python. Nonetheless, here are my benchmarking results.

All tests are run on my personal computer, which runs Arch Linux and has an AMD Ryzen 7 2700X (8 cores, 16 threads) and 48GB of RAM. A minimal number of other processes were running during all benchmarks, and should not significantly impact results.

All tests involve running Grover's algorithm using a 6-qubit, 10-qubit, and 17-qubit oracle. Every oracle selects precisely 1 state, requiring ~sqrt(2^(n-1)) repetitions.

Custom C++ Implementation Results:

Time to run oracle 1: 0ms
Time to run oracle 2: 11ms
Time to run oracle 3: 26356ms

Top lines from perf:

Overhead  Command    Shared Object         Symbol
  24.73%  lq_grover  lq_grover             [.] std::complex<double> std::operator*<double>(double const&, std::complex<double> const&)
  19.77%  lq_grover  lq_grover             [.] std::complex<double>::operator*=(double)
  15.05%  lq_grover  lq_grover             [.] Pennylane::LightningQubit::Gates::GateImplementationsLM::applyNCHadamard<double>(std::complex<double>*, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<bool, std::allocator<bool> > const
   9.02%  lq_grover  lq_grover             [.] void Pennylane::LightningQubit::Gates::GateImplementationsLM::applyNC1<double, double, Pennylane::LightningQubit::Gates::GateImplementationsLM::applyNCHadamard<double>(std::complex<double>*, unsigned long, std::vector<unsigned long> ...
   8.46%  lq_grover  lq_grover             [.] std::complex<double> std::operator-<double>(std::complex<double> const&, std::complex<double> const&)
   7.99%  lq_grover  lq_grover             [.] std::complex<double> std::operator+<double>(std::complex<double> const&, std::complex<double> const&)
   5.79%  lq_grover  lq_grover             [.] std::complex<double>& std::complex<double>::operator+=<double>(std::complex<double> const&)
   5.74%  lq_grover  lq_grover             [.] std::complex<double>& std::complex<double>::operator-=<double>(std::complex<double> const&)
   1.97%  lq_grover  lq_grover             [.] std::complex<double>::__rep() const
   0.87%  lq_grover  lq_grover             [.] Pennylane::Util::exp2(unsigned long const&)

From this, it's clear that the vast majority of time is being spent on multiplying state amplitudes as an internal part of pennylane-lightning. Applying gates also takes significant execution time, although it is likely the case that the specific part of the gate application which takes so much time is the aforementioned complex amplitude multiplications.

Python Implementation using Pennylane's lightning.qubit Results:

Time to run oracle 1: 44ms
Time to run oracle 2: 12ms
Time to run oracle 3: 804ms

Note that the time to run oracle 1 is much larger than oracle 2 despite its circuit being smaller. This is likely due to overhead from the python interpreter the first time it executes the functions in this script. By pre-running kernel 1 so that the interpreter can run all the functions once, the execution time of oracle 1 (the 2nd time it is encountered) drops to 2ms.

Top lines from cProfile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9    0.001    0.000    0.290    0.032 decorators.py:50(wrapper_entry)
        3    0.002    0.001    0.229    0.076 qnode.py:836(construct)
        3    0.000    0.000    0.191    0.064 grover.py:52(run_grovers)
9866/9300    0.013    0.000    0.184    0.000 capture_meta.py:81(__call__)
        6    0.000    0.000    0.163    0.027 transform_program.py:492(__call__)
      283    0.007    0.000    0.158    0.001 grover.py:33(grovers_mirror)
233885/233819    0.073    0.000    0.157    0.000 {built-in method builtins.isinstance}
        3    0.000    0.000    0.151    0.050 _adjoint_jacobian.py:75(calculate_jacobian)
        3    0.002    0.001    0.151    0.050 _adjoint_jacobian_base.py:92(_process_jacobian_tape)
        3    0.028    0.009    0.136    0.045 _serialize.py:370(serialize_ops)
     9270    0.011    0.000    0.120    0.000 operation.py:1862(__init__)
     9270    0.027    0.000    0.109    0.000 operation.py:1098(__init__)
   210226    0.040    0.000    0.084    0.000 <frozen abc>:117(__instancecheck__)
        9    0.000    0.000    0.079    0.009 preprocess.py:265(decompose)
       15    0.003    0.000    0.079    0.005 {built-in method builtins.all}
    27729    0.007    0.000    0.076    0.000 preprocess.py:365(<genexpr>)
     9240    0.013    0.000    0.068    0.000 _serialize.py:402(get_wires)
        6    0.000    0.000    0.061    0.010 qnode.py:649(_update_gradient_fn)
        6    0.000    0.000    0.061    0.010 qnode.py:672(get_gradient_fn)
        6    0.000    0.000    0.060    0.010 lightning_qubit.py:488(supports_derivatives)
        3    0.000    0.000    0.060    0.020 lightning_qubit.py:220(_supports_adjoint)
      566    0.002    0.000    0.059    0.000 controlled_ops.py:1093(__init__)

Here, the predominant time sink interestingly seems to be constructing the circuit itself and not executing it. This makes sense, since as I mentioned before, I was not able to properly disable any sort of acceleration. It's likely that this implementation is using an underlying gate implementation which is much better for the given hardware.

Overall conclusion:
The python implementation significantly outperforms the C++ implementation for larger circuits. This is likely due to poor choice of CPU kernel on the part of the C++ code. However, the C++ implementation slightly outperforms the python implementation for small circuits. This is likely due to the expensive overhead of constructing circuits as python objects, which for small enough circuits dwarfs the execution time of the circuit. If I were to try to improve the C++ implementation, I would make it compile with the best possible CPU kernels for the given machine, and would expect to see it outperform the python implementation (assuming they use the same kernels).

@jzaia18 jzaia18 marked this pull request as ready for review November 6, 2024 21:51
Copy link

codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 28.53%. Comparing base (0419cdb) to head (a5ecaa0).

❗ There is a different number of reports uploaded between BASE (0419cdb) and HEAD (a5ecaa0). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (0419cdb) HEAD (a5ecaa0)
8 4
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #980       +/-   ##
===========================================
- Coverage   95.47%   28.53%   -66.94%     
===========================================
  Files         221       28      -193     
  Lines       33055     2516    -30539     
===========================================
- Hits        31559      718    -30841     
- Misses       1496     1798      +302     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tomlqc tomlqc requested a review from AmintorDusko November 7, 2024 16:37
@AmintorDusko AmintorDusko linked an issue Nov 7, 2024 that may be closed by this pull request
Copy link
Contributor

@AmintorDusko AmintorDusko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your hard work @jzaia18.
This is the first batch of questions
Let me know your thoughts.

circ = qml.QNode(run_grovers, dev)

expvals = circ(oracle, num_qubits)
results = [int(val.numpy() < 0) for val in expvals]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of .numpy() suggests that you are running your code with Torch or Tensorflow, for example. Is this the case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be. The python virtualenv I'm using does not have tensorflow nor torch installed. The return type I was getting from running the circuit was a list of pennylane.numpy.tensor.tensor. All of these tensors are 0-dimensional so I used .numpy() to convert these to a scalar value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I take your python file grover.py and run, as it is, in a fresh environment with Python 3.10, where I only installed requirements-dev.txt, I'm getting the following error message:

Traceback (most recent call last):
  File "/home/amintor/Projects/pennylane-lightning/prototypes/jzaia.py", line 114, in <module>
    run_experiment(*gen_oracle(0))
  File "/home/amintor/Projects/pennylane-lightning/prototypes/jzaia.py", line 91, in run_experiment
    results = [int(val.numpy() < 0) for val in expvals]
  File "/home/amintor/Projects/pennylane-lightning/prototypes/jzaia.py", line 91, in <listcomp>
    results = [int(val.numpy() < 0) for val in expvals]
AttributeError: 'float' object has no attribute 'numpy'

Would you know why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you make all sensible updates, could you please re-run your benchmarks in a new and fresh environment?
Please let us know about your results.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm seeing this too. It seems the issue was that I originally installed my dependencies from requirements.txt instead of requirements-dev.txt. It seems there are only a few differences, but namely the dev version installs Pennylane from source, so this is almost certainly related to that. Fixed in 8bd1f93

Comment on lines 106 to 130
/**
* @brief The first testing oracle for Grover's
*
* A 6-qubit test oracle for Grover's algorithm. Applies a pauli-X to the
* rightmost qubit if the leftmost 5 qubits are in the state |11010>
*
* @param sv The statevector to apply the oracle to. Must be 6 qubits
*/
void oracle1(StateVectorLQubitManaged<double> &sv) {
// Sanity check statevector
assert(sv.getNumQubits() == ORACLE1_QUBITS);

// Define controls to be used for applying the X gate
std::vector<size_t> controls(ORACLE1_QUBITS-1);
std::iota(controls.begin(), controls.end(), 0);
std::vector<bool> control_vals = ORACLE1_EXPECTED;

// Apply the X gate to the ancilla, controlled on the chosen bitstring
GateImplementationsLM::applyNCPauliX(sv.getData(),
sv.getNumQubits(),
controls,
control_vals,
{ORACLE1_QUBITS-1},
false);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of a way to have a function Oracle defined a single time and that we could re-use for the three test cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I originally didn't do this so I could support more complex oracles that might not follow the same pattern (for example, one which selects multiple states by using multiple smaller controlled nots rather than just doing a single all-qubit controlled not). After some discussion in the github issue thread, I decided not to pursue other oracles, but I left this pattern as such. Since I made the python version later on, I used a function which generates other functions there to condense everything. Something similar could be done in the C++ version using preprocessor commands, or even more simply by just having a single oracle function which takes the desired control qubits as a parameter.

Comment on lines 208 to 213
for (size_t reps = nreps; reps > 0; --reps) {
// Apply the oracle to apply a phase of -1 to desired state
oracle(sv);
// Perform amp-amp by reflecting over |+++...+>
groversMirror(sv);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to rewrite this with STL function(s)? What are the implications?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I'm not as familiar with C++ as I am with plain C, so I had to look into some STL options. It seems like std::for_each or std::for_each_n might be good to use here, but there isn't a natural collection to loop over. There might be a readability benefit to something like this, or an immutability guarantee for something with more functional style (although in this case, we are using these functions for their side effects). But honestly, I would be very surprised if there is anything that gets a speed advantage over a plain for loop for something like this, compiler optimization should be very good for this.

Comment on lines 221 to 225
for (size_t obs_wire=0; obs_wire < num_qubits - 1; ++obs_wire) {
NamedObs<StateVectorLQubitManaged<double>> obs("PauliZ", {obs_wire});
double result = Measurer.expval(obs);
common_result[obs_wire] = (result < 0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be rewritten in a functional (STL) way?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same disclaimer as the prior comment: To be honest, I'm not as familiar with C++ as I am with plain C, so I had to look into some STL options.

But after a quick glance through some fuilt-in functions. It looks like the transform function is the one for the job here. I rewrote that snippet (plus a few lines before it) as:

// Set up measurements
Measurements<StateVectorLQubitManaged<double>> Measurer(sv);

std::vector<size_t> wires(num_qubits - 1);
std::iota(wires.begin(), wires.end(), 0);
// Vector to store the most common measurement outcome
std::vector<bool> common_result(num_qubits - 1, false);

std::transform(wires.begin(), wires.end(), common_result.begin(), [&Measurer](size_t wire){
    NamedObs<StateVectorLQubitManaged<double>> obs("PauliZ", {wire});
    return Measurer.expval(obs) < 0;
});

I've written functional programs before, in languages like Rust, SML, and Lisp (even arguably some python with lambdas). I'm not as used to functional programming in C++, but it seems like it would be relatively easy to pick up.

Comment on lines 249 to 258
// Run experiment 1: 11010
std::cout << "Running Oracle 1. Expected: " << ORACLE1_EXPECTED << std::endl;
auto start_time = high_resolution_clock::now();

run_experiment(oracle1, ORACLE1_QUBITS);

const duration time_oracle1 = duration_cast<milliseconds>(
high_resolution_clock::now() - start_time);

std::cout << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you can rewrite these code blocks in terms of a loop over the three chosen oracles?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, the Python approach does this. I'm not sure why this didn't occur to me earlier, but I could have just done this as a loop over pairs of the oracle function pointers and the number of qubits needed for each statevector. That would have definitely been cleaner code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was just reading back over this thread and I'm realizing I probably misinterpreted this question as "can this be done" rather than "could you do this", I'll make some changes now and push them to my fork.

Copy link
Author

@jzaia18 jzaia18 Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I have gone and done this in the most recent push. Actually, I also made the change from your other comment and merged the 3 oracle functions into 1. See 5057894

@jzaia18
Copy link
Author

jzaia18 commented Nov 7, 2024

Thanks @AmintorDusko ! I think I replied to all of them thus far, let me know if you have anything else I can answer. Thanks for taking the time to read through everything and leave thoughtful feedback.

Copy link
Contributor

@AmintorDusko AmintorDusko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your quick answers. I have a few more questions. If my questions/suggestions make sense please try to implement them. Several CIs are falling. Would you be able to go over it and check why? [You don't need to fix it] In special Codefactor is complaining about formatting. Could you please placate their fury by implementing their suggestions?
Edit: Codecov to Codefactor

Comment on lines 21 to 36
// Define values to be selected by each oracle
#define ORACLE1_QUBITS (6)
/* Oracle 1 selects the string: "11010" */
#define ORACLE1_EXPECTED (std::vector<bool>{true, true, false, true, false})

#define ORACLE2_QUBITS (10)
/* Oracle 2 selects the string: "101010101" */
#define ORACLE2_EXPECTED (std::vector<bool>{true, false, true, false, true, \
false, true, false, true})

#define ORACLE3_QUBITS (17)
/* Oracle 2 selects the string: "0011001100110011" */
#define ORACLE3_EXPECTED (std::vector<bool>{false, false, true, true, false, \
false, true, true, false, false, \
true, true, false, false, true, \
true})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do this without macros?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I changed these to global consts in ff86a5e, is that what you meant? Or do you prefer to stay away from those as well and just define these directly in main? I know sometimes formatting guidelines are opinionated on whether or not globals are good or bad design, but I'm pretty neutral on them.

Comment on lines 110 to 111
* A 6-qubit test oracle for Grover's algorithm. Applies a pauli-X to the
* rightmost qubit if the leftmost 5 qubits are in the state |11010>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not up-to-date right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e466e97

Comment on lines 167 to 178
std::vector<size_t> wires(num_qubits - 1);
std::iota(wires.begin(), wires.end(), 0);
std::vector<bool> common_result(num_qubits - 1, false);

// Perform a "measurement" by taking the expected value of this qubit over multiple runs
std::transform(wires.begin(),
wires.end(),
common_result.begin(),
[&Measurer](size_t wire){
NamedObs<StateVectorLQubitManaged<double>> obs("PauliZ", {wire});
return Measurer.expval(obs) < 0;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you execute this same logic without creating and populating an extra vector wires?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed in ca6634c. Thanks for pointing these out, I did not realize my C++ STL knowledge had so many gaps, but this has been a great way to patch them up. The functional style is quite elegant.

Comment on lines 157 to 162
for (size_t reps = nreps; reps > 0; --reps) {
// Apply the oracle to apply a phase of -1 to desired state
oracle(sv, expected);
// Perform amp-amp by reflecting over |+++...+>
groversMirror(sv);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about looping forward instead of backward?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would also be completely reasonable. Originally nreps was inline and not its own variable (i.e reps = <the code that currently initializes nreps>), and I formulated the loop this way so that it wouldn't recompute nreps on every loop iteration (though the compiler would likely optimize this away if it was able anyway). In it's current state, forward vs backward iteration shouldn't make any difference, even including bad compiler optimization. Changed to forward iteration for readability in 0066b3b.

circ = qml.QNode(run_grovers, dev)

expvals = circ(oracle, num_qubits)
results = [int(val.numpy() < 0) for val in expvals]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I take your python file grover.py and run, as it is, in a fresh environment with Python 3.10, where I only installed requirements-dev.txt, I'm getting the following error message:

Traceback (most recent call last):
  File "/home/amintor/Projects/pennylane-lightning/prototypes/jzaia.py", line 114, in <module>
    run_experiment(*gen_oracle(0))
  File "/home/amintor/Projects/pennylane-lightning/prototypes/jzaia.py", line 91, in run_experiment
    results = [int(val.numpy() < 0) for val in expvals]
  File "/home/amintor/Projects/pennylane-lightning/prototypes/jzaia.py", line 91, in <listcomp>
    results = [int(val.numpy() < 0) for val in expvals]
AttributeError: 'float' object has no attribute 'numpy'

Would you know why?

@jzaia18
Copy link
Author

jzaia18 commented Nov 8, 2024

All Codefactor checks are passing now (I'm surprised pylint didn't already yell at me for some of its suggestions, I'll have to check my own config haha). The remaining CI failures are probably caused by some of my changes to the cmake files to get my standalone file to build. I'm unsure what else I would have done that would cause build failures in the primary part of the codebase. Digging in a bit I'm seeing cmake error messages like

CMake Error at CMakeLists.txt:130 (add_subdirectory):
  add_subdirectory given source "jzaia_files" which is not an existing
  directory.

so I'm pretty confident. Code coverage is also significantly down. My own code should have 0% coverage since it has no unit tests (it is itself a manual test, but the coverage tool wouldn't know that), but it seems build failures or something similar is also drastically reducing code coverage elsewhere. Will re-run the benchmarks now and should be posting them shortly. Let me know if there's anything else you'd like me to follow up on!

@jzaia18
Copy link
Author

jzaia18 commented Nov 8, 2024

I've rerun the benchmarks and posted the results below. Unfortunately, I'm on campus at the moment and away from my primary computer so these are run on my laptop. If a more direct comparison is desirable I can re-run the benchmarks again on my primary computer when I get home later today. These tests are being run on a laptop running Manjaro Linux, with an AMD Ryzen 5 4500U and 8GB of RAM. The oracles being run are the same ones as the prior test.

Custom C++ Implementation Results:

Time to run oracle 1: 0ms
Time to run oracle 2: 8ms
Time to run oracle 3: 20339ms

Top lines from perf:

Overhead  Command    Shared Object         Symbol
  26.24%  lq_grover  lq_grover             [.] std::complex<double> std::operator*<double>(double const&, std::complex<double> const&)
  15.72%  lq_grover  lq_grover             [.] Pennylane::LightningQubit::Gates::GateImplementationsLM::applyNCHadamard<double>(std::complex<double>*, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> > const&, std
  12.93%  lq_grover  lq_grover             [.] std::complex<double> std::operator-<double>(std::complex<double> const&, std::complex<double> const&)
  12.69%  lq_grover  lq_grover             [.] std::complex<double> std::operator+<double>(std::complex<double> const&, std::complex<double> const&)
  11.40%  lq_grover  lq_grover             [.] std::complex<double>::operator*=(double)
   8.08%  lq_grover  lq_grover             [.] std::complex<double>& std::complex<double>::operator-=<double>(std::complex<double> const&)
   7.49%  lq_grover  lq_grover             [.] std::complex<double>& std::complex<double>::operator+=<double>(std::complex<double> const&)
   2.64%  lq_grover  lq_grover             [.] void Pennylane::LightningQubit::Gates::GateImplementationsLM::applyNC1<double, double, Pennylane::LightningQubit::Gates::GateImplementationsLM::applyNCHadamard<double>(std::complex<double>*
   1.59%  lq_grover  lq_grover             [.] std::complex<double>::__rep() const
   0.71%  lq_grover  lq_grover             [.] Pennylane::Util::exp2(unsigned long const&)

Many of these lines seem the same as before, though their relative order has shifted. It's hard for me to be confident that this is a result of any changes made on account of running on different hardware. However, I would be surprised if any of the changes made resulted in a runtime difference, as they were mostly aesthetic or readability changes. The only reason I might expect a performance difference is if the refactor somehow allowed the compiler to make better optimizations.

Python Implementation using Pennylane's lightning.qubit Results:

Time to run oracle 1: 2ms
Time to run oracle 2: 9ms
Time to run oracle 3: 478ms

NOTE: This is being run with the pre-run on oracle 1 still enabled, and without running cProfile, which might degrade wall-clock time performance.

Top lines from cProfile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9    0.001    0.000    0.281    0.031 decorators.py:50(wrapper_entry)
        3    0.000    0.000    0.228    0.076 qnode.py:842(construct)
        3    0.000    0.000    0.197    0.066 grover.py:60(run_grovers)
9866/9300    0.011    0.000    0.190    0.000 capture_meta.py:81(__call__)
      283    0.006    0.000    0.162    0.001 grover.py:41(grovers_mirror)
        9    0.000    0.000    0.150    0.017 transform_program.py:492(__call__)
     9270    0.009    0.000    0.126    0.000 operation.py:1847(__init__)
     9270    0.024    0.000    0.117    0.000 operation.py:1110(__init__)
205092/205026    0.053    0.000    0.113    0.000 {built-in method builtins.isinstance}
        9    0.000    0.000    0.071    0.008 preprocess.py:294(decompose)
       12    0.003    0.000    0.071    0.006 {built-in method builtins.all}
    27729    0.007    0.000    0.068    0.000 preprocess.py:396(<genexpr>)
      566    0.002    0.000    0.063    0.000 controlled_ops.py:1093(__init__)
   167974    0.029    0.000    0.060    0.000 <frozen abc>:117(__instancecheck__)
    16637    0.006    0.000    0.059    0.000 wires.py:131(__init__)
        6    0.000    0.000    0.053    0.009 qnode.py:676(get_gradient_fn)
    13804    0.023    0.000    0.053    0.000 wires.py:44(_process)
        6    0.000    0.000    0.053    0.009 lightning_qubit.py:488(supports_derivatives)
        3    0.000    0.000    0.053    0.018 lightning_qubit.py:220(_supports_adjoint)
     9270    0.004    0.000    0.047    0.000 operation.py:1492(queue)
     9866    0.010    0.000    0.046    0.000 queuing.py:306(append)
    18480    0.008    0.000    0.045    0.000 lightning_qubit.py:234(_adjoint_ops)

This is also largely the same. In fact this should be expected since the Python code is more-or-less identical to before the recent changes.

Overall conclusion:
As before, the Python implementation outperforms the C++ implementation for sufficiently large circuits. I still suspect this is due to choosing better kernels for gate implementations. For smaller circuits, the construction time is slow enough that the C++ still manages to outperform the Python code, even despite using worse kernels for running the circuit. The benchmarking results are minimally impacted by the code changes, which should be expected since most of the code changes were focused on improving readability rather than optimizing performance.

Thanks again for taking the time to leave thoughtful feedback on this @AmintorDusko , I really appreciate it!

@AmintorDusko
Copy link
Contributor

@jzaia18, thank you for your work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Grover's Algorithm using Lightning-Qubit's C++ API
2 participants